With DeepSeek-V3 and R1, there’s been a resurgence of talk about sparse Mixture of Experts (MoE) models, including many misconceptions. This architecture is not new, but it is still widely misunderstood by laypeople. Here’s an excerpt from an article I wrote last year about Mistral AI’s Mixtral 8x7B model. It’s a little disjointed because the original article was about model merging (the craze at the time).

tl;dr: sMoE models like DeepSeek V3/R1 can run faster and on cheaper hardware because the model’s weights are divided into groups of experts, and only a few of those groups are activated for each token. Fewer activated parameters means fewer calculations per predicted token.
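
To make “fewer parameters activated” concrete, here is a rough back-of-the-envelope sketch. The function and every number in it are illustrative assumptions (loosely Mixtral-shaped), not exact figures for any released model.

# Rough estimate of total vs. per-token active parameters in an sMoE model.
# All numbers below are illustrative assumptions, not published figures.

def moe_param_estimate(shared_params, expert_params, n_experts, top_k):
    """shared_params: parameters every token uses (attention, embeddings, ...)
    expert_params: parameters in one expert (summed across layers)
    n_experts:     experts per MoE layer
    top_k:         experts the router activates per token"""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active

# Hypothetical 8-expert, top-2 configuration:
total, active = moe_param_estimate(
    shared_params=5e9,      # assumed non-expert (always-on) parameters
    expert_params=5.25e9,   # assumed parameters per expert
    n_experts=8,
    top_k=2,
)
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
# -> total: 47.0B, active per token: 15.5B (illustrative only)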

This speed and efficiency come with tradeoffs: while sMoE models activate only a fraction of their parameters per token, they generally don’t outperform dense models with the same total parameter count. Their strength lies in scaling large models efficiently, not in superior per-parameter performance.

History

Until recently, users running local LLMs had a choice between smaller, faster, less resource-intensive but less capable models like Mistral 7B and Llama 2 13B, and larger, slower, more resource-intensive but more powerful models like Llama 2 70B. Now, Mistral’s latest offering, Mixtral, attempts to bridge the gap between these two extremes using an architecture known as sparse Mixture of Experts (sMoE), which can generate better output than the smaller models while requiring less expensive hardware than the larger ones.

Although its implementation in popular LLMs is recent, the concept of MoE isn’t new: the first paper on the subject, ‘Adaptive Mixtures of Local Experts’, was published in 1991. The architecture involves training a special model, or router, and several discrete submodules, known as experts, simultaneously and on the same dataset.

Experts are distinct submodules—typically MLPs—within a shared model architecture. The router selects a small number of them to run per token. They’re not fully independent models, but they function somewhat like smaller, specialized units within the larger system, each contributing in different ways depending on the input.
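
Here is a minimal sketch of what that looks like in code, assuming a PyTorch-style setup. The class name, sizes, and top-2 routing below are illustrative choices, not the implementation used by any particular model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is just a small MLP living inside the shared layer.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router is a tiny linear layer that scores every expert per token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.router(x)                  # (n_tokens, n_experts)
        scores, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)      # mixture weights for the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(4, 512)                          # a batch of 4 token representations
print(SparseMoELayer()(x).shape)                 # torch.Size([4, 512])

Only the selected experts run for a given token, which is where the compute savings come from; the unselected experts’ weights still sit in memory as part of the same model.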

This approach allows computational resources to be allocated efficiently: each discrete but connected “expert” learns both shared general knowledge and more specialized patterns, while the router selects the appropriate experts to activate for each token and then synthesizes their activations into a unified response.
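
As a worked toy example of that per-token selection and synthesis (every number below is made up for illustration):

import math

router_logits = [1.2, -0.3, 0.8, 2.0]   # one routing score per expert (4 experts)

# Top-2 routing: keep the two highest-scoring experts for this token.
top2 = sorted(range(4), key=lambda e: router_logits[e], reverse=True)[:2]   # [3, 0]

# Softmax over just the selected scores gives the mixture weights.
exp_scores = [math.exp(router_logits[e]) for e in top2]
weights = [s / sum(exp_scores) for s in exp_scores]     # roughly [0.69, 0.31]

# Pretend each chosen expert produced a 2-dimensional output for this token.
expert_outputs = {3: [0.5, -1.0], 0: [2.0, 0.4]}

# The layer's output is the weighted sum of the chosen experts' outputs.
combined = [sum(w * expert_outputs[e][d] for w, e in zip(weights, top2)) for d in range(2)]
print(top2, [round(w, 2) for w in weights], [round(c, 2) for c in combined])
# -> [3, 0] [0.69, 0.31] [0.97, -0.57]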

Misconceptions

It may sound like each expert is trained in a distinct knowledge area, such as language, math, logic, facts, or creative content, or that each expert in an MoE model is trained separately on specific data. This is a common misconception, even within the hobbyist AI community.

In reality, the router and all of the experts are trained concurrently on the same data; no expert gets its own dataset or domain. But because the router chooses different experts for different tokens during training, they end up focusing on different patterns. It’s similar to a classroom where all students are taught the same material, but each absorbs and retains different parts depending on their strengths or interests.
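
A minimal sketch of that joint training step, again assuming a PyTorch-style setup; the sizes, the toy regression objective, and the use of dense softmax mixing instead of top-k routing are simplifications for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_experts = 16, 4
router = nn.Linear(d, n_experts)
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])

# One optimizer over *all* parameters: the router and experts update together.
opt = torch.optim.Adam(list(router.parameters()) + list(experts.parameters()), lr=1e-3)

x = torch.randn(32, d)        # one shared batch of "tokens" seen by everything
target = torch.randn(32, d)   # toy regression target

weights = F.softmax(router(x), dim=-1)                     # (32, n_experts)
expert_out = torch.stack([e(x) for e in experts], dim=-1)  # (32, d, n_experts)
y = (expert_out * weights.unsqueeze(1)).sum(dim=-1)        # router-weighted mixture

loss = F.mse_loss(y, target)  # a single shared loss...
loss.backward()               # ...sends gradients to the router AND every expert
opt.step()
print(router.weight.grad.abs().sum() > 0)   # tensor(True): the router learns too

In practice, MoE training usually also adds an auxiliary load-balancing term so the router doesn’t collapse onto a few favorite experts, but the core point stands: one dataset, one loss, everything trained together.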

So while no expert is assigned a specific topic like “math” or “language,” some naturally become better at handling certain types of input through repeated exposure. Their specialization is not in separate domains, as commonly perceived, and there is considerable (and in fact necessary) overlap in what they learn.

After training, the teacher (router) and all of the graduated students (experts) together form the model. The router (now in the role of a team leader) knows which experts have gained expertise in particular areas. During inference, it selects the most appropriate experts for each token and integrates their outputs before producing the final output.

However, this is an oversimplification. Specialization in MoE models often relates more to processing efficiency for certain data patterns than to discrete domain expertise. Removing experts and replacing them with ones not involved in the original training disrupts this synergy: the newcomers’ knowledge doesn’t complement the existing experts, and the router cannot effectively determine the best experts to consult.

Consider the analogy of a sports coach training a team. The coach and players, trained together, possess overlapping knowledge and specific skills for their positions. If the coach and some players are suddenly replaced with individuals from different sports, the cohesion is lost. The new coach wouldn’t know how to best allocate players, and the players would be unfamiliar with the game’s rules. Even if the coach adapts to the new players’ strengths and weaknesses, the team won’t function as cohesively as before.

Like the teacher-and-student analogy, this too is an oversimplification meant to illustrate the collaborative nature of MoE models. In reality, the router functions more like a gatekeeper or selector, deciding which experts are best suited for the task at hand based on their training and specialties.

Experts in an MoE model are not separate entities with distinct domain knowledge. Instead, the MoE architecture optimizes how tokens are routed among the experts: related patterns tend to be grouped together, which reduces computational demands while maintaining performance, rather than dividing knowledge into specific domains. MoE models also scale more efficiently in size and complexity, enabling the training of larger models than would be feasible with traditional, dense architectures.

Key Takeaways

  • An MoE router and experts are trained together and on the same datasets, not separately or on different datasets.
  • Experts form an integral cohesive unit, rather than being separate entities with distinct domain knowledge.
  • Expert specialization emerges from experience, not predetermined designation.
  • The router is a critical component of the system, trained in tandem with the experts.

Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of experience in enterprise IT, application development, consulting, and technical communication.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits