TL;DR
Researchers have introduced EMO, a new mixture-of-experts model that learns to organize its experts into coherent, domain-specific groups during pretraining. EMO allows users to activate only a small subset of experts for specific tasks while maintaining near full-model performance. This development could improve model efficiency and flexibility.
AI researchers have released EMO, a new mixture-of-experts (MoE) model that learns to organize its experts into coherent, domain-specific groups during pretraining, enabling efficient and flexible deployment without relying on human-defined priors.
EMO is a mixture-of-experts model with 14 billion total parameters, about 1 billion of which are active per token, trained on 1 trillion tokens and designed to support selective expert use. Traditional MoEs often require predefined domain labels and degrade when only a subset of experts is used; EMO instead lets users activate just 12.5% of its experts while retaining near full-model performance. When all experts are used, EMO functions as a strong general-purpose model.
The key innovation lies in the training process: EMO routes experts at the level of document boundaries, constraining all tokens within a document to select experts from a shared subset. Concretely, the router averages per-token expert preferences across a document, and the most-preferred experts form that document's shared pool. Because different documents favor different experts, coherent domain-specific expert groups emerge directly from the training data, without manual domain labels or other human-defined priors.
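To make the mechanism concrete, here is a minimal sketch of document-level routing, assuming a standard softmax router over experts. The function and parameter names (`route_document`, `pool_size`, `top_k`) are illustrative assumptions, not EMO's released code:

```python
# Minimal sketch (not the released implementation) of document-level routing:
# every token in a document is restricted to a shared expert pool chosen by
# averaging the router's per-token preferences over that document.
import torch
import torch.nn.functional as F

def route_document(hidden: torch.Tensor,    # [T, d] token states for ONE document
                   router_w: torch.Tensor,  # [d, E] router projection
                   pool_size: int = 8,      # experts the document may use (assumed)
                   top_k: int = 2):         # experts each token activates (assumed)
    logits = hidden @ router_w                   # [T, E] per-token expert affinities
    probs = F.softmax(logits, dim=-1)            # per-token expert preferences
    doc_pref = probs.mean(dim=0)                 # [E] averaged over the document
    pool = doc_pref.topk(pool_size).indices      # shared pool for this document

    # Mask out experts outside the pool, then do ordinary per-token top-k routing.
    mask = torch.full_like(logits, float("-inf"))
    mask[:, pool] = 0.0
    weights, experts = F.softmax(logits + mask, dim=-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate values
    return experts, weights, pool

# Example: two random "documents" can end up with distinct expert pools.
torch.manual_seed(0)
router_w = torch.randn(64, 32)                   # d=64 model dim, E=32 experts
for _ in range(2):
    doc = torch.randn(10, 64)                    # 10 tokens in this document
    experts, weights, pool = route_document(doc, router_w)
    print("pool:", sorted(pool.tolist()))
```

The design intuition is that documents from the same domain should repeatedly converge on overlapping pools, which is what lets whole expert groups specialize.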
Why It Matters
This development matters because it addresses a major limitation of existing MoE models: the inability to reliably activate only relevant experts for a specific task or domain, which hampers efficiency and flexibility. EMO’s emergent modularity can reduce computational costs and improve model adaptability, enabling more practical deployment of large language models in resource-constrained environments. It also opens avenues for models that can dynamically organize themselves into specialized groups based on data, reducing reliance on human-defined labels and biases.

Background
Traditional large language models are monolithic, making it difficult to adapt or deploy them efficiently for specific tasks. Mixture-of-experts models have been proposed as a solution, but they often require predefined domain labels and show performance degradation when only a subset of experts is used. Previous approaches such as BTX and FlexOlmo incorporate domain routing, but they depend on human-defined partitions of the training data into domains. EMO builds on this work by enabling emergent modularity through a training strategy that uses document boundaries as weak supervision, allowing the model to develop domain-specific expert groups naturally during pretraining.
“EMO demonstrates that emergent modularity can be achieved during pretraining without human-defined priors, enabling flexible and efficient deployment.”
— Lead researcher from AllenAI
“By constraining tokens within documents to share expert pools, EMO encourages the formation of domain-specific groups that can be selectively activated.”
— AI researcher involved in the project

What Remains Unclear
It is still unclear how well EMO’s emergent modularity generalizes across different datasets and real-world applications. The long-term stability of the expert groups and their interpretability remain to be validated. Additionally, the impact on downstream task performance and efficiency gains need further empirical testing in diverse deployment scenarios.

What’s Next
Further research will evaluate EMO’s performance across various tasks and domains, exploring how well the emergent modules facilitate transfer learning and domain adaptation. Developers are expected to experiment with different routing strategies and scaling to larger models. Additional benchmarks and real-world deployment tests are anticipated in upcoming studies.

Key Questions
How does EMO differ from traditional MoE models?
Unlike traditional MoEs that rely on predefined domain labels and often activate all experts, EMO uses a training mechanism based on document boundaries to encourage the emergence of domain-specific expert groups, enabling selective expert activation without manual labels.
Can EMO’s experts be interpreted or understood?
Emergent modularity suggests that experts may organize into meaningful groups, but their interpretability depends on further analysis. Currently, the focus is on performance and modularity emergence, with interpretability being an area for future research.
What are the practical benefits of EMO’s modularity?
EMO allows deploying only relevant subsets of experts for specific tasks, reducing computational costs and memory usage while maintaining high performance, making large models more accessible and adaptable.
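As a rough illustration of what such subset deployment could look like, the sketch below prunes a hypothetical MoE layer down to a chosen expert pool. The checkpoint layout and names (`experts`, `router_w`, `keep`) are assumptions for illustration, not EMO's actual API:

```python
# Hypothetical sketch of subset deployment: keep only the experts in a chosen
# domain pool and shrink the router projection to match.
import torch

def prune_moe_layer(experts: torch.nn.ModuleList,  # E expert FFNs
                    router_w: torch.Tensor,        # [d, E] router projection
                    keep: list[int]):              # expert ids to retain
    kept = torch.nn.ModuleList(experts[i] for i in keep)
    new_router = router_w[:, keep].clone()         # [d, len(keep)]
    return kept, new_router

# Example: keep 4 of 32 experts (12.5%), cutting expert memory roughly 8x.
d, E = 64, 32
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(E))
router_w = torch.randn(d, E)
kept, new_router = prune_moe_layer(experts, router_w, keep=[3, 7, 19, 28])
print(len(kept), new_router.shape)                 # 4, torch.Size([64, 4])
```

At inference the gate's softmax is then taken over only the retained columns, so tokens never route outside the deployed pool.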
Is EMO available for use now?
Yes, the researchers have released the model, code, and visualization tools publicly, enabling further experimentation and development.