This page details Uni-MoE 2.0, a significant evolution of our original Uni-MoE 1.0 model. The previous version explored the use of Mixture of Experts (MoE) for unified multimodal language modelling, demonstrating its effectiveness across diverse modalities such as text, audio, speech, images, and video.
Uni-MoE 2.0 builds on this foundation. Rebuilt from scratch on the more powerful Qwen2.5-7B core, it introduces new architectural designs and training paradigms. Major improvements include a unified speech encoder, context-aware MoE-TTS, deep cross-modal alignment via 3D RoPE, and advanced MoE fusion strategies with a refined training recipe.
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal model, it substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we train Uni-MoE 2.0 from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data-matching technique. The model handles cross- and tri-modality understanding and can generate images, text, and speech.
Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive SFT strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilize RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues.
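To make the expert design concrete, here is a minimal PyTorch sketch of an MoE layer that combines shared, routed, and null experts. The class names, expert counts, hidden sizes, and dense dispatch below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class FFNExpert(nn.Module):
    """A plain feed-forward expert block."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DynamicCapacityMoE(nn.Module):
    """Sketch of an MoE layer with shared, routed, and null experts.

    Shared experts run on every token; routed experts are chosen top-k by a
    router; null experts are parameter-free slots whose output is zero, so the
    router can spend less capacity on easy tokens. Expert counts and sizes
    here are placeholders.
    """

    def __init__(self, d_model=1024, d_hidden=4096,
                 n_shared=1, n_routed=8, n_null=2, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList([FFNExpert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([FFNExpert(d_model, d_hidden) for _ in range(n_routed)])
        self.n_routed, self.top_k = n_routed, top_k
        # The router scores routed + null slots; null slots carry no parameters.
        self.router = nn.Linear(d_model, n_routed + n_null, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [batch, seq, d_model]
        out = sum(expert(x) for expert in self.shared)      # dense, always-on path
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            w, chosen = weights[..., k:k + 1], idx[..., k]
            for e_id, expert in enumerate(self.routed):
                mask = (chosen == e_id).unsqueeze(-1)        # tokens routed here
                if mask.any():
                    # Dense compute for clarity; real kernels dispatch only the
                    # selected tokens to each expert.
                    out = out + mask * w * expert(x)
            # Selections with index >= n_routed hit a null expert and add nothing.
        return out
```

The null slots are what give the layer its dynamic capacity: when the router sends one of a token's top-k selections to a null expert, that token consumes proportionally less expert computation.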
Comprehensive evaluation across 85 multimodal benchmarks shows that Uni-MoE 2.0 further pushes the boundaries of omnimodal intelligence. Compared with leading omnimodal models such as Qwen2.5-Omni, Ming-Lite-Omni, Baichuan-Omni, and MiniCPM-o 2.6, as well as task-specific models, our model achieves superior results mainly in video understanding and reasoning (averaging +4% over Ming-Lite-Omni-1.5 on 8 benchmarks), omnimodality comprehension (averaging +7% over Qwen2.5-Omni on 4 benchmarks, including OmniVideoBench and WorldSense), long speech understanding (3.5% lower WER than Qwen2.5-Omni on LibriSpeech-clean/other-long) and generation (1% lower WER on TinyStories-en), and audio-visual tasks (averaging +4% over Qwen2.5-Omni on Speech-Image QA).
Furthermore, Uni-MoE 2.0 delivers competitive performance on most image generation and editing tasks, outperforming strong generative models in image editing and low-level image processing, e.g., +0.5% over Ming-Lite-Omni on GEdit-Bench and surpassing Qwen-Image and PixWizard on 6 metrics.
Figure 1: Architecture of Uni-MoE 2.0
The Uni-MoE 2.0 architecture processes multimodal data through a unified tokenization strategy. Its main architectural components are:
- A unified speech encoder for speech and general audio inputs.
- Dynamic-capacity MoE layers combining shared, routed, and null experts to balance computational efficiency and capability across cross-modal inputs.
- Omni-Modality 3D RoPE for spatio-temporal cross-modal alignment in the self-attention layers (sketched after this list).
- A context-aware MoE-TTS module and dedicated generation tokens for speech and image generation.
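The following sketch illustrates the general idea behind a 3D rotary position embedding: the rotary dimensions of each attention head are split across (time, height, width) axes so that text, image, video, and audio tokens share one positional space. The axis split, position assignment, and function names are our own assumptions; the released Omni-Modality 3D RoPE may differ in these details.

```python
import torch


def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies for one positional axis (dim must be even)."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))


def omni_3d_rope_angles(t_pos, h_pos, w_pos, head_dim: int) -> torch.Tensor:
    """Build per-token rotation angles from (time, height, width) positions.

    Illustrative split: half of the rotary dims for time, a quarter each for
    height and width.
    """
    d_t, d_h = head_dim // 2, head_dim // 4
    d_w = head_dim - d_t - d_h
    ang_t = t_pos[:, None].float() * rope_frequencies(d_t)   # [seq, d_t / 2]
    ang_h = h_pos[:, None].float() * rope_frequencies(d_h)
    ang_w = w_pos[:, None].float() * rope_frequencies(d_w)
    return torch.cat([ang_t, ang_h, ang_w], dim=-1)          # [seq, head_dim / 2]


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate interleaved pairs of a [seq, head_dim] query/key tensor."""
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

Under one common assignment for such schemes (again an assumption here), a text token at sequence index i uses i for all three axes, an image patch shares one temporal index per frame while taking height/width from its patch grid location, and audio frames advance only along the temporal axis.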
To obtain a powerful and comprehensive omnimodal large model, the training recipe, together with the data-matching approach, includes:
- Cross-modal pretraining on roughly 75B tokens of open-source multimodal data.
- A progressive SFT strategy that activates modality-specific experts, supported by balanced data composition.
- An iterative GSPO-DPO reinforcement stage to stabilize RL training and improve reasoning (the two objectives are sketched after this list).
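To make the reinforcement stage concrete, below is a minimal sketch of the two objectives that an iterative GSPO-DPO scheme alternates between. The formulas follow the published DPO and GSPO losses; the tensor-shaped signatures, hyperparameter defaults, batching, and round-by-round schedule used in Uni-MoE 2.0 are not shown and should be treated as assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective on sequence log-probabilities of (chosen, rejected) pairs,
    evaluated under the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()


def gspo_loss(logp_new, logp_old, advantages, seq_lens, clip_eps=0.2):
    """GSPO-style clipped policy objective with a sequence-level,
    length-normalized importance ratio (rather than PPO's per-token ratios).
    `advantages` are group-normalized rewards of sampled responses."""
    ratio = torch.exp((logp_new - logp_old) / seq_lens)   # geometric-mean ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```

In an iterative scheme of this kind, GSPO updates on fresh rollouts and DPO updates on preference pairs mined from those rollouts can be alternated round by round, which is the stabilizing role the training recipe attributes to GSPO-DPO.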
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Vicuna, BEATs, and Whisper.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, BEATs, Whisper, LLaMA, and Vicuna. The dataset and models trained with it should not be used outside of research purposes.