Uni-MoE 2.0:

Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

HITSZ-TMG: Text and Multimodal Generative Intelligence Group

🚀 From Uni-MoE 1.0 to Uni-MoE 2.0

This page details Uni-MoE 2.0, a significant evolution of our original Uni-MoE 1.0 model. The previous version explored the use of Mixture of Experts (MoE) for unified multimodal language modelling, demonstrating its effectiveness across diverse modalities such as text, audio, speech, images, and video.

Uni-MoE 2.0 builds on this foundation: rebuilt from scratch on the more powerful Qwen2.5-7B core, it introduces new architectural designs and training paradigms. Major improvements include a unified speech encoder, context-aware MoE-TTS, deep cross-modal alignment via 3D RoPE, and advanced MoE fusion strategies with a refined training recipe.

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal model, it substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we train Uni-MoE 2.0 from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with iterative reinforcement learning, and a carefully curated multimodal data-matching technique. It is capable of cross- and tri-modality understanding, as well as generating images, text, and speech.

Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive SFT strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilize RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues.


A comprehensive evaluation of Uni-MoE 2.0 across 85 multimodal benchmarks shows that it further pushes the boundaries of omnimodal intelligence. Compared with leading omnimodal models such as Qwen2.5-Omni, Ming-Lite-Omni, Baichuan-Omni, and MiniCPM-o 2.6, as well as task-specific models, our model achieves superior results mainly in video understanding and reasoning (averaging +4% over Ming-Lite-Omni-1.5 on 8 benchmarks), omnimodality comprehension (averaging +7% over Qwen2.5-Omni on 4 benchmarks, including OmniVideoBench and WorldSense), long speech understanding (3.5% lower WER than Qwen2.5-Omni on LibriSpeech-clean/other-long) and generation (1% lower WER on TinyStories-en), and audio-visual tasks (averaging +4% over Qwen2.5-Omni on Speech-Image QA).

Furthermore, Uni-MoE 2.0 delivers competitive performance on most image generation and editing tasks, outperforming strong generative models in image editing and low-level image processing, e.g., +0.5% over Ming-Lite-Omni on GEdit-Bench, and surpassing Qwen-Image and PixWizard on 6 metrics.

Architecture



Figure 1: Architecture of Uni-MoE 2.0


Model Architecture

The Uni-MoE 2.0 architecture processes multimodal data through a unified tokenization strategy. It consists of the following architectural components:

  • Unified Modality Encoding & Generation: We design a unified speech encoder that maps diverse audio inputs, including environmental sound, speech, and music, into a shared representation space. For output, a context-aware MoE-based TTS module supports dynamic speech synthesis (especially for long speech) and interaction. On the visual side, we employ pre-trained visual encoders to process images and videos, and build task-aware diffusion transformers for instruction-guided image generation and editing.
  • Deep Cross-Modal Alignment: To enable deep and efficient fusion of any modality, we introduce an Omni-Modality 3D RoPE mechanism in the self-attention layers. It encodes the temporal-spatial dimensions of speech, image, text, and video tokens, ensuring seamless alignment and interaction across all input types (a position-id sketch follows this list).
  • MoE-Driven Cross-Modal Fusion: We strategically extend the standard MLP layers to MoE layers. This new MoE architecture incorporates three expert types: null experts for inference-time computation skipping, modality-specific routed experts for storing modality knowledge and processing cross-modal information, and small-size shared experts to facilitate universal information exchange. This design enables efficient computation, specialized modality handling, and effective, stable multimodal fusion (a layer sketch also follows the list).
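
For intuition, here is a tiny, self-contained sketch of how (temporal, height, width) position ids of this kind can be assigned. It is loosely modeled on published multimodal RoPE schemes; the exact axis assignment used by Uni-MoE 2.0's Omni-Modality 3D RoPE may differ, so treat the layout below as an assumption.

```python
import torch

def omni_3d_position_ids(n_text, img_hw, n_audio):
    """Illustrative (t, h, w) position ids for a [text, image, audio] sequence.
    1-D streams (text, audio) advance all three axes together, while image
    patches share one temporal step and vary along height/width, exposing
    their 2-D layout to self-attention."""
    ids, pos = [], 0
    for _ in range(n_text):            # text: behaves like ordinary 1-D RoPE
        ids.append((pos, pos, pos))
        pos += 1
    H, W = img_hw                      # image: one temporal slot, spatial grid
    t_img = pos
    for h in range(H):
        for w in range(W):
            ids.append((t_img, t_img + h, t_img + w))
    pos = t_img + max(H, W)            # resume after the image's span
    for _ in range(n_audio):           # audio: temporal-only stream
        ids.append((pos, pos, pos))
        pos += 1
    return torch.tensor(ids)           # shape: (sequence_length, 3)

print(omni_3d_position_ids(4, (2, 3), 2))
```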
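
Likewise, the following PyTorch sketch shows one way to wire a layer with a small always-on shared expert, top-k routed experts, and null experts that skip computation when selected. Class names, dimensions, and routing details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A standard SwiGLU feed-forward expert."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class OmniMoELayer(nn.Module):
    """Illustrative MoE layer with shared, routed, and null experts."""
    def __init__(self, d_model=3584, d_routed=4096, d_shared=512,
                 n_routed=6, n_null=2, top_k=2):
        super().__init__()
        self.shared = FFNExpert(d_model, d_shared)    # small, always active
        self.routed = nn.ModuleList(
            FFNExpert(d_model, d_routed) for _ in range(n_routed))
        # The router scores routed plus null experts; a null selection adds
        # nothing to the output, so that slot's FFN computation is skipped.
        self.router = nn.Linear(d_model, n_routed + n_null, bias=False)
        self.n_routed, self.top_k = n_routed, top_k

    def forward(self, x):                             # x: (num_tokens, d_model)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        out = self.shared(x)                          # universal exchange path
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed):
                hit = idx[:, slot] == e
                if hit.any():
                    out[hit] = out[hit] + weights[hit, slot].unsqueeze(-1) * expert(x[hit])
            # Indices >= n_routed correspond to null experts: no work is done.
        return out

layer = OmniMoELayer()
print(layer(torch.randn(16, 3584)).shape)             # torch.Size([16, 3584])
```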

Training Recipe

To obtain a powerful and comprehensive omnimodal large model, the training recipe, combined with the data-matching approach, includes:

  • Language-Centric Cross-Modal Pretraining: We pretrain the LLM on a corpus of paired modality-to-language data (e.g., image-text, audio-text, video-text). This phase teaches the model to encode and interpret diverse modal inputs by projecting them into a shared semantic space aligned with language.
  • Progressive SFT with Modality Experts: We employ a progressive SFT strategy using modality-specific experts (grouped into audio, vision, and text categories). To enable conditional generation, we introduce special tokens that condition output generation on linguistic cues. This allows the model to learn complex tasks like text-conditioned speech synthesis and image generation directly within the SFT stage.
  • Data-Balanced Annealing: After large-scale SFT, we introduce an annealing phase where we carefully balance the data mixture across all modalities and tasks. This phase uses a reduced learning rate to gently refine the model's performance, ensuring no single modality or task dominates the final model behavior, leading to consistently superior results on omnimodal benchmarks.
  • Iterative Policy Optimization: To activate long-form reasoning while ensuring training stability for our MoE model, we introduce an iterative GSPO-DPO training method. This approach mainly uses the LLM as a judge to evaluate rollouts, automatically constructing high-quality preference pairs for online learning without causing training collapse (a minimal sketch of one such iteration follows this list).
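
To make the last step concrete, here is a minimal sketch of one judge-driven preference round paired with a standard DPO loss. The generation and judging interfaces (policy.generate, judge.rank) are hypothetical placeholders, and the GSPO policy-gradient step is omitted; only the DPO objective is the standard formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective over sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and a frozen reference model."""
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def build_preference_pairs(policy, judge, prompts, n_rollouts=4):
    """One online round: sample several rollouts per prompt, let the LLM
    judge rank them, and keep (best, worst) as a preference pair."""
    pairs = []
    for prompt in prompts:
        rollouts = [policy.generate(prompt) for _ in range(n_rollouts)]
        ranked = judge.rank(prompt, rollouts)   # hypothetical judge API
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs
```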


Demonstrations

• Example-1: Visual Math Reasoning

User:
[Image: Visual Math Bar Chart]

Question:
Find out the average of the bottom two countries?

Golden:
51.04

Model:
<think> Alright, let me figure this out. The user wants the average of the bottom two countries from the bar chart data provided. First, I need to identify which countries are at the bottom. The percentages listed are: Egypt (93.45%, 2002), Tunisia (88.89%, 2009), Madagascar (58.09%, 2012), and Mozambique (44%, 2012). Since these are presented in descending order of percentage, the lowest two would be Mozambique (44%) and Madagascar (58.09%). Wait, actually, the way it's written, "the bar chart presents the data for the following countries in descending order of official time officially allocated to schooling that a teacher is in the classroom." So, Egypt is highest, then Tunisia, then Madagascar, then Mozambique? Wait, no, because Madagascar's percentage is lower than Tunisia's? Wait, let me check again. The order given is Egypt first with 93.45%, Tunisia next at 88.89%, Madagascar at 58.09%, and Mozambique at 44%. But Madagascar's percentage is lower than Tunisia's. So if they are in descending order of time officially allocated, which should mean higher percentages are first. But Madagascar has a lower percentage (58.09%) than Tunisia's (88.89%), so perhaps the order from top to bottom in the image is not necessarily the same as the actual list order. Wait, the image illustrates "percentages of time officially allocated to schooling that a teacher is in the classroom" and notes that some estimates correspond to selected regions, implying they are national averages. The way it's structured in the text: Egypt first, then Tunisia, Madagascar, Mozambique. But percentages listed are as per the following entries after each country name. The last two, Madagascar and Mozambique, both have 2012 data: 58.09% and 44%. These are the lowest ones. Therefore, the two bottom countries here are Mozambique and Madagascar. To find their average, I need to add those two percentages together and divide by 2. So, 44% + 58.09% equals 102.09%. Then divide by 2: 102.09 / 2 = 51.045. Wait, but percentages can sometimes be confusing. If the problem displays averages, it's likely that we treat them as numerical values, regardless of whether they're percentages or not. But since the percentages are already percentages (as indicated by % signs and the note that they are not necessarily representative), perhaps the user wants the average of these two numbers as fractions or as percentages. Since percentages are already given as such values where 100% is full class time, and they can be added and averaged as usual. So 44 + 58.09 = 102.09. Then average is 102.09 /2 = 51.045. Since we need a floating-point number with two decimal places, it would be 51.04. Wait, but that's if we take the percentages as is. Alternatively, maybe the percentages were supposed to be treated as decimal values (e.g., 0.44 and 0.5809?), but that's unclear. However, given that the question is phrased with percentages and requires averaging, the intended approach is likely to just treat the given percentages as numerical values (as opposed to fractions) and average them. Therefore, adding 44% and 58.09% gives 102.09, then dividing by 2 gives 51.045. Rounded to two decimal places, that's 51.04. Hence, the average is 51.04. </think>

In summary, the correct answer is:

51.04
User:
[Image: Visual Math Angle Calculation]

Question:
Answer the question in the image

Golden:
D

Model:
<think> 
Okay, let's tackle this problem step by step. So, we have a diagram with angles labeled as 1 to 4. Angle 4 is formed by the extension of side \\( c \\) beyond point \\( d \\), making it an external angle. The given angles are \\( \\angle 1 = 100^\\circ \\) at vertex \\( a \\), \\( \\angle 2 = 100^\\circ \\) at vertex \\( b \\), and \\( \\angle 3 = 125^\\circ \\) adjacent to angle 4. Also, there's a mention of a line extending through the diagram forming straight lines with angles related to \\( \\angle 4 \\).

First, since angle 4 is a linear pair with angle 3, I remember that linear pairs add up to 180 degrees. So maybe angle 4 can be calculated by subtracting angle 3 from 180. Let me check that: \\( \\angle 4 + \\angle 3 = 180^\\circ \\). They said angle 3 is 125°, so angle 4 should be \\( 180 - 125 = 55^\\circ \\). That matches choice D: 55°.

But wait, let me make sure I'm not mixing up anything. Sometimes diagrams can be tricky if there are multiple angles or intersecting lines. But according to the image, angle 4 is clearly stated as \\( 180^\\circ - \\angle 3 \\), which confirms it's a linear pair. So subtraction gives 55°, so D is the answer.

But the problem mentions extensions and the presence of some common angles like 80°, 65°, and 60°. Wait, maybe the other angles (1 and 2) are 100 each? Are they part of a triangle or another figure?

Looking again at the problem: Angles 1 and 2 are both 100°. If angles 1 and 2 are at vertices a and b, perhaps we have a quadrilateral? Or maybe two pairs of vertical angles? If both a and b are 100°, maybe they are opposite angles. In a quadrilateral, opposite angles can be equal, but in a rhombus, all angles are equal. However, here only two are 100°, so maybe just coincidentally placed.

But the main point here is about angle 4. Since it's a linear pair with 125°, angle 4 is 55°, which is D. The other choices might be distractors. For example, 80° isn't related here; 65° might come if we combine angles, but there's no info about that. 60° could be another angle, but again, not mentioned. So yeah, 55° makes sense.
</think>

To determine the measure of angle 4, we note that it forms a linear pair with angle 3 (adjacent angles on a straight line sum to \(180^\circ\)). Given \( \angle 3 = 125^\circ \):

\[ \angle 4 = 180^\circ - \angle 3 = 180^\circ - 125^\circ = 55^\circ. \]
The other angles (1 and 2) at vertices \(a\) and \(b\) being \(100^\circ\) do not affect the calculation for angle \(4\). No additional information suggests angle \(7\) (external angle corresponding to \(b\)) would be \(80^\circ\), \(65^\circ\), or \(60^\circ\). The linear pair relationship directly relates to angle \(4\), confirming the answer is:

D
User:
[Image: Visual Math Angle Calculation]

Question:
Answer the question in the image

Golden:
B

Model:
<think>
Okay, let me see. The problem is about finding the measure of angle AOB given that points A, B, and C are on a circle with center O. Angle ACB is 35 degrees. Hmm. I remember there's a relationship between central angles and inscribed angles that subtend the same arc. Let me think.

First, angle ACB is an inscribed angle, right? Inscribed angles are half the measure of the central angle that subtends the same arc. So if angle ACB is 35 degrees, which subtends arc AB, then the central angle AOB subtending the same arc AB must be twice that. So 35 times 2 is 70 degrees. Therefore, angle AOB should be 70 degrees. That corresponds to choice B. Wait, but let me make sure I'm not missing something here.

Wait, is angle ACB actually subtending arc AB? Let me visualize the circle. Points A and B are on the circumference, and C is another point also on the circumference. If angle ACB is formed by chords AC and BC, then yes, it's an inscribed angle that subtends arc AB. The central angle AOB is at the center, also subtended by arc AB. So by the central angle theorem, the central angle is twice the inscribed angle. So 35 times 2 is indeed 70. So the answer should be B. Let me check the choices again: A is 35, B is 70, so B is correct. I think that's right.
</think>

The problem involves a circle with center \( O \). Points \( A \), \( B \), and \( C \) lie on the circumference. Given that \( \angle ACB = 35^\circ \), we need to find the measure of the central angle \( \angle AOB \).

Key observations:
1. The inscribed angle \( \angle ACB \) subtends the arc \( AB \).
2. The central angle \( \angle AOB \) also subtends the same arc \( AB \).
3. The central angle is twice the measure of the inscribed angle when they subtend the same arc.

Applying the relationship:
\[ \angle AOB = 2 \times \angle ACB = 2 \times 35^\circ = 70^\circ \]
Thus, the correct answer is:

B

• Example-2: Image Generation

User:
Prompt:

A small puppy taking a nap next to a controller.

Model: [generated image]

User:
Prompt:

In the art piece, a realistically depicted young girl with flowing blonde hair gazes intently into the distance, her eyes reflecting the vibrant hues of a spring forest. The verdant greens and soft pastels of the budding trees are captured in subtle brushstrokes, giving the scene a serene and tranquil atmosphere. The minimalist composition focuses on the girl's expression of wonder and the lush woodland background, while the texture of the oil paint adds depth and richness to the canvas.

Model: [generated image]

• Example-3: Image Editing

User: [input image]

Prompt:
Remove the bottles of water on the table.

Model: [edited image]

User: [input image]

Prompt:
Change the microphone color from black to white.

Model: [edited image]

User: [input image]

Prompt:
Change the person's shirt color to blue.

Model: [edited image]

User: [input image]

Prompt:
Add a hat.

Model: [edited image]

• Example-4: Low-Level Image Restoration

User: [input image]

Prompt:
Remove the rain from this image.

Model: [restored image]

User: [input image]

Prompt:
Improve image clarity by removing the haze from the scene.

Model: [restored image]

• Example-5: Controllable Generation

User: [input image]

Prompt:
Canny edge to image: Bachalpsee Lake and Wetterhorn by Corbis

Model: [generated image]

User: [input image]

Prompt:
Depth map to image: Petra by Night under the Stars by paulmp

Model: [generated image]

• Example-6: Image Generation with vs. without Chain-of-Thought

User:

Prompt:
An apple orchard during the winter

Model:

Apple orchard generated without thinking

User:

Prompt:
You should first think step by step about how to construct the image, including background, objects, colors, lighting, and style.
The reasoning process and answer are enclosed within <think></think> and <answer> </answer> tags, respectively. Image generation: An apple orchard during the winter.

Model:
<think> 1. Set the scene to depict an apple orchard during the winter season. 
2. Populate the orchard with multiple apple trees, ensuring the trees have visible branches but no leaves, indicating winter dormancy.
3. Include a variety of apple trees with different shapes and sizes, but all without foliage.
4. Position the trees so they fill the frame, showing a dense orchard environment.
5. Use a muted, natural color palette dominated by browns and grays to reflect the winter atmosphere.
6. Add subtle snow or frost on the tree branches to emphasize the winter setting.
7. Ensure the ground beneath the trees is covered with a light dusting of snow or frost, consistent with winter conditions.
8. Avoid including any leaves, fruits, or other vegetation on the trees to maintain the winter appearance.
9. Keep the overall lighting soft and diffused, typical of an overcast winter day.
10. Avoid adding any additional elements such as people, animals, or buildings to maintain focus on the orchard itself. </think>

Apple orchard generated with thinking

• Example-7: Audio + Image Understanding

User:

A good friend of mine took a photo and sent it to me. Can you help me figure out which city in China it was taken in?

Prompt:
You should listen to the user's question and respond in Brian's voice.

Model:

This photo was most likely taken in Guilin, China, because it shows the very famous scenic spots Elephant Hill and Elephant Trunk Hill. These two peaks are well known in Guilin's Two Rivers and Four Lakes scenic area. In the photo, a person can be seen touring the area by boat. Elephant Hill and Elephant Trunk Hill are among Guilin's landmark sights and must-visit attractions for visitors; their shapes are highly distinctive and leave a deep impression.

• Example-8: Audio + Video Understanding

User:

What dance are the people in the video doing?

Prompt:
You should listen to the user's question and respond in Jenny's voice.

Model:

okay, so the people in the video, um, they are doing ballet.

• Example-9: Long Speech ASR

User:

Prompt:
Transcribe this speech to text.

Model:

Goliath makes another discovery

There were certainly no nearer the solution of their problem the poor little things cried cynthia think of them having been turned to the wall all these years now what is the sense of it two innocent babies like that

But joyce had not been listening all at once she put down her candle on the table and faced her companion the twin brother did something she didn't like and she turned his picture to the wall hers happened to be on the same frame too but she evidently didn't care about it

Now what have you to say cynthia sprough I thought we were stumped again when I first saw that picture but it's been of some use after all do you suppose the miniature was a copy of the same thing what in the world is it queried joyce

They worry me terribly and besides i'd like to see what this lovely furniture looks like without such quantities of dust all over it good scheme sin

We'll come in here this afternoon with old clothes on and have a regular house cleaning it can't hurt anything i'm sure for we won't disturb things at all

This thought however did not enter the heads of the enthusiastic pair smuggling the house cleaning paraphernalia into the cellar window unobserved that afternoon proved no easy task for cynthia had added a whisk broom and dust pan to the outfit

The lure proved too much for him and he came sporting after it as friskily as a young kitten much to cynthia's delight when she caught sight of him oh let him come along she urged I do love to see him about that old house he makes it sort of cosier

Now let's dust the furniture and pictures yet little as it was it had already made a vast difference in the aspect of the room surface dust at least had been removed and the fine old furniture gave a hint of its real elegance and polish

Then she suddenly remarked and my pocket money is getting low again and you haven't any left as usual they say illumination by candlelight is the prettiest in the world why it's goliath as usual they both cried peering in isn't he the greatest for getting into odd corners

Forgetting all their weariness they seized their candles and scurried through the house finding on occasional paper tucked away in some odd corner well i'm convinced that the boarded up house mystery happened not earlier than april sixteenth eighteen sixty one and probably not much later

• Example-10: Long Speech TTS

User:

Prompt:
You should read the user's query in Jenny's voice.

We present Uni MOE 2 from the Lychee family. As a fully open-source omni modal model, it substantially advances the capabilities of Lychee's Uni MOE series in language-centric multi-modal understanding, reasoning, and generating. Based on the qianwen 2.5 dense architecture, we train Uni MOE 2 from scratch through three core contributions: dynamic capacity Mixture of Experts design, a progressive training strategy enhanced with reinforcement strategy, and a carefully curated multimodal data matching technique.

It is capable of cross and tri modality understanding, as well as generating images, text, and speech. Architecturally, our new MOE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio temporal cross modality alignment in the self-attention layer.

For training, following cross-modal pretraining, we use a progressive SFT strategy that activates modality specific experts and is enhanced by balanced data composition and an iterative GSPO DPO method to stabilize RL training. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues.

Model: [generated speech]

User:

Prompt:
Generate speech from the user-provided content in Brian's voice.

Ice cream! You sweet angel, following me like a shadow all through the summer. Your cream is so fragrant and sweet, like morning sunlight spilling over a cool wheat field. I look at you, my heart full of curiosity and delight.

Yet one day I was sent to a strange place. It was like a giant refrigerator, filled with cold things I had never seen before, including lectures, sleep, and the adjective "big". The place looked cold and heartless, just like ice cream, and it was frightening.

I stayed there for a while, learning all kinds of strange knowledge every day. I even began to sleep, turning into a sleeping machine. I used my theoretical knowledge to fight this world, trying to make it better. But whenever I tried to wake up, I found I had forgotten everything that had just happened.

Finally, one day, I decided to rebel against this crazy world. I took out my "night shift" and began my attack. I used my noun "lecture" to explain the world, my verb "sleep" to support my argument, and my adjective "big" to emphasize my strength. Then I unleashed my theme: the power of ice cream. Ice cream melted all my resistance; it made me lose my mind and forget my original goal.

I kept sleeping until a rainstorm woke me. That evening, I had dinner with ice cream. We sat in the hall outside, savoring its sweetness. Looking at the ice cream, my heart filled with gratitude. I knew this was the life I wanted: full of challenges, full of sweetness. Ice cream, you are my savior in this world. I will remember everything you have given me, and I will keep moving forward until I find my own paradise.

Model: [generated speech]

• Example-11: Multi-turn Conversation

User:

If I want to travel to the northernmost part of China, can you recommend a few well-known cities?

Model:

If you want to travel to the northernmost part of China, consider Mohe. Mohe is located in the Daxing'anling region of Heilongjiang Province and is one of China's northernmost cities. It offers magnificent natural scenery, including China's northernmost flood-control embankment, railway station, and airport. Mohe is also an important frontier region of northern China with a distinctive border culture. If you enjoy natural scenery and frontier culture, Mohe would be a good choice.

User:

Are there any other places worth visiting in Heilongjiang?

Model:

Of course, Heilongjiang has many other places worth a visit. For example, Harbin is the ice city of northern China, with spectacular ice-and-snow scenery and a distinctive northern culture. Harbin Ice and Snow World and the Ice and Snow Festival are famous tourist attractions. Harbin also has many historic buildings and cultural relics, such as Shengjing Hill, Longshan, and Saint Sophia Cathedral. If you prefer natural scenery, you can also visit Wuchang, Shangzhi, and other places near Harbin, which offer beautiful landscapes and abundant wildlife. In short, Heilongjiang has much to explore, and you can choose according to your own interests.

User:

Please introduce the best university in Harbin.

Model:

The best university in Harbin is Harbin Institute of Technology (HIT). HIT is a famous comprehensive university in China and one of the country's oldest universities. It has advanced research facilities and a high-caliber faculty, with particularly strong disciplines in engineering, science, and management. In addition, HIT has established close partnerships with many well-known companies and research institutes at home and abroad, providing students with abundant internship and employment opportunities. If you are interested in engineering, science, or management, HIT would be a good choice.


Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Vicuna, BEATs, and Whisper.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are further restricted to uses that comply with the license agreements of CLIP, BEATs, Whisper, LLaMA, and Vicuna. The dataset, and models trained on it, must not be used outside of research purposes.