4M Tokens? MiniMax-Text-01 Outperforms DeepSeek V3


Chinese AI labs are making steady progress in the AI race. Models like DeepSeek-V3 and Qwen 2.5 are giving GPT-4o, Claude, and Grok tough competition. What sets these models apart? Cost efficiency, openness, and strong performance: many are open source and available under commercially permissive licenses, making them accessible to a wide range of developers and businesses.

MiniMax-Text-01 is the latest addition to the Chinese LLMs. With a 4 million token context length—far exceeding industry standards of 128K-256K tokens—it sets a new benchmark in handling long-context tasks. The model’s Hybrid Attention architecture ensures operational efficiency, and its open-source, commercially permissive license empowers innovation without the burden of hefty costs.

Let’s explore MiniMax-Text-01!

Hybrid Architecture

MiniMax-Text-01 combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE) to achieve a balance between efficiency and performance.

Source: MiniMax-Text-01
  • 7/8 Linear Attention (Lightning Attention-2):
    • Lightning Attention is a linear attention mechanism that reduces computational complexity from O(n²d) to O(d²n), making it highly efficient for long-context tasks.
    • The mechanism involves (sketched in code after this list):
      1. Input transformation using SiLU activation.
      2. Matrix operations to compute attention scores.
      3. Normalization and scaling using RMSNorm and sigmoid.
  • 1/8 Softmax Attention:
    • Traditional attention with RoPE (Rotary Position Embedding) applied to half the attention head dimension, enabling length extrapolation without performance degradation.
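
To make these steps concrete, here is a minimal, non-causal PyTorch sketch of a linear attention block in the spirit of Lightning Attention: a SiLU feature map on queries and keys, the kernel trick that replaces the O(n²d) attention product with O(d²n) matrix products, and RMSNorm with a sigmoid output gate. Treat it as an illustration under simplifying assumptions: it skips the causal masking and block tiling of Lightning Attention-2, and all module names and sizes are invented for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionSketch(nn.Module):
    """Simplified, non-causal linear attention: SiLU feature map, kernel-trick
    matrix products (O(d^2 n) instead of O(n^2 d)), RMSNorm, sigmoid gate."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)  # sigmoid output gate
        self.norm = nn.RMSNorm(d_model)                       # requires PyTorch >= 2.4
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # 1. Input transformation with SiLU activation.
        q = F.silu(q).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = F.silu(k).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        # 2. Kernel trick: build K^T V once (d x d per head), then multiply by Q,
        #    so the n x n attention matrix is never materialized.
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        out = out.transpose(1, 2).reshape(b, n, -1)
        # 3. Normalization and sigmoid gating before the output projection.
        return self.out(self.norm(out) * torch.sigmoid(self.gate(x)))

print(LinearAttentionSketch()(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])

In the full model, seven of every eight transformer layers use linear attention of this kind, while every eighth layer falls back to standard softmax attention with RoPE.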

Mixture-of-Experts (MoE) Strategy

MiniMax-Text-01 employs a unique MoE architecture that differs from models like DeepSeek-V3:

Source: MiniMax-Text-01
  • Token Drop Strategy: Allows overflow tokens to be dropped and uses an auxiliary loss to balance token distribution across experts, unlike DeepSeek-V3’s dropless strategy.
  • Global Router: Optimizes token allocation to ensure balanced workloads across expert groups.
  • Top-k Routing: Selects the top-2 experts per token, compared to DeepSeek’s top-8 plus 1 shared expert (a rough routing sketch follows this list).
  • Expert Configuration:
    • 32 experts (vs. DeepSeek’s 256 + 1 shared).
    • Expert Hidden Dimension: 9216 (vs. DeepSeek’s 2048).
    • Activated Expert Hidden Dimension per Layer: 18,432 (2 × 9216, matching DeepSeek’s 9 × 2048).
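
As a rough illustration of the routing described above, the sketch below implements a scaled-down top-2 MoE layer with a Switch-style auxiliary load-balancing loss. The expert count, hidden sizes, and exact form of the auxiliary loss are stand-ins chosen for readability, not MiniMax's configuration (32 experts, 9216-wide expert hidden layers, and a global router).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoESketch(nn.Module):
    """Scaled-down top-2 MoE layer with a Switch-style auxiliary balance loss."""
    def __init__(self, d_model: int = 512, d_expert: int = 1024,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        b, n, d = x.shape
        tokens = x.reshape(-1, d)                        # flatten (batch, seq) into tokens
        probs = self.router(tokens).softmax(dim=-1)      # (num_tokens, n_experts)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # pick top-2 experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize the two gate weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            hit = top_idx == e                           # (num_tokens, top_k) bool mask
            rows = hit.any(dim=-1)
            if rows.any():
                weight = (top_p * hit).sum(dim=-1, keepdim=True)[rows]
                out[rows] += weight * expert(tokens[rows])

        # Auxiliary load-balancing loss: fraction of assignments each expert
        # receives times its mean router probability, summed over experts.
        load = F.one_hot(top_idx, probs.shape[-1]).float().sum(dim=(0, 1)) / top_idx.numel()
        importance = probs.mean(dim=0)
        aux_loss = probs.shape[-1] * (load * importance).sum()
        return out.reshape(b, n, d), aux_loss

y, aux = Top2MoESketch()(torch.randn(2, 8, 512))
print(y.shape, float(aux))

A token-drop setup would additionally cap each expert's capacity and drop assignments that overflow it; the sketch omits capacity limits for brevity.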

Training and Scaling Strategies

  • Training Infrastructure:
    • Trained on ~2000 H100 GPUs using advanced parallelism techniques like Expert Tensor Parallelism (ETP) and Linear Attention Sequence Parallelism Plus (LASP+).
    • Optimized for 8-bit quantization, ensuring efficient inference on 8x80GB H100 nodes.
  • Training Data:
    • Trained on ~12 trillion tokens with a WSD-like (warmup-stable-decay) learning rate schedule.
    • The data mixes sources of varying quality, with global deduplication and 4x repetition of high-quality data.
  • Long-Context Training:
    • Main training runs at an 8k context length with a RoPE base of 10k, followed by three context-extension phases:
      1. Phase 1: 128k context length, 5M RoPE base, 30% short and 70% medium sequences.
      2. Phase 2: 512k context length, 10M RoPE base, 35% short, 35% medium, and 30% long sequences.
      3. Phase 3: 1M context length, 10M RoPE base, 30% short, 30% medium, and 40% long sequences.
    • Linear Interpolation: Mitigates distribution shifts as the context length is scaled up (a rough RoPE sketch follows this list).
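
The snippet below is a small, self-contained illustration of the two knobs mentioned above: raising the RoPE base (10k for main training, then 5M and 10M in the extension phases) stretches the rotation wavelengths so distant positions stay distinguishable, while linear position interpolation squeezes long positions back into a familiar range. It is a generic sketch of these standard techniques, not the exact interpolation scheme used in the MiniMax report.

import torch

def rope_freqs(dim: int, base: float) -> torch.Tensor:
    # Per-pair rotation frequencies for RoPE: theta_i = base ** (-2i / dim).
    return base ** (-torch.arange(0, dim, 2).float() / dim)

def rope_angles(positions: torch.Tensor, dim: int, base: float, scale: float = 1.0) -> torch.Tensor:
    # Rotation angle for every (position, frequency) pair. scale > 1 applies
    # simple linear position interpolation (positions are divided by scale).
    return torch.outer(positions / scale, rope_freqs(dim, base))

dim = 128
positions = torch.arange(0, 1_000_001, 100_000).float()

# RoPE bases reported across training: 10k for main training, then 5M and 10M.
for base in (1e4, 5e6, 1e7):
    angle = rope_angles(positions, dim, base)[-1, -1]
    print(f"base={base:>10.0f}  slowest-frequency angle at position 1M: {angle:10.4f} rad")

# The same 1M-token position, linearly interpolated into an 8k-trained range.
print(rope_angles(positions, dim, 1e4, scale=1_000_000 / 8_192)[-1, -1])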

Post-Training Optimization

  • Iterative Fine-Tuning:
    • Combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in cycles.
    • RL uses Offline DPO and Online GRPO for alignment (a generic DPO loss sketch follows this list).
  • Long-Context Fine-Tuning:
    • Short-Context SFT → Long-Context SFT → Short-Context RL → Long-Context RL.
    • This phased approach is critical for achieving superior long-context performance.
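
For reference, the offline DPO objective named above can be written in a few lines on sequence-level log-probabilities. This is the standard, generic DPO loss shown only to make the idea concrete; it is not MiniMax's implementation, and Online GRPO is a separate reward-based method not sketched here.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Offline DPO: push the policy to prefer the chosen response over the
    # rejected one by a larger margin than the frozen reference model does.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of four preference pairs (sequence-level log-probabilities).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -8.0]),
                torch.tensor([-14.0, -13.0, -12.5, -10.0]),
                torch.tensor([-12.5, -10.0, -11.2, -8.5]),
                torch.tensor([-13.0, -12.0, -12.0, -9.5]))
print(loss)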

Key Innovations

  • DeepNorm: A post-norm scheme that scales residual connections to keep very deep stacks stable during training (see the sketch after this list).
  • Batch Size Warmup: Gradually increases batch size from 16M to 128M tokens to optimize training dynamics.
  • Efficient Parallelism:
    • Ring Attention: Reduces memory overhead for long sequences.
    • Padding Optimization: Minimizes wasted computation during training.
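
As a quick illustration of the residual-scaling idea, here is a minimal DeepNorm-style post-norm block. The alpha value follows the published DeepNet heuristic for decoder-only models, (2 * num_layers) ** 0.25; the layer count and sizes are placeholders, not MiniMax's actual values.

import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    # Post-norm residual block: scale the residual branch by alpha, add the
    # sublayer output, then apply LayerNorm. The larger residual weight keeps
    # per-layer updates small, which stabilizes very deep stacks.
    def __init__(self, d_model: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25  # DeepNet heuristic for decoder-only models
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormBlock(512, nn.Linear(512, 512), num_layers=80)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])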

Core Academic Benchmarks

Source: MiniMax-Text-01

General Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |

Reasoning Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPQA* | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |

Mathematics & Coding Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
Source: MiniMax-Text-01

You can check out the other evaluation results here.

Let’s Get Started with MiniMax-Text-01

This script sets up and runs the MiniMax-Text-01 language model with the Hugging Face transformers library. It configures a device map for a multi-GPU machine, applies int8 weight quantization for efficient inference, and generates a response to a user-provided prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, GenerationConfig

# QuantoConfig ships with transformers; int8 weight quantization additionally
# requires the optimum-quanto package to be installed.
from transformers import QuantoConfig

# Load Hugging Face config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# Quantization config (int8 recommended)
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# Set device map for multi-GPU setup
world_size = 8  # Assume 8 GPUs
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
# Distribute the transformer layers evenly across the GPUs
# (assumes num_hidden_layers is divisible by world_size).
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")

# Prepare input prompt
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
if hasattr(tokenizer, 'apply_chat_template'):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    raise NotImplementedError("The tokenizer does not support 'apply_chat_template'. Check the documentation or update the tokenizer version.")

# Tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Load model with the quantization config applied
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# Generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

End Note

MiniMax-Text-01 is a highly capable model with state-of-the-art performance in long-context and general-purpose tasks. While it has some areas for improvement, its open-source nature, cost efficiency, and innovative architecture make it a strong contender in the AI landscape. It’s particularly well-suited for applications requiring extensive memory and complex reasoning, but may need further refinement for coding-specific tasks.

Stay tuned to Analytics Vidhya News for more such insightful content!



