
@ Don't ₿elieve the Hype 🦊
2025-04-05 20:20:28
Llama 4 is out. 👀
Llama 4 Maverick (400B) and Scout (109B) - natively multimodal, multilingual and scaled to a 10 MILLION token context (Scout)! BEATS DeepSeek v3🔥
Llama 4 Maverick:
> 17B active parameters, 128 experts, 400B total parameters
> Beats GPT-4o & Gemini 2.0 Flash, competitive with DeepSeek v3 at half the active parameters
> 1417 ELO on LMArena (chat performance)
> Optimized for image understanding, reasoning, and multilingual tasks
Llama 4 Scout:
> 17B active parameters, 16 experts, 109B total parameters
> Best-in-class multimodal model for its size, fits on a single H100 GPU (with Int4 quantization)
> 10M token context window
> Outperforms Gemma 3, Gemini 2.0 Flash-Lite, Mistral 3.1 on benchmarks
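A quick back-of-the-envelope check (mine, not from the post) on why Int4 quantization lets the 109B-parameter Scout fit on a single 80 GB H100:

```python
# Rough numbers only: 4 bits per weight, weights only -- activations
# and the KV cache need extra headroom on top of this.
TOTAL_PARAMS = 109e9        # Llama 4 Scout total parameters
BYTES_PER_PARAM_INT4 = 0.5  # 4 bits = half a byte per weight
H100_MEMORY_GB = 80.0       # H100 HBM capacity

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_INT4 / 1e9
print(f"Int4 weights: ~{weights_gb:.1f} GB of {H100_MEMORY_GB:.0f} GB")  # ~54.5 GB
```

At BF16 (2 bytes/weight) the same weights would need ~218 GB, which is why the single-GPU claim hinges on Int4.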
Architecture & Innovations
Mixture-of-Experts (MoE):
> First natively multimodal Llama models with MoE
> Llama 4 Maverick: 128 experts, shared expert + routed experts for better efficiency
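The shared + routed expert pattern can be sketched in a few lines of NumPy. All shapes, names, and the top-1 routing here are illustrative assumptions, not Meta's implementation:

```python
import numpy as np

# Sketch of shared + routed experts: every token always passes through
# the shared expert, plus exactly one of 128 routed experts chosen by a
# learned router. Only a small fraction of total weights runs per token.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 64, 128, 4

tokens = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))
shared_expert = rng.standard_normal((d_model, d_model))   # always active
routed_experts = rng.standard_normal((n_experts, d_model, d_model))

scores = tokens @ router_w                         # router logits per token
choice = scores.argmax(axis=-1)                    # top-1 routed expert id
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
gate = probs[np.arange(n_tokens), choice]          # softmax weight of chosen expert

out = tokens @ shared_expert                       # shared path: every token
for i, e in enumerate(choice):                     # routed path: 1 of 128 experts
    out[i] += gate[i] * (tokens[i] @ routed_experts[e])
```

This sparsity is how Maverick keeps only ~17B of its 400B parameters active per token.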
Native Multimodality & Early Fusion:
> Jointly pre-trained on text, images, video (30T+ tokens, 2x Llama 3)
> MetaCLIP-based vision encoder, optimized for LLM integration
> Supports multi-image inputs (pre-trained with up to 48 images, tested with up to 8)
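Early fusion means image and text tokens enter the same backbone as one sequence, rather than bolting a vision model onto a finished LLM. A toy sketch, where all dimensions and the token ordering are assumptions:

```python
import numpy as np

# Toy early-fusion sketch: patch embeddings from a vision encoder are
# concatenated with text embeddings into a single sequence before the
# transformer backbone, so both modalities are trained jointly.
rng = np.random.default_rng(0)
d_model = 64
text_tokens = rng.standard_normal((10, d_model))   # embedded text
image_tokens = rng.standard_normal((16, d_model))  # vision-encoder patch tokens
fused = np.concatenate([image_tokens, text_tokens], axis=0)  # one joint sequence
```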
Long Context & iRoPE Architecture:
> 10M token support (Llama 4 Scout)
> Interleaved attention layers, some without positional embeddings
> Temperature-scaled attention for better length generalization
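The temperature-scaling idea can be sketched as below. The post doesn't give the exact formula, so the logarithmic schedule and the 0.1 coefficient are assumptions for illustration:

```python
import numpy as np

# Hedged sketch: scale attention logits by a slowly growing, position-
# dependent temperature so softmax stays peaked at very long contexts
# instead of flattening. Schedule and constants are assumed.
def scaled_logits(q, k, positions):
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)            # standard scaled dot-product
    temp = 1.0 + 0.1 * np.log1p(positions)     # grows with query position
    return logits * temp[:, None]              # sharpen late-position rows

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 32))
k = rng.standard_normal((6, 32))
out = scaled_logits(q, k, np.arange(6))        # row 0 is unscaled (temp = 1)
```

In the iRoPE design these tricks combine: layers without positional embeddings are interleaved with ordinary RoPE layers; only the temperature part is sketched here.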
Training Efficiency:
> FP8 precision (390 TFLOPs/GPU across 32K GPUs for Llama 4 Behemoth, the ~2T-parameter teacher model)
> MetaP technique: Auto-tuning hyperparameters (learning rates, initialization)
Revamped Pipeline:
> Lightweight Supervised Fine-Tuning (SFT) → Online RL → Lightweight DPO
> Hard-prompt filtering (50%+ easy data removed) for better reasoning/coding
> Continuous Online RL: Adaptive filtering for medium/hard prompts
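The hard-prompt filtering step can be sketched as a simple pass-rate cutoff. The threshold, field names, and example data below are my assumptions, not Meta's pipeline:

```python
# Hedged sketch: drop prompts the current model already solves almost
# always, keeping medium/hard prompts for SFT and online RL rounds.
prompts = [
    {"text": "2+2?",             "pass_rate": 0.99},  # easy: filtered out
    {"text": "multi-step proof", "pass_rate": 0.40},  # medium: kept
    {"text": "hard coding task", "pass_rate": 0.10},  # hard: kept
]
EASY_THRESHOLD = 0.9  # assumed cutoff for "easy"

kept = [p for p in prompts if p["pass_rate"] < EASY_THRESHOLD]
print([p["text"] for p in kept])
```

Re-running this filter as the model improves is one way to read "continuous online RL with adaptive filtering": prompts that become easy get dropped each round.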