Topic: "multi-token-prediction"

nemotron-3-super gpt-oss-120b qwen3.5-122b-a10b nvidia perplexity replit base44 vllm llama.cpp ollama togethercompute baseten wandb langchain unsloth model-architecture model-optimization inference-speed kv-cache multi-token-prediction agent-infrastructure orchestration persistent-agents model-serving product-launches karpathy ctnzr bnjmn_marie artificialanlys

NVIDIA’s Nemotron 3 Super is a 120B parameter / ~12B active open model featuring a hybrid Mamba-Transformer / SSM Latent MoE architecture and 1M context window, delivering up to 2.2x faster inference than GPT-OSS-120B in FP4 with strong throughput gains. It supports agentic workloads and is unusually open with weights, data, and infrastructure details released. The model scored 36 on the AA Intelligence Index, outperforming GPT-OSS-120B but behind Qwen3.5-122B-A10B. Community and infrastructure support from projects like vLLM, llama.cpp, Ollama, Together, Baseten, W&B Inference, LangChain, and Unsloth GGUFs was immediate. Key technical innovations include native multi-token prediction (MTP) and a significant KV-cache efficiency advantage. On the product side, a shift towards persistent agent runtimes and orchestration layers is highlighted, with Andrej Karpathy advocating for a "bigger IDE" concept where agents replace files as the unit of work, enabling legible, forkable agentic organizations with real-time control. New launches fitting this vision include Perplexity’s Personal Computer, an always-on local/cloud hybrid running on Mac mini, and Computer for Enterprise orchestrating 20 specialized models and 400+ apps. Replit Agent 4 offers a collaborative, canvas-like workflow with parallel agents, while Base44 Superagents provide integrated solutions for nontechnical users. The engineering focus is increasingly on the orchestration harness rather than just the model.

Dec 16, 2025

OpenAI GPT Image-1.5 claims to beat Nano Banana Pro, #1 across all Arenas, but completely fails Vibe Checks

gpt-image-1.5 nano-banana-pro mimo-v2-flash deepseek-v3.2 openai gemini xiaomi lmsys deepseek openrouter image-generation instruction-following benchmarking model-efficiency long-context multi-token-prediction hybrid-attention model-optimization inference-speed agentic-workflows model-architecture model-quantization fuli_luo eliebakouch

OpenAI released its new image model GPT Image 1.5, featuring precise image editing, better instruction following, improved text and markdown rendering, and faster generation up to 4×. Despite topping multiple leaderboards like LMArena (1277), Design Arena (1344), and AA Arena (1272), user feedback from Twitter, Reddit, and Discord communities is largely negative compared to Nano Banana Pro by Gemini. Xiaomi introduced the MiMo-V2-Flash, a 309B MoE model optimized for inference efficiency with 256K context window, achieving state-of-the-art scores on SWE-Bench. The model uses Hybrid Sliding Window Attention and multi-token prediction, offering significant speedups and efficiency improvements. The timing of OpenAI's launch amid competition from Gemini and Nano Banana Pro affects user sentiment, highlighting challenges in benchmarking relevance.

Sep 11, 2025

Qwen3-Next-80B-A3B-Base: Towards Ultimate Training & Inference Efficiency

qwen3-next qwen3 mixtral-8x7b gemini-2.5-pro alibaba mistral-ai deepseek snowflake hugging-face baseten nvidia mixture-of-experts model-sparsity gated-attention hybrid-architecture rmsnorm model-stability model-training inference-optimization multi-token-prediction model-deployment justinlin610 teortaxestex yuchenj_uw

MoE (Mixture of Experts) models have become essential in frontier AI models, with Qwen3-Next pushing sparsity further by activating only 3.7% of parameters (3B out of 80B) using a hybrid architecture combining Gated DeltaNet and Gated Attention. This new design includes 512 total experts (10 routed + 1 shared), Zero-Centered RMSNorm for stability, and improved MoE router initialization, resulting in ~10× cheaper training and 10× faster inference compared to previous models. Alibaba's Qwen3-Next reportedly outperforms Gemini-2.5-Flash-Thinking and approaches the flagship 235B model's performance, with deployments on Hugging Face, Baseten, and native vLLM support for efficient inference.

Sep 02, 2025

Anthropic raises $13B at $183B Series F

claude-code gpt-5 grok-4 claude sonnet-4 glm-4.5 deepseek-r1 anthropic mistral-ai x-ai salesforce galileo openpipe zhipu thudm enterprise-connectors agent-benchmarking reinforcement-learning inference-optimization memory-optimization cuda multi-token-prediction speculative-decoding tensor-offload performance-optimization real-time-guardrails cost-optimization swyx emilygsands _philschmid _lewtun omarsar0 _avichawla corbtt

Anthropic achieved a $183B post-money valuation in Series F funding by September 2025, growing from about $1B run-rate in January to over $5B run-rate by August 2025. Their Claude Code product saw >10x usage growth in three months and reached $500M run-rate revenue, serving over 300,000 business customers with a nearly 7x increase in large accounts. Mistral AI launched Le Chat with 20+ MCP connectors integrating with major SaaS platforms and persistent memory features. Benchmarking updates highlight GPT-5 leading agent intelligence indices, with strong performances from xAI's Grok and Anthropic's Claude families. Reliability tooling and agent evaluation advances were shared by Galileo, OpenPipe, and others. Zhipu/THUDM open-sourced Slime v0.1.0, enhancing RL infrastructure behind GLM-4.5 with significant decoding speed improvements and advanced tensor offload techniques.

Dec 27, 2024

DeepSeek v3: 671B finegrained MoE trained for $5.5m USD of compute on 15T tokens

deepseek-v3 gpt-4o claude-3.5-sonnet llama-3 deepseek-ai hugging-face openai anthropic mixture-of-experts model-training model-optimization reinforcement-learning chain-of-thought multi-token-prediction synthetic-data model-distillation fine-tuning attention-mechanisms gpu-optimization nrehiew_ denny_zhou

DeepSeek-V3 has launched with 671B MoE parameters and trained on 14.8T tokens, outperforming GPT-4o and Claude-3.5-sonnet in benchmarks. It was trained with only 2.788M H800 GPU hours, significantly less than Llama-3's 30.8M GPU-hours, showcasing major compute efficiency and cost reduction. The model is open-source and deployed via Hugging Face with API support. Innovations include native FP8 mixed precision training, Multi-Head Latent Attention scaling, distillation from synthetic reasoning data, pruning and healing for MoEs with up to 256 experts, and a new multi-token prediction objective enabling lookahead token planning. Research highlights also cover the OREO method and Natural Language Reinforcement Learning (NLRL) for multi-step reasoning and agent control.