Model: "sonnet-4.5"
DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling
deepseek-v3.2 deepseek-v3.2-speciale gpt-5-high sonnet-4.5 gemini-3-pro deepseek_ai lm-arena agentic-ai reinforcement-learning large-context-windows model-benchmarking model-performance multi-agent-systems model-training model-deployment suchenzang teortaxestex
DeepSeek launched the DeepSeek V3.2 family, including Standard, Thinking, and Speciale variants, with a context window of up to 131K tokens and benchmark results competitive with GPT-5-High, Sonnet 4.5, and Gemini 3 Pro. The release features a novel Large Scale Agentic Task Synthesis Pipeline focused on agentic behaviors, along with improvements to reinforcement learning post-training algorithms. The models are available on platforms like LM Arena, with pricing around $0.28/$0.42 per million input/output tokens. Community feedback is mixed, praising the frontier reasoning capabilities but critiquing the chat UI experience. Susan Zhang and Teortaxes provided commentary on the release.
Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus
claude-opus-4.5 gemini-3-pro gpt-5.1-codex-max opus-4.1 sonnet-4.5 anthropic amazon google coding agents tool-use token-efficiency benchmarking api model-pricing model-performance effort-control context-compaction programmatic-tool-calling alexalbert__ btibor91 scaling01 klieret
Anthropic launched Claude Opus 4.5, a new flagship model excelling in coding, agents, and tool use, with a 3x price cut compared to Opus 4.1 and improved token efficiency, using 76% fewer output tokens. Opus 4.5 set a new SOTA on SWE-bench Verified with 80.9% accuracy, breaking the 80% barrier and surpassing Gemini 3 Pro and GPT-5.1-Codex-Max; it also shows strong performance on ARC-AGI-2 and BrowseComp-Plus. The update includes advanced API features such as effort control, context compaction, and programmatic tool calling, which improve tool accuracy and reduce token usage. Claude Code is now bundled with Claude Desktop, and new integrations like Claude for Chrome and Excel are rolling out.
Gemini 3 Pro — new GDM frontier model 6, Gemini 3 Deep Think, and Antigravity IDE
gemini-3-pro gemini-2.5 grok-4.1 sonnet-4.5 gpt-5.1 google google-deepmind multimodality agentic-ai benchmarking context-window model-performance instruction-following model-pricing api model-release reasoning model-evaluation sundarpichai _philschmid oriol_vinyals
Google launched Gemini 3 Pro, a state-of-the-art model with a 1M-token context window, multimodal reasoning, and strong agentic capabilities, priced significantly higher than Gemini 2.5. It leads major benchmarks, surpassing Grok 4.1 and competing closely with Sonnet 4.5 and GPT-5.1, though GPT-5.1 excels in ultra-long summarization. Independent evaluations from Artificial Analysis, Vending Bench, ARC-AGI-2, Box, and PelicanBench validate Gemini 3 as a frontier LLM. Google also introduced Antigravity, an agentic IDE powered by Gemini 3 Pro and other models, featuring task orchestration and human-in-the-loop validation. The launch marks Google's strong return to the AI frontier, with more models expected soon. "Google is very, very back in the business."
not much happened today
gpt-5.1 sonnet-4.5 opus-4.1 gemini-3 openai anthropic langchain-ai google-deepmind adaptive-reasoning developer-tools prompt-optimization json-schema agent-workflows context-engineering structured-outputs model-release benchmarking swyx allisontam_ gdb sama alexalbert__ simonw omarsar0 abacaj scaling01 amandaaskell
OpenAI launched GPT-5.1, featuring "adaptive reasoning" and developer-focused API improvements, including prompt caching and a reasoning_effort toggle for latency/cost tradeoffs. Independent analysis shows a minor intelligence bump with significant gains in agentic coding benchmarks. Anthropic introduced structured outputs with JSON schema compliance in public beta for Sonnet 4.5 and Opus 4.1, enhancing tooling and code execution workflows. Rumors of an Opus 4.5 release were debunked. LangChain released a "Deep Agents" package and a context-engineering playbook to optimize agent workflows. The community is eagerly anticipating Google DeepMind's Gemini 3 model, hinted at on social media and at upcoming AIE CODE events. "Tickets are sold out, but side events and volunteering opportunities are available."