not much happened today
trinity-large-thinking glm-5v-turbo falcon-perception qwen-3.5 claude-4.6-opus claude-sonnet-4.5 arcee z-ai tii anthropic h-company open-weights agentic-performance vision multimodality transformer-architecture early-fusion ocr gui-navigation context-compression tooling feature-flags production-ablations task-budget-management streaming modular-architecture mark_mcquade latkins willccbb xlr8harder natolambert craig_hewitt zhihu_frontier
Arcee’s Trinity-Large-Thinking was released with open weights under Apache 2.0: a 400B-total / 13B-active-parameter model with strong agentic performance, ranking #2 on PinchBench. Z.ai’s GLM-5V-Turbo is a vision coding model with native multimodal fusion and a CogViT encoder, integrated into multiple platforms. TII’s Falcon Perception offers an open-vocabulary referring-expression segmentation model built on an early-fusion transformer, alongside a competitive 0.3B OCR model. H Company’s Holo3 is a GUI-navigation model family based on Qwen3.5. A Claude Code leak revealed a minimalist agent core with a 4-layer context-compression stack, a modular architecture spanning 40+ tools, and advanced features such as task budget management and streaming tool execution, underscoring Anthropic’s agent design and operational sophistication.
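The leak describes the compression stack only at a high level. As a hedged sketch of the general idea (every function name, threshold, and stage here is a hypothetical illustration, not Anthropic's implementation), a layered context-compression pass for an agent loop might look like:

```python
# Hypothetical sketch of a layered context-compression stack for an agent
# loop. Two illustrative stages are shown (the leak mentions four); all
# names and numbers are assumptions, not the leaked code.

def truncate_tool_results(messages, max_chars=500):
    """Stage 1: clip oversized tool outputs, leaving a marker."""
    out = []
    for m in messages:
        if m["role"] == "tool" and len(m["content"]) > max_chars:
            m = {**m, "content": m["content"][:max_chars] + " …[truncated]"}
        out.append(m)
    return out

def drop_stale_turns(messages, keep_last=6):
    """Stage 2: keep the system prompt plus only the most recent turns."""
    head = [m for m in messages if m["role"] == "system"]
    tail = [m for m in messages if m["role"] != "system"][-keep_last:]
    return head + tail

def total_chars(messages):
    return sum(len(m["content"]) for m in messages)

def compress(messages, budget_chars=2000):
    """Apply stages in order, stopping as soon as the context fits."""
    for stage in (truncate_tool_results, drop_stale_turns):
        if total_chars(messages) <= budget_chars:
            break
        messages = stage(messages)
    return messages
```

The design point the leak implies is ordering cheap, lossless-ish stages (truncation of tool output) before lossy ones (dropping whole turns), so the budget is met with the least information loss.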
not much happened today
nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency
NousResearch's Nomos 1 is a 30B open math model that achieves a top Putnam score with only ~3B active parameters, enabling inference on consumer Macs. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small outperforms DeepSeek v3.2 in 71% of preference comparisons with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives such as Debug and Plan Modes. VS Code launches unified agent chat sessions, improving multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI's GDPval benchmark, with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. On the quantization front, vLLM integrates Intel's AutoRound PTQ for efficient serving, and Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. The takeaway: compositional reasoning plus specialized post-training under constrained active-parameter budgets can rival frontier closed models on formal math.
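AutoRound reportedly tunes rounding and clipping with signed gradient descent; as a hedged baseline illustration of what post-training weight quantization does at all (this is naive symmetric int8 round-to-nearest, with names of my own choosing, not Intel's API), the core round trip is:

```python
# Naive symmetric int8 round-to-nearest weight quantization: the baseline
# that PTQ methods like AutoRound refine. Illustrative sketch only.

def quantize_int8(weights):
    """Map floats to int8 using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # 1.0 guards all-zero tensors
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [x * scale for x in q]
```

The per-weight error is bounded by the scale, which is why methods that shrink the effective scale (per-channel scales, learned rounding) recover accuracy while keeping 4–8× smaller weights for serving.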
Anthropic Claude Sonnet 4.5, Claude Code 2.0, new VS Code Extensions
claude-sonnet-4.5 claude-code-v2 deepseek-v3.2-exp anthropic deepseek openai stripe swe-bench finance law stem code-execution context-editing memory-management api chrome-extension generative-ui sparse-attention long-context cost-efficiency john_schulman mike_krieger
Anthropic launched a major update with Claude Sonnet 4.5, achieving 77.2% on SWE-Bench Verified and posting improvements in finance, law, and STEM. They also released Claude Code v2 featuring checkpoints, a refreshed terminal, and a native VS Code extension, plus a new mascot, Clawd. The Claude API gained context editing and memory tools, and the Claude Agent SDK was introduced. The Claude.ai apps now support code execution and file creation, with a Chrome extension available for Max users, and Imagine with Claude offers a generative-UI research preview. Reception from developers and third-party evaluators has been positive. Meanwhile, DeepSeek released V3.2-Exp with a new Sparse Attention algorithm that significantly reduces long-context costs, cutting API prices by over 50% while maintaining quality.
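The summary doesn't detail DeepSeek's selection mechanism, so as a generic illustration of why sparse attention cuts long-context cost, here is a minimal top-k attention sketch: each query attends to only its k highest-scoring keys, dropping per-query cost from O(L) to roughly O(k). The dot-product selection rule below is an assumption for illustration, not DeepSeek's actual indexer.

```python
# Top-k sparse attention for a single query, in plain Python.
# Softmax is computed only over the k selected keys.
import math

def topk_sparse_attention(q, keys, values, k=2):
    """Attend to only the top-k keys (by raw dot-product score)."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    dim = len(values[0])
    return [sum(exps[i] / z * values[i][d] for i in top) for d in range(dim)]
```

With k fixed, attention cost no longer grows with context length for the selected-key pass, which is the lever behind the reported long-context price cuts.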