Person: "_catwu"
not much happened today
arc-agi-3 claude-code anthropic langchain arcprize primeintellect agentic-reasoning interactive-environments benchmarking efficiency-metrics zero-preparation-generalization agent-infrastructure trainable-agents classifier-approval fchollet mikeknoop scaling01 _rockt mark_k andykonwinski bradenjhancock jeremyphoward togelius bracesproul hwchase17 caspar_br _catwu
The ARC-AGI-3 benchmark, introduced by @arcprize and François Chollet, resets the frontier for general agentic reasoning: humans solve 100% of tasks versus under 1% for current models, with the design focusing on zero-preparation generalization and human-like learning efficiency. The scoring protocol sparked debate over its harsh efficiency-based metric compared to prior ARC versions and to other benchmarks like NetHack. The community acknowledges that the benchmark highlights weaknesses of current LLM agents in interactive, sparse-feedback environments. Concurrently, agent infrastructure advances: LangChain launched Fleet shareable skills for reusable domain knowledge, and Anthropic revealed a Claude Code auto mode with classifier-mediated approval that balances autonomy against manual confirmation. Browser and coding agents are evolving into trainable systems rather than prompt wrappers, exemplified by the BrowserBase and Prime Intellect collaboration.
not much happened today
opus-4.6 glm-5 anthropic ibm perplexity-ai llamaindex deepseek google-chrome persistent-memory agent-infrastructure cross-device-synchronization long-context sparse-attention inference-optimization computer-architecture task-completion systems-performance pamelafox tadasayy llama_index bromann dair_ai omarsar0 abxxai teknuim bcherny kimmonismus _catwu alexalbert__ realyushibai
MCP tools remain relevant for deterministic APIs despite ergonomic criticisms, and new web MCP support in Chrome v146 enables continuous browsing agents. Persistent memory is emerging as a key differentiator for agents: IBM reports improved task completion rates, and multi-agent memory is being framed as a computer-architecture challenge. Agent UX is evolving toward always-on, cross-device operation, exemplified by Perplexity Computer on iOS and Claude Code session management. Anthropic made the 1M-token context window the default for Opus 4.6, with no extra long-context API charges, achieving 78.3% on MRCR v2 at 1M tokens. Sparse-attention optimizations such as IndexCache in DeepSeek Sparse Attention yield significant speedups on large models with minimal code changes.
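The core idea behind sparse-attention schemes like DeepSeek Sparse Attention is to score all keys cheaply, then run the expensive softmax-weighted sum over only the top-k of them. The internals of IndexCache and DeepSeek's indexer are not public in this summary, so the following is a minimal NumPy sketch of the generic top-k idea only; all names are illustrative.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend only to the k highest-scoring keys (illustrative sketch,
    not DeepSeek's implementation).

    q: (d,) query vector; K, V: (n, d) key/value matrices.
    """
    scores = K @ q / np.sqrt(q.shape[0])          # (n,) scaled attention scores
    idx = np.argpartition(scores, -k)[-k:]        # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())   # softmax over selected keys only
    w /= w.sum()
    return w @ V[idx]                             # (d,) sparse attention output

rng = np.random.default_rng(0)
n, d = 64, 8
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=d)
out = topk_sparse_attention(q, K, V, k=8)
print(out.shape)  # (8,)
```

With `k == n` this reduces exactly to dense softmax attention; the speedup comes from shrinking the softmax and value gather from `n` to `k` entries while the cheap scoring pass stays linear.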
Claude Code Anniversary + Launches from: Qwen 3.5, Cursor Demos, Cognition Devin 2.2, Inception Mercury 2
qwen3.5-flash qwen3.5-35b-a3b qwen3.5-122b-a10b qwen3.5-27b qwen3.5-397b-a17b gpt-5.3-codex claude-code alibaba openai anthropic cursor huggingface model-architecture reinforcement-learning quantization context-windows agentic-ai api websockets software-ux enterprise-workflows model-deployment awnihannun andrew_n_carr justinlin610 unslothai terryyuezhuo haihaoshen 0xsero ali_tongyilab scaling01 gdb noahzweben _catwu
Alibaba launched the Qwen 3.5 Medium Model Series, featuring Qwen3.5-Flash, Qwen3.5-35B-A3B (MoE), and Qwen3.5-122B-A10B (MoE), emphasizing efficiency over scale with innovations like a 1M-token context window and INT4 quantization. OpenAI released GPT-5.3-Codex via the Responses API with enhanced file-input support and faster WebSocket-based throughput. Anthropic introduced Claude Code Remote Control, enabling terminal-session continuation from mobile, and expanded enterprise workflow features. Cursor shifted its UX to agent demo videos instead of diffs, highlighting new interaction modes.
xAI raises $20B Series E at ~$230B valuation
grok-5 claude-code xai nvidia cisco fidelity valor-equity-partners qatar-investment-authority mgx stepstone-group baron-capital-group hugging-face amd ai-infrastructure supercomputing robotics ai-hardware agentic-ai context-management token-optimization local-ai-assistants aakash_gupta fei-fei_li lisa_su clementdelangue thom_wolf saradu omarsar0 yuchenj_uw _catwu cursor_ai
xAI, Elon Musk's AI company, closed a massive $20 billion Series E at a valuation of about $230 billion, with investors including Nvidia, Cisco Investments, and others. The funds will support AI infrastructure expansion, including the Colossus I and II supercomputers and training Grok 5, leveraging data from X's 600 million monthly active users. One commenter observed that the 600 million MAU figure combines X platform users with Grok users, calling it "a clever framing choice." At CES 2026 the focus was "AI everywhere," with a strong emphasis on AI-first hardware and on integration between NVIDIA and Hugging Face's LeRobot for robotics development; the Reachy Mini robot is gaining traction as a consumer robotics platform. In software, Claude Code is emerging as a popular local/private coding assistant, with new UI features in Claude Desktop and innovations like Cursor's dynamic context reducing token usage by nearly 47% in multi-MCP setups.
Mistral 3: Mistral Large 3 + Ministral 3B/8B/14B open weights models
mistral-large-3 ministral-3 clara-7b-instruct gen-4.5 claude-code mistral-ai anthropic apple runway moondream sparse-moe multimodality benchmarking open-source model-licensing model-performance long-context inference-optimization instruction-following local-inference code-generation model-integration anjney_midha _akhaliq alexalbert__ _catwu mikeyk
Mistral launched the Mistral 3 family, comprising the Ministral 3 models (3B/8B/14B) and Mistral Large 3, a sparse MoE model with 675B total parameters and a 256K context window, all under the Apache 2.0 license. Early benchmarks rank Mistral Large 3 at #6 among open models, with strong coding performance, and the launch ships with broad ecosystem support including vLLM, llama.cpp, Ollama, and LM Studio integrations. Meanwhile, Anthropic acquired the open-source Bun runtime to accelerate Claude Code, which reportedly reached a $1B run rate in about six months. Anthropic also announced discounted Claude plans for nonprofits and shared internal findings on AI's impact on work.
DeepSeek V3.1: 840B token continued pretrain, beating Claude 4 Sonnet at 11% of its cost
deepseek-v3.1 seed-oss-36b computerrl gemini-2.5-pro gpt-5 claude-code gpt-oss-120b gpt-oss-20b deepseek bytedance zhipu-ai github microsoft anthropic together-ai baseten huggingface token-efficiency coding agentic-benchmarks long-context reinforcement-learning developer-tools fine-tuning multinode-training model-release teortaxestex rasbt lukehoban burkeholland _catwu cline winglian
DeepSeek quietly released DeepSeek V3.1, an open model with a 128K context window and improvements in token efficiency, coding, and agentic benchmarks. ByteDance launched the permissively licensed Seed-OSS 36B model on Hugging Face, noted for long-context and reasoning capabilities. Zhipu AI introduced ComputerRL, a reinforcement-learning framework for computer-use agents, achieving strong benchmark results. In developer tooling, GitHub Copilot expanded globally, Microsoft VS Code integrated Gemini 2.5 Pro and updated its GPT-5 agent prompts, and Anthropic launched Claude Code seats with spend controls. Open-source fine-tuning advances include Together AI adding SFT for gpt-oss-120B/20B and Baseten enabling multinode 120B training with the Truss CLI. The community noted mixed performance and ongoing post-training adjustments for DeepSeek V3.1.
not much happened today
gpt-5 gpt-oss-120b opus-4.1 sonnet-4 openai anthropic minimax context-windows model-routing model-hosting multi-tool-pipelines prompt-caching model-extraction model-pairing cost-efficiency model-optimization sama jeremyphoward jxmnop _catwu
OpenAI continues to ship small updates to GPT-5, introducing "Auto/Fast/Thinking" modes with a 196K-token context, 3,000 messages per week, and dynamic routing to cheaper models for cost efficiency. The MiniMax AI Agent Challenge offers $150,000 in prizes for AI agent development, with entries due August 25. The community is discussing GPT-OSS-120B base-model extraction, hosting, and tooling improvements, including multi-tool pipelines and flex-attention. Anthropic announced model pairing in Claude Code, with Opus 4.1 for planning and Sonnet 4 for execution, while expanding context to 1M tokens and introducing prompt caching. Key voices include @sama, @jeremyphoward, @jxmnop, and @_catwu.
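Prompt caching in the Anthropic API works by marking a long, stable prompt prefix with a `cache_control` block so subsequent calls can reuse it instead of reprocessing it. A minimal sketch of such a request body follows; the model id is illustrative (not a confirmed identifier), and no live API call is made here.

```python
# Sketch of an Anthropic Messages API request body with prompt caching enabled.
# The cache_control field marks everything up to that block as a cacheable
# prefix that later requests with the same prefix can reuse.
LONG_CONTEXT = "<large shared codebase or planning context>"

request_body = {
    "model": "claude-opus-4-1",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_CONTEXT,
            # everything up to and including this block is cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the plan for this sprint."}
    ],
}

print(request_body["system"][0]["cache_control"])
```

Only the per-turn user messages change between calls, so the large shared prefix is billed at the cheaper cached rate on cache hits.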