Model: "codex"
Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows
mai-thinking-1 mai-code-1-flash holo-3.1 qwen-35b sonnet-4.6 claude-code codex microsoft openrouter fal baseten hcompany_ai teksedge nous-research teknim cognition windsurf perplexity-ai mixture-of-experts context-windows benchmarking reinforcement-learning prompt-optimization agentic-ai local-inference model-family-expansion model-reporting agent-native-devices software-development model-optimization hybrid-inference desktop-agents model-quantization mustafasuleyman eliebakouch hannahajishirzi asadovsky bj2rn lateinteraction lakshyaaagrawal theturingpost kimmonismus yusuf_i_mehdi pierceboggan lukehoban nielsrogge russelljkaplan
Microsoft introduced MAI-Thinking-1, a 35B parameter MoE model with 256K context, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The broader 7-model MAI family spans reasoning, code, image, speech, and voice, with third-party availability on OpenRouter, fal, and Baseten. The detailed 109-page technical report revealed insights on scaling, MFU, RL/post-training, and data curation, highlighting no third-party distillation and advanced prompt optimization techniques. Microsoft emphasized agent-native devices and local inference with projects like Project Solara / Scout and the Surface RTX Spark Dev Box, alongside software innovations such as the Copilot desktop app and MAI-Code-1-Flash integration. Meanwhile, local-first computer-use agents like Holo 3.1 (Qwen-based, 0.8B to 35B parameters) support laptops and small workstations with optimized formats and strong benchmark results. Desktop shells for agents, including Hermes Desktop, Devin Desktop, and agent-neutral approaches compatible with Devin, Claude Code, and Codex, are proliferating, with hybrid local/cloud execution becoming the default architecture as seen in Perplexity Computer's hybrid agentic inference.
not much happened today
claude-code codex composer-2.5 langchain cognition anthropic openai microsoft cursor agent-automation agent-observability ci-cd prompt-caching remote-execution verification decomposition feedback-loops coding-agents model-efficiency instruction-following krishdpi walden_yan russelljkaplan fchollet gabriberton palashshah shannholmberg
Agent infrastructure is advancing with LangSmith Engine providing CI/CD loops for agents and SmithDB enabling low-latency querying for observability. Cognition's Devin Auto-Triage offers persistent automation for bug triage with memory and subagent structures. Anthropic improves Claude Code for large codebases with prompt cache diagnostics and faster modes, while OpenAI enhances Codex workflows with remote execution and plugins. Microsoft released remote control for GitHub Copilot CLI and VS Code. The community emphasizes verification, decomposition, and feedback loops over prompt cleverness for coding agents. Cursor's Composer 2.5 is highlighted as a strong new coding model, with plans for a larger model trained with SpaceXAI using 10× more compute on Colossus 2 hardware, praised for efficiency and collaboration improvements.
not much happened today
codex chatgpt openai github microsoft nous-research moonshot-ai langchain prime-intellect agent-infrastructure agent-first-ux remote-ssh programmatic-access-tokens sandboxing continual-learning agent-trace-data multi-agent-workflows ide-integration browser-extensions hwchase17 caspar_br bentannyhill jakebroekhuizen willccbb
OpenAI expanded Codex integration with the ChatGPT mobile app enabling remote task management and introduced Remote SSH, hooks, and programmatic tokens for enterprise automation. The IDE ecosystem is shifting to "agent-first" UX with GitHub Copilot App preview and VS Code launching a multi-agent workflow window. Open-source agents like Nous/Hermes integrated Codex runtime, and Kimi released a web bridge extension supporting multiple coding agents. LangChain released significant agent infrastructure including SmithDB for agent trace data and LangSmith Engine for trace analysis and continual learning, launching LangChain Labs to improve agents via production trace feedback loops.
not much happened today
claude codex langsmith-engine smithdb duet-agent multi-stream-llm delta-mem star-elastic cline langchain notion cursor nous-research nvidia datology agent-infrastructure developer-platforms observability long-running-state streaming orchestration pretraining-efficiency model-architecture external-memory post-training-compression data-curation vision-language-models jonas_geiping siddharth_joshi pratyush_maini
Cline, LangChain, Notion, and Cursor advanced agent infrastructure and developer platforms with innovations like Cline SDK, LangSmith Engine, SmithDB (offering 12–15× faster observability), and Notion's External Agents API integrating third-party agents such as Claude and Codex. Agent UX trends emphasize long-running state, streaming, and orchestration over chat, with tools like Duet Agent and VS Code Agents window enhancing durable execution and inspectable states. Research highlights include Nous Research's Token Superposition Training achieving 2–3× speedup in pretraining, a multi-stream LLM architecture for parallel reasoning by Jonas Geiping et al., and δ-mem external memory improving benchmark scores. NVIDIA's Star Elastic offers post-training model compression at 360× lower cost than pretraining, while Datology focuses on data curation for vision-language models.
not much happened today
gpt-5.5 codex thinking-machines openai anthropic multimodality real-time-interaction visual-proactivity deployment cybersecurity threat-modeling automation continuous-audio-video-text-processing security-models field-engineering enterprise-ai johnschulman2 soumithchintala chillee liliyu_lili rown kimmonismus giffmana swyx eliebakouch gdb sama therundownai lukolejnik matvelloso
Thinking Machines previewed their new native interaction models designed for full-duplex multimodal interaction enabling real-time concurrent listening, speaking, watching, thinking, searching, and reacting, marking a shift beyond turn-based AI. This approach emphasizes continuous audio, video, and text processing, with innovations like visual proactivity and background tool use, implemented using SGLang. Meanwhile, OpenAI announced the OpenAI Deployment Company, a new unit with 150 Forward Deployed Engineers and $4B initial investment to help enterprises deploy frontier models, signaling a move into the deployment layer of the AI economy. OpenAI also launched Daybreak, a security-focused initiative integrating GPT-5.5 and Codex for cyber defense, threat modeling, and automated patching, offering differentiated access tiers including GPT-5.5-Cyber. This contrasts with Anthropic's more restrictive cyber approach, highlighting tensions in AI security strategies.
not much happened today
gpt-5.5 gpt-image-2 gpt-5.5-pro gpt-5.5-instant gpt-realtime-2 gpt-5.5-cyber codex zaya1-74b-preview zaya1-vl-8b qwen3-omni openai zyphra amd deepseek vllm_project model-release model-training mixture-of-experts inference model-optimization sandboxing alignment cybersecurity agent-runtime throughput quantization telemetry real-time-detection reach_vb dhh gdb patience_cave ithilgore cryps1s sama deredleritt3r
OpenAI rapidly expanded the GPT-5.5 family with multiple variants including gpt-image-2, GPT-5.5 Pro, and GPT-5.5 Cyber, receiving positive feedback for efficiency and usability. Codex evolved into a long-running agent runtime with a new /goal mechanism, achieving 61% success on ARC-AGI-3 games after extensive testing. OpenAI also introduced cybersecurity-focused models like GPT-5.5-Cyber targeting enterprise and government sectors. Meanwhile, Zyphra released the open-model ZAYA1-74B-Preview, a 74B parameter mixture-of-experts model trained on AMD hardware under Apache 2.0 license, alongside a vision-language model ZAYA1-VL-8B. Inference infrastructure competition intensified with vLLM updates improving throughput and latency, including support for DeepSeek V4 and enhanced quantization/backends.
GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
gpt-realtime-2 gpt-5.5 codex openai anthropic goodfireai scale-ai voice-models streaming-translation transcription benchmarking context-windows browser-automation cybersecurity interpretability neural-geometry manifolds ai-safety rlhf micahcarroll milesbrundage ryanpgreenblatt
OpenAI released GPT-Realtime-2, a voice model with GPT-5-class reasoning, tool use, interruption handling, and extended context windows up to 128K tokens, achieving top scores on Big Bench Audio and Conversational Dynamics benchmarks. They also launched a Chrome plugin for Codex enabling browser control and multitasking, and introduced GPT-5.5 with Trusted Access for Cyber for secure defensive workflows and red teaming. Anthropic introduced Natural Language Autoencoders for interpreting model activations as human-readable text, aiding interpretability and debugging, while Goodfire proposed a neural geometry research agenda focusing on manifolds as primitives for neural network behavior. Anthropic also announced The Anthropic Institute to advance AI safety and economic resilience research.
not much happened today
gpt-5.5-instant codex openai langchain deepseek personalization voice real-time-api webrtc agent-frameworks coding-agents model-harness benchmarking automation task-automation developer-tools sama michpokrass ericmitchellai kimmonismus reach_vb vtrivedy10 sydneyrunkle masondrxy 0xsero teortaxestex theethanding finbarrtimbers
OpenAI rolled out GPT-5.5 Instant as the new default for ChatGPT and API, enhancing factuality, intelligence, image understanding, and tone with stronger personalization features like saved memories and Gmail integration. OpenAI also shared infrastructure updates on a rebuilt WebRTC stack for voice and real-time API, aiming to reduce latency for speech-paced conversations. Developer tools expanded with an Agents SDK for TypeScript, sandbox agents, and open-source harnesses, improving coding and automation workflows. Discussions highlighted the importance of Model–Harness–Task fit over raw model quality for agent performance, with debates on agent coding UX and benchmarks. Community sentiment praises GPT-5.5 for high-token-budget coding and non-coding tasks.
not much happened today
codex deepseek-v4-pro gemini-3.5-flash gemini-3.1-pro gpt-5.5 claude-opus-4.7 openai claude deepseek gemini qwen model-performance cost-curves agent-products workflow-optimization product-differentiation benchmarking model-optimization gdb dzhng signulll teortaxestex ajambrosino reach_vb theo claudedevs _mohansolo artificialanlys scaling01 yuchenj_uw kimmonismus officiallogank designarena alezander907 giffmana jeremyphoward hamelhusain
AI News for 5/4/2026-5/5/2026 highlights a shift in AI product development emphasizing model + harness + workflow + UI + memory + economics over model quality alone, with notable updates from OpenAI Codex and Claude including new features like Appshots, auto mode, and Sonnet 4.6. DeepSeek made a significant market impact by permanently discounting DeepSeek-V4-Pro by 75%, drastically improving cost/performance ratios compared to Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.7. Meanwhile, Gemini 3.5 Flash showed benchmark improvements but received mixed feedback on practical utility. The competitive landscape continues to tighten with Qwen and other Chinese frontier models.
not much happened today
codex openai microsoft cursor_ai langchain-ai agentic-harness-engineering agent-loop-systems-engineering performance-optimization semantic-indexing prompt-evaluation software-engineering sdk-development model-tuning recursive-self-improvement omarsar0 samhogan kimmonismus reach_vb pierceboggan
OpenAI is expanding Codex from a coding tool to a general work surface with persistent context, tools, integrations, and team rollout, including Codex-only seats with $0 seat fee for Business/Enterprise customers through June. Performance improvements focus on agent-loop systems engineering, achieving up to 40% faster agentic workflows via WebSocket mode on the Responses API. VS Code enhances coding-agent UX with semantic indexing, cross-repo search, chat session insights, and prompt/agent evaluation extensions. Cursor launches a Cursor SDK to enable programmable agent infrastructure for CI/CD, automations, and embedded agents, signaling a shift toward headless agent runtimes and usage-based economics. Research highlights Agentic Harness Engineering improving Terminal-Bench 2 pass@1 from 69.7% to 77.0%, surpassing human-designed baselines and reducing token use by 12%. Related work on HALO shows recursive self-improving agents with significant AppWorld score improvements. LangChain's Deep Agents introduces Harness Profiles for model-specific harness tuning and deployability.
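As a rough aid for reading the Terminal-Bench 2 figures in this summary, the sketch below separates the gain in percentage points from the relative gain. The function names are illustrative only, and the two inputs are the pass@1 values quoted in the summary; nothing else about the benchmark is assumed.

```python
# Pass@1 values quoted in the summary, in percent.
BASELINE_PASS_AT_1 = 69.7
IMPROVED_PASS_AT_1 = 77.0


def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points (after minus before)."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain as a fraction of the starting value."""
    return (after - before) / before


points = absolute_gain(BASELINE_PASS_AT_1, IMPROVED_PASS_AT_1)
ratio = relative_gain(BASELINE_PASS_AT_1, IMPROVED_PASS_AT_1)

# Round for display; floating-point subtraction is not exact.
print(f"absolute gain: {points:.1f} percentage points")
print(f"relative gain: {ratio:.1%}")
```

The two numbers answer different questions: the first is how many more tasks per hundred pass, the second is how large that change is compared with the starting pass rate.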
not much happened today
gpt-5.5 gpt-5.4 opus-4.7 mimo-v2.5-pro mimo-v2.5 kimi-k2.6 codex copilot openai microsoft google amazon github xiaomi openai-devs vllm_project kimi-moonshot model-distribution cloud-computing benchmarking usage-based-billing model-orchestration open-source large-context-models agent-scaling coding model-training fp8 attention-mechanisms multi-agent-systems sama scaling01 kimmonismus ajassy simonw htihle arena gdb hangsiin eliebakouch _luofuli teortaxestex
OpenAI loosens its Azure exclusivity, allowing distribution across Google TPU, AWS Trainium, and Bedrock with commitments through 2032 and revenue share through 2030. GPT-5.5 shows improved benchmarks but is not uniformly dominant, ranking variably across coding, document, math, and vision tasks. GitHub's Copilot shifts to usage-based billing starting June 1, reflecting increased runtime costs. OpenAI open-sourced Symphony, an orchestration layer for issue tracking and Codex agents. Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro, large context models with up to 1M-token context and trillions of tokens trained, emphasizing complex agent and omni-modal capabilities. Kimi K2.6 leads OpenRouter's leaderboard, noted for coding and long-horizon agent capabilities with large-scale sub-agent coordination.
GPT-Image-2
gpt-image-2 qwen3-1.7b codex openai hugging-face figma canva adobe nous-research image-generation multilingual-models model-integration benchmarking agent-infrastructure multi-process-systems fine-tuning scientific-reasoning healthcare-ai hierarchical-decomposition clementdelangue lewtun gdb nickaturley mark_k petergostev tekninum mayank_022
OpenAI launched GPT-Image-2, enhancing image generation with improved text rendering, layout fidelity, editing, multilingual support, and "thinking" capabilities. It supports generating slides, infographics, diagrams, UI mockups, and QR codes, and integrates with tools like Figma, Canva, Adobe Firefly, and Hermes Agent. Benchmarks show GPT-Image-2 leads image generation tasks with a +242 Elo advantage. Hugging Face released ml-intern, an open-source agent automating post-training research loops, improving scientific reasoning and healthcare benchmarks significantly. Hermes is evolving into a richer local/open agent platform with enhanced multi-process orchestration capabilities.
not much happened today
claude-opus-4.7 gemini-3.1-pro gpt-5.4 claude-code codex anthropic openai agentic-ai model-benchmarking adaptive-reasoning cost-efficiency computer-use prototyping-tools code-generation model-performance software-integration claudeai yuchenj_uw kimmonismus skirano therundownai arena artificialanlys victortaelin emollick alexalbert__ theo scaling01 reach_vb kr0der hamelhusain mattrickard matvelloso gdb
Anthropic launched Claude Design, a prototyping tool powered by Claude Opus 4.7, targeting design workflows and competing with Figma and others. Benchmarks show Opus 4.7 leading in coding and text tasks, with improved efficiency and adaptive reasoning, though early user feedback noted some regressions and stability issues. Discussions highlighted its cost-efficiency and agentic capabilities compared to Gemini 3.1 Pro and GPT-5.4. Meanwhile, OpenAI's Codex updates introduced advanced computer-use features enabling fast, agentic control of desktop apps and enterprise software, signaling progress toward practical AGI-like agents.
Anthropic's Claude Opus 4.7
claude-opus-4.7 codex gpt-rosalind anthropic openai cursor replit perplexity-ai microsoft coding agentic-ai tokenization long-context benchmarking image-processing software-engineering computer-use plugin-integration multi-terminal-support ssh-access model-expansion bcherny kimmonismus scaling01 valsai artificialanlys natolambert nrehiew_
Anthropic launched Claude Opus 4.7, its most capable Opus model yet, featuring stronger coding and agentic performance, improved long-context handling, and a new xhigh reasoning tier. Benchmarks show substantial gains, including SWE-bench Pro 64.3%, SWE-bench Verified 87.6%, and TerminalBench 69.4%, with top rankings on Vals Index and GDPval-AA. Technical changes include a new tokenizer and an increase in image input resolution to 3.75MP. Some long-context benchmarks showed mixed results, with a shift in focus from MRCR to Graphwalks. Adoption was rapid across tools like Cursor, VS Code, Replit Agent, and Perplexity. Meanwhile, OpenAI expanded Codex into a broader computer agent with Mac computer use, in-app browser, image generation/editing, 90+ plugins, multi-terminal support, SSH remote devbox access, and richer file previews. A new vertical life-sciences model, GPT-Rosalind, was also introduced.
not much happened today
codex openai github cursor langchain nous-research agent-harnesses multi-agent-systems software-engineering tooling orchestration observability remote-control security-hardening user-experience open-source community-engagement andrew_ng steve_yegge gabrielchua giffmana rhys_sullivan teknium shaun_furman dabit3 robinebers zainanzhou nicoalbanese10 bromann elliothyun tiagonbotelho pierceboggan sydneyrunkle
Harness engineering is emerging as a key discipline in AI agent development, emphasizing components like filesystems, memory, and retries beyond just models. OpenAI's Codex is expanding agentic coding workflows beyond software engineering, including codebase understanding and bug triage. Tooling trends show convergence on multi-agent orchestration, observability, and remote control, with GitHub Copilot, Cursor, and LangChain advancing these capabilities. The Hermes Agent v0.9.0 release introduces a local web dashboard and enhanced security, gaining community traction over OpenClaw for UX and efficiency. The open agent ecosystem is growing with projects like Open Agents and DeepAgent providing modular stacks and runtimes.
not much happened today
claude-code codex hermes-agent anthropic openai nous-research huggingface closed-loop-verification cross-agent-composition agent-ecosystem multi-agent-systems runtime-orchestration tooling fine-tuning remote-monitoring privacy sandboxing omarsar0 dkundel reach_vb theo jayfarei kaiostephens icarushermes winglian clementdelangue fchollet
Anthropic introduced computer use inside Claude Code for closed-loop verification in a research preview for Pro/Max users, enhancing reliable app iteration. OpenAI released a Codex plugin for Claude Code, enabling cross-agent composition and signaling a shift toward composable coding harnesses. OpenAI also noted that late-night Codex tasks run longer, supporting background agent delegation. Nous Research's Hermes Agent saw rapid adoption due to better compaction, adaptability, and multi-agent profiles, evolving toward an agent OS abstraction. An ecosystem around Hermes includes tools for trace analytics, fine-tuning, and remote control, with debates on open-source versus proprietary agent infrastructure. Key themes include tooling, prompt/runtime orchestration, and review loops as critical factors beyond model capabilities.
not much happened today
gpt-5.4-mini gpt-5.4-nano gpt-5.4 codex openai langchain stripe ramp coinbase nous-research hermes-agent coding multimodality subagents context-window model-performance pricing behavior-tuning secure-execution plugin-architecture attention-mechanisms agent-infrastructure hwchase17 michpokrass
OpenAI released GPT-5.4 mini and GPT-5.4 nano, their most capable small models optimized for coding, multimodal understanding, and subagents, featuring a 400k context window and over 2x speed compared to GPT-5 mini. The mini model approaches larger GPT-5.4 performance while using only 30% of Codex quota, becoming the default for many coding workflows. Pricing concerns and truthfulness tradeoffs were noted, with mixed third-party evaluations on reasoning and resistance to false premises. OpenAI also addressed behavior tuning issues in a recent update. Meanwhile, agent infrastructure is evolving with secure code execution and orchestration tools like LangChain's LangSmith Sandboxes and Open SWE, inspired by internal systems at Stripe, Ramp, and Coinbase. Subagents and secure execution are now key product features, with releases like Hermes Agent v0.3.0 showcasing plugin architectures, live Chrome control, and voice mode. Research on attention mechanisms, including Attention Residuals and vertical attention, is gaining traction.
not much happened today
kimi-linear-48b codex gpt-5.4 claude-code moonshot openai assemblyai langchain attention-mechanisms model-architecture inference-speed agent-feedback agent-skills multi-agent-systems knowledge-transfer cli-tools coding-agents model-deployment kimi_moonshot elonmusk yuchenj_uw nathancgy4 eliebakouch tokenbender behrouz_ali cloneofsimo fidjissimo sama gdb andrewyng itsafiz simplifyinai
Moonshot's Attention Residuals paper introduced an input-dependent attention mechanism over prior layers with a 1.25x compute advantage and less than 2% inference latency overhead, validated on Kimi Linear 48B total / 3B active. The paper sparked debate on novelty versus prior art like DeepCrossAttention and Google's earlier work, highlighting tensions in idea novelty, citation quality, and frontier-scale validation. OpenAI's Codex showed strong momentum with over 2M weekly active users, nearly 4x growth YTD, and GPT-5.4 hitting 5T tokens/day and a $1B annualized run-rate. Codex added subagents supporting multi-agent coding workflows. Infrastructure for coding agents matured with tools like Context Hub / chub supporting agent feedback loops, AssemblyAI's skill for Claude Code and Codex, and automated skill extraction from GitHub repos yielding 40% knowledge-transfer gains. LangChain launched LangGraph CLI and open-sourced Deep Agents, recreating top coding agent workflows with planning, filesystem ops, shell access, and sub-agents.
Autoresearch: Sparks of Recursive Self Improvement
claude-3 codex anthropic openai cognition automated-machine-learning coding-agents bug-fixing model-autonomy multi-agent-systems pr-review systems-engineering model-verification karpathy yi_tay jakub_pachocki
RSI covers AI developments from 3/5/2026 to 3/9/2026, highlighting the emergence of LLMs autonomously training smaller LLMs, marking a significant "AutoML moment" in AI progress. Karpathy and Yi Tay discuss "vibe training," where AI models fix bugs and improve code autonomously, suggesting models may soon surpass human debugging efficiency. The report anticipates Jakub Pachocki's Automated AI Research Intern system by September 2026 to accelerate human researchers. On AI Twitter, the focus is on coding agents shifting bottlenecks from implementation to review and verification, with Anthropic's Claude Code Review improving PR review effectiveness significantly, and tools like OpenAI Codex Review and Cognition's Devin Review enhancing code review workflows. Harness engineering is evolving into systems engineering, emphasizing decoupling agent storage from compute for collaborative agent teams.
OpenAI closes $110B raise from Amazon, NVIDIA, SoftBank in largest startup fundraise in history @ $840B post-money
codex chatgpt openai softbank nvidia amazon microsoft model-scaling model-metrics investment cloud-computing infrastructure training-capacity user-growth partnerships sama
OpenAI has closed a major funding round totaling $110 billion at a $730 billion pre-money valuation, with investments from SoftBank ($30B), NVIDIA ($30B), and Amazon ($50B). Key user metrics include 1.6 million weekly Codex users, over 9 million paying business users of ChatGPT, and more than 900 million weekly active ChatGPT users with 50 million consumer subscribers. The partnership with Amazon includes exclusive cloud services and 2 gigawatts of Trainium capacity. Microsoft maintains a reduced partnership with stateless APIs. This funding round is one of the largest in history, highlighting OpenAI's dominant position in AI adoption and infrastructure.
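The figures in this summary can be cross-checked with simple arithmetic: the three investments should sum to the round size, and pre-money plus the round should equal the post-money valuation in the headline. The sketch below does that check. The implied-stake calculation is an added simplification (amount divided by post-money), not something the summary states, and it ignores share classes, tranches, and any other deal terms.

```python
# All figures in billions of USD, as quoted in the summary.
PRE_MONEY = 730
INVESTMENTS = {"SoftBank": 30, "NVIDIA": 30, "Amazon": 50}


def post_money(pre_money: float, raised: float) -> float:
    """Post-money valuation is pre-money plus new capital raised."""
    return pre_money + raised


raised = sum(INVESTMENTS.values())
valuation = post_money(PRE_MONEY, raised)

# Assumption for illustration: stake = investment / post-money valuation.
stakes = {name: amount / valuation for name, amount in INVESTMENTS.items()}

print(f"total raised: ${raised}B")          # 30 + 30 + 50 = 110
print(f"post-money valuation: ${valuation}B")  # 730 + 110 = 840
for name, stake in stakes.items():
    print(f"{name}: {stake:.2%} implied stake")
```

The totals match the $110B round and the $840B post-money valuation given in the issue title.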
Anthropic accuses DeepSeek, Moonshot, and MiniMax of "industrial-scale distillation attacks".
claude claude-3 codex claude-code anthropic deepseek moonshot-ai minimax openai ollama api-abuse-resistance model-security agentic-engineering coding-agents model-distillation workflow-automation sandboxing realtime-communication simon_willison
Anthropic alleges industrial-scale distillation attacks on its Claude model by DeepSeek, Moonshot AI, and MiniMax, involving ~24,000 fraudulent accounts and >16M Claude exchanges to extract capabilities, raising concerns about competitive risks and safety. The community debates the difference between scraping and API-output extraction, highlighting a shift toward protecting models via API abuse resistance techniques. Meanwhile, coding agents like Codex and Claude Code see real adoption and failures, with emerging best practices in "agentic engineering" led by Simon Willison. The OpenClaw ecosystem expands with alternatives like NanoClaw and integrations such as Ollama 0.17 simplifying open model usage.
ElevenLabs $500m Series D at $11B, Cerebras $1B Series H at $23B, Vibe Coding -> Agentic Engineering
gemini-3 claude codex google openai github microsoft deepmind agent-frameworks model-deployment benchmarking cost-optimization software-development async-processing gpu-acceleration coding-agents user-adoption game-theory workflow-integration sama sundarpichai reach_vb
Google's Gemini 3 is being integrated widely, including a new Chrome side panel and Nano Banana UX features, with rapid adoption and a 78% reduction in unit serving costs. The Gemini app reached 750M+ MAU in Q4 2025, nearing ChatGPT's user base. Google is also benchmarking AI "soft skills" through games like Poker and Chess in the Kaggle Game Arena. Meanwhile, coding agents are converging in IDEs: VS Code launched Agent Sessions supporting Claude and Codex agents with features like parallel subagents and integrated browsers. GitHub Copilot now allows agent choice between Claude and OpenAI Codex for async backlog clearing. OpenAI reports 1M+ active users for Codex with expanded integration surfaces, though some users request better GPU support. The coding-agent ecosystem is professionalizing with community platforms like OpenClaw and tooling such as ClawHub and CLI updates. "Gemini 3 adoption faster than any other model" and "VS Code as home for coding agents" highlight major industry shifts.
OpenAI Codex App: death of the VSCode fork, multitasking worktrees, Skills Automations
codex openai agent-based-systems parallel-processing software-testing developer-workflows automation product-feedback-loop neurosymbolic-ai benchmarking sama reach_vb gdb skirano embirico ajambrosino thsottiaux nbaschez yuchenj_uw badlogicgames random_walker
OpenAI launched the Codex app on macOS as a dedicated agent-native command center for coding, featuring multiple agents in parallel, built-in worktrees for conflict isolation, skills for reusable bundles, and scheduled automations. The app emphasizes developer workflows like Plan mode for upfront task decomposition and is gaining positive adoption signals from insiders including @sama. There is movement towards ecosystem standardization of skills folders, signaling early conventions in agent tooling. Codex also exemplifies a "self-improving" product feedback loop combining humans and agents. In coding-agent practice, recommended patterns include a "test-first" approach to bug fixes, the "conductor" model where one developer manages 5-10 agents in parallel, and a neurosymbolic framing explaining why coding agents succeed due to software's verifiability and symbolic tooling. Skepticism remains about productivity studies that do not reflect agentic workflows.
not much happened today
claude-3 codex gemini gpt-5.2-pro anthropic openai google sakana-ai cursor baseten epoch-ai-research deepmind benchmarking reasoning continual-learning reinforcement-learning model-performance agentic-ai security model-training sama fchollet shane_legg demishassabis
Anthropic launches "Claude in Excel Pro" with enhanced features. OpenAI reveals upcoming Codex agent loop and cybersecurity measures. Google boosts Gemini App quotas and partners with Sakana AI for advanced AI Scientist projects in Japan. Cursor introduces Agent Skills for dynamic context focus. GPT-5.2 Pro achieves 31% on FrontierMath Tier 4, showing significant benchmark progress. Baseten raises $300M at a $5B valuation targeting high-performance inference. Discussions highlight math benchmarks as indicators of AI capability, uneven AGI progress, and the importance of reasoning and continual learning as future frontiers. Notable figures include Sam Altman, François Chollet, Shane Legg, and Demis Hassabis.
ChatGPT starts testing ads on free tier + new $8/mo Go plan in the US
chatgpt-go codex openai ollama ads monetization memory agent-orchestration human-in-the-loop cli-tools context-length workflow-optimization sama sam_altman fidjissimo scaling01 tomwarren embirico adamdotdev ollama thsottiaux lateinteraction dbreunig
OpenAI announced the ChatGPT Go tier at $8/month with ads testing in the US free tier, emphasizing that ads will not influence responses and will be clearly labeled. The update includes memory improvements and a "very fast Codex" feature teased by Sam Altman. The Codex CLI ecosystem now supports open-weight models with improved context length. Discussions highlight the importance of human-in-the-loop for reliability in agent orchestration and file interface improvements over traditional retrieval-augmented generation.
not much happened today
kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
Kimi-K2 Reasoner has been integrated into vLLM and will soon be supported by SGLang, featuring a massive 1.2 trillion parameter MoE configuration. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond, outperforming GPT-4.5 at 71.4%, though this is unverified.
Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, drastically reducing context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding model performance in realistic multi-round tasks and occupation-tagged leaderboards.
Gemini 2.5 Computer Use preview beats Sonnet 4.5 and OAI CUA
gemini-2.5 gpt-5-pro glm-4.6 codex google-deepmind openai microsoft anthropic zhipu-ai llamaindex mongodb agent-frameworks program-synthesis security multi-agent-systems computer-use-models open-source moe developer-tools workflow-automation api vision reasoning swyx demishassabis philschmid assaf_elovic hwchase17 jerryjliu0 skirano fabianstelzer blackhc andrewyng
Google DeepMind released a new Gemini 2.5 Computer Use model for browser and Android UI control, evaluated by Browserbase. OpenAI showcased GPT-5 Pro, new developer tools including Codex with Slack integration, and agent-building SDKs at Dev Day. Google DeepMind's CodeMender automates security patching for large codebases. Microsoft introduced an open-source Agent Framework for multi-agent enterprise systems. AI community discussions highlight agent orchestration, program synthesis, and UI control advancements. Zhipu's GLM-4.6 update features a large Mixture-of-Experts model with 355B parameters.
OpenAI Realtime API GA and new `gpt-realtime` model, 20% cheaper than 4o
gpt-realtime gpt-4o-realtime grok-code-fast-1 codex mai-1-preview mai-voice-1 gemini-cli openai xai microsoft google speech-to-speech instruction-following function-calling telephony webrtc voice-agents multilingual-switching voice-control benchmarks coding-models ide-integration developer-tools model-updates swyx juberti omarsar0 reach_vb pbbakkum skcd42 mohitreddy13 cline kevinweil gdb sama _philschmid
OpenAI launched the gpt-realtime model and Realtime API to GA, featuring advanced speech-to-speech capabilities, new voices (Cedar, Marin), image input, SIP telephony, and a ~20% price cut. Benchmarks show improvements over gpt-4o-realtime on BigBench and ComplexFuncBench. xAI introduced Grok Code Fast 1, a speed-optimized coding model integrated with popular IDEs, while OpenAI Codex received major upgrades for local and cloud development workflows. Google's Gemini CLI improved multi-editor support, and new models like Microsoft MAI-1-preview and MAI-Voice-1 were announced. "The new all-in-one WebRTC API removes the ephemeral token step and supports video on the same connection," highlighting enhanced developer tooling.
OpenAI updates Codex, VSCode Extension that can sync tasks with Codex Cloud
codex stepwiser gemini-2.5-flash nemotron-cc-math jet-nemotron openai facebook-ai-fair google-deepmind nvidia process-reward-modeling reinforcement-learning chain-of-thought spatial-reasoning multi-image-fusion developer-tools code-review ide-extension cli cloud-computing model-efficiency jaseweston tesatory benjamindekr tokumin fabianstelzer officiallogank
OpenAI Codex has launched a new IDE Extension integrating with VS Code and Cursor, enabling seamless local and cloud task handoff, sign-in via ChatGPT plans, upgraded CLI, and GitHub code review automation. Facebook AI researchers introduced StepWiser, a process-level reward model improving reasoning and training by chunk-by-chunk evaluation, achieving SOTA on ProcessBench. Google DeepMind's Gemini 2.5 Flash Image model showcases advanced spatial reasoning, multi-image fusion, and developer tools including a browser extension for image remixing. NVIDIA revealed efficiency data on Nemotron-CC-Math (133B) and Jet-Nemotron models.
not much happened today
seedance-1.0 codex claude-code kling-2.1 veo-3 bytedance morph-labs huggingface deeplearning.ai figure-ai langchain sakana-ai video-generation autoformalization ai-assisted-coding api-design context-engineering reinforcement-learning ai-evals hypernetworks model-fine-tuning foundation-models andrew_ng hwchase17 adcock_brett clementdelangue akhaliq jxmnop hamelhusain sh_reya
ByteDance showcased a state-of-the-art video generation model called Seedance 1.0 without releasing it, while Morph Labs announced Trinity, an autoformalization system for Lean. Hugging Face Transformers deprecated TensorFlow/JAX support. Andrew Ng of DeepLearning.AI highlighted the rise of the GenAI Application Engineer role, emphasizing skills in AI building blocks and AI-assisted coding tools like Codex and Claude Code. Engineering teams are increasingly testing API designs against LLMs for usability. Figure AI's CEO stressed speed as a key competitive advantage, and LangChain introduced the concept of Context Engineering for AI agents. Reinforcement learning on LLMs shows transformative potential, and the community values AI evals and data work. Sakana AI released Text-to-LoRA, a hypernetwork method for generating task-specific LoRA adapters from natural language, enabling efficient model customization. The video generation race heats up with ByteDance's Seed-based model praised for quality, challenging American labs, alongside models like Kling 2.1 and Veo 3.
not much happened today
codex claude-4-opus claude-4-sonnet gemini-2.5-pro gemini-2.5 qwen-2.5-vl qwen-3 playdiffusion openai anthropic google perplexity-ai bing playai suno hugging-face langchain-ai qwen mlx assemblyai llamacloud fine-tuning model-benchmarking text-to-video agentic-ai retrieval-augmented-generation open-source-models speech-editing audio-processing text-to-speech ultra-low-latency multimodality public-notebooks sama gdb kevinweil lmarena_ai epochairesearch reach_vb wightmanr deeplearningai mervenoyann awnihannun jordirib1 aravsrinivas omarsar0 lioronai jerryjliu0 nerdai tonywu_71 _akhaliq clementdelangue _mfelfel
OpenAI rolled out Codex to ChatGPT Plus users with internet access and fine-grained controls, and improved memory features for free users. Anthropic's Claude 4 Opus and Sonnet models lead coding benchmarks, while Google's Gemini 2.5 Pro and Flash models gain recognition with new audio capabilities. Qwen 2.5-VL and Qwen 3 quantizations are noted for versatility and support. Bing Video Creator launched globally, enabling text-to-video generation, and Perplexity Labs sees increased demand for travel search. New agentic AI tools and RAG innovations include LlamaCloud and FedRAG. Open-source releases include Holo-1 for web navigation and PlayAI's PlayDiffusion for speech editing. Audio and multimodal advances feature Suno's music editing upgrades, Google's native TTS in 24+ languages, and Universal Streaming's ultra-low latency speech-to-text. Google NotebookLM now supports public notebooks. Commentary noted that "Codex's internet access brings tradeoffs, with explicit warnings about risk" and that "Gemini 2.5 Pro is cited as a daily driver by users".
not much happened today
chatgpt o3 o4 bagel-7b medgemma acereason-nemotron-14b codex gemini openai bytedance google nvidia sakana-ai-labs deep-learning-ai gemini agenticseek anthropic agentic-systems multimodality reasoning code-generation prompt-engineering privacy ethical-ai emergence synthetic-data speech-instruction-tuning low-resource-languages humor scaling01 mervenoyann sakananailabs _philschmid omarsar0 teortaxestex andrewlampinen sedielem cis_female
OpenAI plans to evolve ChatGPT into a super-assistant by 2025 with models like o3 and o4 enabling agentic tasks and supporting a billion users. Recent multimodal and reasoning model releases include ByteDance's BAGEL-7B, Google's MedGemma, and NVIDIA's ACEReason-Nemotron-14B. The Sudoku-Bench Leaderboard highlights ongoing challenges in AI creative reasoning. In software development, OpenAI's Codex aids code generation and debugging, while Gemini's Context URL tool enhances prompt context. AgenticSeek offers a local, privacy-focused alternative for autonomous agents. Ethical concerns are raised about AGI development priorities and Anthropic's alignment with human values. Technical discussions emphasize emergence in AI and training challenges, with humor addressing misconceptions about Gemini 3.0 and async programming in C. A novel synthetic speech training method enables instruction tuning of LLMs without real speech data, advancing low-resource language support.