not much happened today
arc-agi-3 claude-code anthropic langchain arcprize primeintellect agentic-reasoning interactive-environments benchmarking efficiency-metrics zero-preparation-generalization agent-infrastructure trainable-agents classifier-approval fchollet mikeknoop scaling01 _rockt mark_k andykonwinski bradenjhancock jeremyphoward togelius bracesproul hwchase17 caspar_br _catwu
The ARC-AGI-3 benchmark, introduced by @arcprize and François Chollet, resets the frontier for general agentic reasoning: humans solve 100% of tasks versus under 1% for current models, with the design emphasizing zero-preparation generalization and human-like learning efficiency. The scoring protocol sparked debate over its harsh efficiency-based metric compared to prior ARC versions and other benchmarks like NetHack. The community acknowledges that the benchmark exposes weaknesses of current LLM agents in interactive, sparse-feedback environments. Concurrently, agent infrastructure is advancing: LangChain launched Fleet shareable skills for reusable domain knowledge, and Anthropic revealed a Claude Code auto mode that uses classifier-mediated approval to balance autonomy with manual confirmation. Browser and coding agents are also evolving into trainable systems rather than prompt wrappers, exemplified by a collaboration between BrowserBase and Prime Intellect.
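The "interactive, sparse-feedback" setting and the efficiency-weighted scoring debated above can be illustrated with a toy sketch. This is not the actual ARC-AGI-3 API or scoring protocol; the environment, policy, and `efficiency_score` formula below are hypothetical stand-ins, assuming a metric that rewards solving a task in as few actions as a human baseline:

```python
class GridEnv:
    """Toy interactive environment with sparse feedback:
    the agent only learns it has succeeded when the episode ends."""

    def __init__(self, target: int, max_steps: int = 50):
        self.target = target
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def step(self, action: int):
        """action is -1 or +1; returns (observation, done)."""
        self.pos += action
        self.steps += 1
        done = self.pos == self.target or self.steps >= self.max_steps
        return self.pos, done


def run_agent(env: GridEnv, policy) -> tuple[bool, int]:
    """Roll out a policy until the episode ends; return (solved, actions used)."""
    obs, done = env.pos, False
    while not done:
        obs, done = env.step(policy(obs))
    return env.pos == env.target, env.steps


def efficiency_score(solved: bool, steps: int, human_steps: int) -> float:
    """Hypothetical efficiency metric: 0 if unsolved, otherwise the ratio
    of the human action baseline to the agent's action count, capped at 1."""
    if not solved:
        return 0.0
    return min(1.0, human_steps / steps)


# A trivial policy that walks straight toward the target.
solved, steps = run_agent(GridEnv(target=7), lambda obs: 1 if obs < 7 else -1)
print(efficiency_score(solved, steps, human_steps=7))  # 1.0: matches the baseline
```

The point of such a metric is that merely solving a task is not enough: an agent that flails for 40 actions on a 7-action task scores far below a human, which is why efficiency-based scoring reads as "harsh" relative to pass/fail benchmarks.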
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
gemini-3-deep-think-v2 arc-agi-2 google-deepmind google geminiapp arcprize benchmarking reasoning test-time-adaptation fluid-intelligence scientific-computing engineering-workflows 3d-modeling cost-analysis demishassabis sundarpichai fchollet jeffdean oriolvinyalsml tulseedoshi
Google DeepMind is rolling out the upgraded Gemini 3 Deep Think V2 reasoning mode to Google AI Ultra subscribers and opening early access via the Vertex AI / Gemini API for select users. Key benchmark results include 84.6% on ARC-AGI-2, 48.4% on Humanity's Last Exam (HLE) without tools, and a Codeforces Elo of 3455, alongside Olympiad-level performance in physics and chemistry. The mode emphasizes practical scientific and engineering applications such as error detection in math papers, physical system modeling, semiconductor optimization, and a sketch-to-CAD/STL pipeline for 3D printing. ARC benchmark creator François Chollet highlights the benchmark's role in advancing test-time adaptation and fluid intelligence, projecting human-AI parity around 2030. The rollout is framed as a productized, compute-heavy test-time mode rather than a lab demo, with cost disclosures provided for ARC tasks.