Company: "geminiapp"
Gemini 3.1 Pro: 2x 3.0 on ARC-AGI 2
gemini-3.1-pro gemini-3-deep-think google google-deepmind geminiapp reasoning benchmarking agentic-ai cost-efficiency hallucination code-generation model-release developer-tools sundarpichai demishassabis jeffdean koraykv noamshazeer joshwoodward artificialanlys arena oriolvinyalsml scaling01
Google released Gemini 3.1 Pro as a developer preview across the Gemini app, NotebookLM, Gemini API / AI Studio, and Vertex AI. The headline is a significant reasoning improvement, with ARC-AGI-2 at 77.1%, alongside strong coding and agentic-tool results such as SWE-Bench Verified at 80.6%. Independent evaluators such as Artificial Analysis and Arena confirmed top-tier performance and cost efficiency. Community reactions mixed excitement about practical gains with skepticism about benchmark targeting and concerns over rollout inconsistencies. Google frames the release as the same core intelligence that powers Gemini 3 Deep Think, scaled for practical use, with notable mentions from leaders like @sundarpichai, @demishassabis, and @JeffDean.
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
gemini-3-deep-think-v2 arc-agi-2 google-deepmind google geminiapp arcprize benchmarking reasoning test-time-adaptation fluid-intelligence scientific-computing engineering-workflows 3d-modeling cost-analysis demishassabis sundarpichai fchollet jeffdean oriolvinyalsml tulseedoshi
Google DeepMind is rolling out the upgraded Gemini 3 Deep Think V2 reasoning mode to Google AI Ultra subscribers and opening early access through Vertex AI and the Gemini API for select users. Key benchmark results include ARC-AGI-2 at 84.6%, Humanity’s Last Exam (HLE) at 48.4% without tools, and a Codeforces Elo of 3455, alongside Olympiad-level performance in physics and chemistry. The mode emphasizes practical scientific and engineering applications such as error detection in math papers, physical system modeling, semiconductor optimization, and a sketch-to-CAD/STL pipeline for 3D printing. ARC benchmark creator François Chollet highlights the benchmark's role in advancing test-time adaptation and fluid intelligence, and projects human-AI parity around 2030. The rollout is framed as a productized, compute-heavy test-time mode rather than a lab demo, with costs for ARC tasks disclosed.