All tags
Person: "ggerganov"
not much happened today
gemma-4 google huggingface intel ollama unsloth reasoning agentic-workflows multimodality on-device-ai local-inference model-benchmarking moe vision audio-processing memory-optimization open-source model-performance fchollet demishassabis clementdelangue quixiai googlegemma ggerganov osanseviero maartengr basecampbernie prince_canuma measure_plan kimmonismus anemll arena stochasticchasm reach_vb zeneca everlier erick_lindberg_ anomalistg
Google launched Gemma 4 under an Apache 2.0 license, a significant open-model release focused on reasoning, agentic workflows, multimodality, and on-device use. It reportedly outperforms models 10x its size and shipped with immediate ecosystem support, including vLLM, llama.cpp, Ollama, Intel hardware, Unsloth, and Hugging Face Inference Endpoints. Local inference benchmarks showed strong performance on consumer hardware such as the RTX 4090 and Mac mini M4, and early benchmarking praised its efficiency and ranking gains over previous versions. Meanwhile, Hermes Agent emerged as a popular open-source agent harness, noted for stability and capability on long tasks, with users switching to it from OpenClaw.
Gemma 4
gemma-4 gemma-4-31b gemma-4-26b-a4b google-deepmind multimodality long-context model-architecture moe local-inference model-optimization function-calling quantization jeffdean _philschmid rasbt ggerganov clattner_llvm julien_c clementdelangue
Google DeepMind released Gemma 4, a family of open-weight, multimodal models with long-context support up to 256K tokens under an Apache 2.0 license, a major shift in both capability and licensing. The lineup comprises a 31B dense model, a 26B MoE (A4B), and two edge models (E4B, E2B) optimized for local and edge deployment, all with native multimodal support (text, vision, audio). Early benchmarks place Gemma-4-31B at #3 among open models, with strong scientific reasoning (85.7% on GPQA Diamond). Day-0 ecosystem support includes llama.cpp, Ollama, vLLM, and LM Studio, with notable local inference performance on hardware like the M2 Ultra and RTX 4090. The architecture combines hybrid attention with MoE layers, diverging from the standard transformer. Community and developer engagement is high, with rapid adoption and tooling integration.
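The "A4B" label suggests only ~4B parameters are active per token. Gemma's actual routing code isn't described here; the sketch below is a generic top-k MoE router in numpy (toy linear experts, hypothetical shapes) illustrating the idea that each token activates only a small subset of experts.

```python
import numpy as np

def topk_moe(x, w_router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: (tokens, d) activations; w_router: (d, n_experts) router weights;
    experts: list of (d, d) matrices standing in for expert FFNs.
    """
    logits = x @ w_router                      # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    sel = np.take_along_axis(logits, top, axis=-1)
    # softmax over only the selected experts' logits
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                # only k experts run per token
        for j in range(k):
            e = top[t, j]
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
x = rng.normal(size=(3, d))
y = topk_moe(x, rng.normal(size=(d, n_exp)),
             [rng.normal(size=(d, d)) for _ in range(n_exp)], k=2)
print(y.shape)  # (3, 8)
```

The point of the design: total parameter count (all experts) sets capacity, while per-token compute scales only with the k active experts, which is why a 26B MoE can run with the cost profile of a much smaller dense model.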
not much happened today
fastvlm mobileclip2 grok-code-fast-1 gpt-5 qwen-3-coder-30b-a3b apple hugging-face x-ai openai groq run-llama lmstudio vision model-quantization code-generation cli-workflows retrieval-augmentation embedding-models local-ai multimodality reach_vb xenovacom pcuenq awnihannun cline veggie_eric nickbaumann_ gdb benankdev loganmarkewich tom_doerr fastmcp ggerganov orionweller antoine_chaffin
Apple released three real-time vision-language models (FastVLM, MobileCLIP2) on Hugging Face with significant speed and size improvements, supporting WebGPU and Core ML. Apple's MLX framework now supports the MXFP4 format, competing with NVFP4 for FP4 quantization. xAI launched grok-code-fast-1, which outperforms Claude on code edits, while OpenAI integrated GPT-5 into Xcode 26 and shipped a new Responses API on Groq hardware. CLI-first agent workflows advanced with tools like SemTools and an MLX local runner for Apple Silicon, and llama.vim now recommends Qwen 3 Coder 30B A3B. Retrieval research highlights the limitations of single-vector embeddings, favoring ColBERT-style late interaction.
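The single-vector vs. late-interaction distinction is easy to make concrete. Below is a minimal numpy sketch (random embeddings standing in for a real encoder) of ColBERT-style MaxSim scoring next to mean-pooled cosine similarity; it illustrates the mechanism only, not any particular model's implementation.

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding,
    take its max cosine similarity over all document token embeddings,
    then sum those maxima. Token-level matches are preserved."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=-1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=-1, keepdims=True)
    return (q @ d.T).max(axis=1).sum()

def single_vector(query_vecs, doc_vecs):
    """Single-vector retrieval: mean-pool each text to one embedding,
    then one cosine similarity -- fine-grained matches are averaged away."""
    q, d = query_vecs.mean(0), doc_vecs.mean(0)
    return (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))

rng = np.random.default_rng(1)
query = rng.normal(size=(4, 16))   # 4 query-token embeddings
doc = rng.normal(size=(32, 16))    # 32 document-token embeddings
print(maxsim(query, doc), single_vector(query, doc))
```

The trade-off the research points at: late interaction keeps per-token evidence (a single rare term can dominate its max), at the cost of storing one vector per token instead of one per document.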
Qwen with Questions: 32B open weights reasoning model nears o1 in GPQA/AIME/Math500
deepseek-r1 qwq gpt-4o claude-3.5-sonnet qwen-2.5 llama-cpp deepseek sambanova hugging-face dair-ai model-releases benchmarking fine-tuning sequential-search inference model-deployment agentic-rag external-tools multi-modal-models justin-lin clementdelangue ggerganov vikparuchuri
DeepSeek R1 leads the race for "open o1" models but has yet to release weights, while Justin Lin released QwQ, a 32B open-weight model that outperforms GPT-4o and Claude 3.5 Sonnet on benchmarks. QwQ appears to be a fine-tune of Qwen 2.5 that emphasizes sequential search and reflection for complex problem-solving. SambaNova is promoting its RDUs as superior to GPUs for inference, highlighting the industry's shift from training to inference. On Twitter, Hugging Face announced CPU deployment for llama.cpp instances, Marker v1 was released as a faster and more accurate document conversion tool, and Agentic RAG work is focusing on integrating external tools and advanced LLM chains for better response accuracy. The open-source AI community sees growing momentum, with models like Flux gaining popularity and a broader shift toward multi-modal AI spanning image, video, audio, and biology.