Gemma 4
Gemma 4 belongs to Google's Gemma family of open models. The newsletter references new quantization-aware training (QAT) checkpoints that preserve performance while reducing memory use.
Key Highlights
- Gemma 4 is Google DeepMind’s open model family spanning small to very large multimodal models with up to 100K-token context windows.
- Newsletter coverage focused heavily on deployment efficiency, including local mobile use, RTX 4090 viability, multi-token prediction, and QAT checkpoints.
- New QAT checkpoints were reported to preserve original performance while reducing memory use by about 4×, with a mobile format shrinking Gemma 4 E2B to 1 GB.
- Gemma 4 showed strong ecosystem momentum through Vertex AI, GitHub, Hugging Face, and prominent community contributors like Philipp Schmid and Sebastian Raschka.
- For AI PMs, Gemma 4 is especially relevant when evaluating cost, latency, privacy, and offline/on-device product experiences.
Overview
Gemma 4 is Google DeepMind’s family of open models, introduced as a lineup spanning roughly 7B to 196B parameters, with up to 100K-token context windows and multimodal capabilities. It is positioned as a developer-accessible model family, with weights, examples, and tutorials distributed through channels like Vertex AI, GitHub, and Hugging Face. Across newsletter coverage, Gemma 4 appears both as a flagship open-model release and as a rapidly evolving ecosystem of optimized variants, edge deployments, and community-led improvements.
For AI Product Managers, Gemma 4 matters because it sits at the intersection of open-model flexibility, deployability, and cost-performance optimization. The coverage emphasizes practical advantages: local and edge inference, Apache 2.0 availability for some variants, multi-token prediction for faster generation, and new QAT checkpoints that preserve quality while dramatically reducing memory requirements. That makes Gemma 4 relevant not just as a model family, but as a platform choice for building products where latency, cost, privacy, offline operation, and hardware constraints all shape roadmap decisions.
Key Developments
- 2026-04-07 — Simon Willison highlighted Google AI Edge Gallery, Google’s iPhone app for running Gemma 4 models locally. The app supported Gemma 4 E2B and E4B, with E2B as a 2.54 GB download, plus image Q&A, short audio transcription, and interactive demos, showing early real-world mobile deployment.
- 2026-04-09 — Sundar Pichai shared that Gemma 4 surpassed 10 million downloads in its first week, contributing to more than 500 million total Gemma downloads, signaling rapid developer adoption.
- 2026-04-09 — Fireship described Gemma 4 as an Apache 2.0-licensed open-source model that could run locally on a single RTX 4090 using around 20 GB for a 31B model, attributing efficiency to TurboQuant and per-layer embeddings. The same coverage noted an edge-capable variant suitable for phones or Raspberry Pi-class devices.
- 2026-04-10 — Google DeepMind officially launched Gemma 4 as a family of 7B–196B foundation models with up to 100K-token contexts and multimodal capabilities. Developers were directed to open weights, code samples, and tutorials via Vertex AI and GitHub; Jeff Dean also amplified the launch.
- 2026-04-11 — Google AI showcased builder projects created with the newly open-sourced Gemma 4 models, including meeting summarizers, code-generation assistants, and multilingual chatbots, reinforcing product-building momentum around the model family.
- 2026-04-12 — Sebastian Raschka shared a from-scratch Jupyter Notebook implementation of Gemma 4 E2B on GitHub, illustrating how per-layer embeddings are constructed and making the architecture more legible for practitioners.
- 2026-05-06 — Philipp Schmid released Multi-Token Prediction for Gemma 4, reporting roughly 3× faster inference with no quality loss. The update was made available for E2B and E4B variants under Apache 2.0.
- 2026-05-17 — Sebastian Raschka included Gemma 4 in a visual overview of recent LLM architectures, highlighting long-context efficiency techniques and placing Gemma 4 in the broader evolution of model design alongside systems like DeepSeek V4.
- 2026-06-09 — Philipp Schmid released new QAT Gemma 4 checkpoints that reportedly matched original performance while using about 4× less memory, plus a mobile quantization format that reduced Gemma 4 E2B’s footprint to just 1 GB. The checkpoints were published on Hugging Face and positioned as ready to run.
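As a rough check on the memory figures reported above, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a measurement: it assumes weight memory is approximately parameter count × bits per weight ÷ 8, ignores activations, KV cache, and runtime overhead, and treats a 16-bit baseline as an assumption (the coverage does not state the original precision). The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def implied_bits_per_weight(params_billions: float, size_gb: float) -> float:
    """Average bits per weight implied by a reported model size."""
    return size_gb * 8 / params_billions


if __name__ == "__main__":
    # Reported in the coverage: a 31B model running in about 20 GB on an RTX 4090.
    print(f"31B at 16-bit (assumed baseline): {weight_memory_gb(31, 16):.1f} GB")
    print(f"31B at 4-bit: {weight_memory_gb(31, 4):.1f} GB")
    print(f"Bits per weight implied by 20 GB: {implied_bits_per_weight(31, 20):.2f}")
    # Moving from 16-bit to 4-bit weights is one way to get a 4x reduction.
    ratio = weight_memory_gb(31, 16) / weight_memory_gb(31, 4)
    print(f"16-bit to 4-bit ratio: {ratio:.0f}x")
```

Under these assumptions, a 31B model needs about 62 GB at 16 bits per weight and about 15.5 GB at 4 bits, so the reported 20 GB figure implies roughly 5 bits per weight on average. That is consistent with aggressive quantization, though the coverage does not give the exact scheme.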
Relevance to AI PMs
1. Lower-cost deployment and broader device coverage
Gemma 4 gives PMs a concrete option for serving AI features beyond expensive hosted APIs. The newsletter repeatedly highlights local execution, mobile-friendly formats, RTX 4090 viability, and edge experiences. For PMs, that means more room to design offline, privacy-sensitive, or low-latency product experiences.
2. Optimization headroom without obvious quality tradeoffs
The strongest recurring theme is efficiency improvement: TurboQuant-style compression, per-layer embeddings, multi-token prediction, and QAT checkpoints with major memory reductions. PMs can use this as leverage in roadmap planning—testing whether a feature can move from “too expensive to ship” to “viable at scale” through model optimization rather than model replacement.
3. Strong ecosystem signals for experimentation and adoption
Gemma 4 has visible support from Google DeepMind, Google AI, Vertex AI, GitHub, Hugging Face, and independent educators/builders. For PMs, that reduces adoption risk: there are multiple distribution channels, public examples, implementation notebooks, and benchmark-style discussions that can accelerate prototyping and de-risk technical discovery.
Related
- Google DeepMind — Creator and launch source for Gemma 4.
- Google / Google AI / Sundar Pichai / Jeff Dean / Demis Hassabis — Executive and organizational voices that amplified Gemma 4 adoption, launch visibility, and ecosystem momentum.
- Vertex AI — One of the official channels mentioned for accessing Gemma 4 resources and getting started in production workflows.
- GitHub — Key distribution point for code samples, notebooks, and community implementations, including Sebastian Raschka’s from-scratch work.
- Hugging Face — Hosting destination for optimized Gemma 4 checkpoints, including QAT releases shared by Philipp Schmid.
- Philipp Schmid — Important community contributor tied to major post-launch improvements, including Multi-Token Prediction and QAT checkpoints.
- Sebastian Raschka — Helped explain Gemma 4 internals and positioned it within broader LLM architecture trends.
- Simon Willison — Documented hands-on local/mobile use via Google AI Edge Gallery.
- Google AI Edge Gallery — Demonstrates Gemma 4’s practical mobile and on-device deployment potential.
- TurboQuant — Compression technique cited in coverage explaining how larger Gemma 4 variants can run with lower memory footprints.
- RTX 4090 — Reference hardware used in discussion of practical local inference for Gemma 4.
- Multi-Token Prediction — Important performance optimization that reportedly tripled inference speed on Gemma 4 variants.
- Gemma 3 — Preceding model family, also present in some edge app experiences, useful as a comparison point for capability and deployment evolution.
- DeepSeek V4 — Mentioned alongside Gemma 4 in architecture comparisons, providing context for competitive model design trends.
Newsletter Mentions (12)
“𝕏 Philipp Schmid released new QAT Gemma 4 checkpoints that match original performance while using ~4× less memory, plus a mobile quantization format shrinking Gemma 4 E2B’s footprint to just 1 GB.”
GenAI PM Daily, June 09, 2026. Today's top 25 insights for PM Builders, ranked by relevance from X, Blogs, and YouTube. NotebookLM update adds PDF, DOCX, XLSX, PPTX exports and chart support for better research #1 𝕏 Philipp Schmid released new QAT Gemma 4 checkpoints that match original performance while using ~4× less memory, plus a mobile quantization format shrinking Gemma 4 E2B’s footprint to just 1 GB. They’re now available on Hugging Face and ready to run. #2 𝕏 NVIDIA AI shows how to train models faster with JAX and MaxText using NVFP4 precision on NVIDIA Blackwell GPUs, sharing detailed benchmarks, a full recipe breakdown, and a MaxText example. #3 𝕏 Cognition launched FrontierCode, a coding evaluation platform setting a new standard in difficulty and quality with each task crafted over 40+ hours by top open-source maintainers. #4 𝕏 Josh Woodward unveiled a new NotebookLM feature that lets you expand searches beyond your own source files. Today’s update adds export options (PDF, DOCX, XLSX, PPTX and charts) to help you do better research.
“#4 𝕏 Sebastian Raschka presents a visual overview of recent LLM architectures—from Gemma 4 to DeepSeek V4—showcasing long-context efficiency tweaks.”
Today's top 13 insights for PM Builders, ranked by relevance from X, Blogs, and LinkedIn. Why LLM features need end-to-end observability metrics #1 𝕏 Boris Cherny upgraded /usage to show personalized token usage by plugin, skill, and parallel agent, so you can pinpoint high-consumption drivers and maximize your doubled rate limits. #2 𝕏 xAI integrates X Premium subscriptions into Hermes Agent and equips it with native search across X posts. #3 📝 PromptLayer Blog A deep dive into LLM observability tools - Discusses the need for observability when shipping LLM-powered features, since models can return confidently wrong answers while logs show successful API responses. Argues observability must connect inputs, outputs, latency, cost, and quality to diagnose real production issues. #4 𝕏 Sebastian Raschka presents a visual overview of recent LLM architectures—from Gemma 4 to DeepSeek V4—showcasing long-context efficiency tweaks.
“Philipp Schmid launched Multi-Token Prediction for Gemma 4, tripling inference speed with zero quality loss—now available E2B/E4B under Apache 2.0.”
#13 𝕏 Philipp Schmid launched Multi-Token Prediction for Gemma 4, tripling inference speed with zero quality loss—now available E2B/E4B under Apache 2.0. #14 𝕏 Philipp Schmid outlines four subagent coordination patterns—tool calls, spawns, pools, and teams—to structure multi-agent workflows.
“Open-Source Gemma 4 Embedding Demo Available #1 𝕏 Sebastian Raschka shared a from-scratch Jupyter Notebook implementation of Gemma 4 E2B on GitHub, demonstrating how per-layer embeddings are built.”
Open-Source Gemma 4 Embedding Demo Available #1 𝕏 Sebastian Raschka shared a from-scratch Jupyter Notebook implementation of Gemma 4 E2B on GitHub, demonstrating how per-layer embeddings are built.
“Google AI spotlights fun builder projects powered by last week’s open-source Gemma 4 models.”
#9 𝕏 Google AI spotlights fun builder projects powered by last week’s open-source Gemma 4 models. Examples include automated meeting summarizers, code-generation assistants, and multilingual chatbots, each detailed with tool integrations, performance stats, and user insights.
“Google DeepMind launched Gemma 4, a lineup of 7B–196B-parameter foundation models with up to 100K-token contexts and multimodal capabilities.”
#2 𝕏 Google DeepMind launched Gemma 4, a lineup of 7B–196B-parameter foundation models with up to 100K-token contexts and multimodal capabilities. Developers can now access open-source weights, code samples, and tutorials via Vertex AI and GitHub to jumpstart building AI apps. Also covered by: @Jeff Dean
“#6 𝕏 Sundar Pichai announces that Gemma 4 has exceeded 10 million downloads in its first week, pushing total Gemma model downloads past 500 million, and shares excitement to see what users build next. #7 ▶️ Google just casually disrupted the open-source AI narrative… Fireship Google’s Gemma 4 is a 31 billion-parameter, Apache 2.0-licensed open-source LLM that runs locally in 20 GB on an RTX 4090 by using TurboQuant and per-layer embeddings for compression.”
Today's top 25 insights for PM Builders, ranked by relevance from Blogs, X, YouTube, and LinkedIn. Anthropic Scales Managed Agents #1 📝 Anthropic Engineering Scaling Managed Agents: Decoupling the brain from the hands - This article describes an approach to scale managed agents by separating decision-making (the 'brain') from execution (the 'hands'), enabling better scalability and modularity of agentic systems. It outlines architectural patterns for building managed-agent platforms. #2 📝 OpenAI News The next phase of enterprise AI - OpenAI announces the next phase of its enterprise AI strategy, describing initiatives to accelerate adoption of advanced AI capabilities across businesses and enterprises. #3 𝕏 Sundar Pichai announced Notebooks are now rolling out in the Gemini app for Google AI Ultra, Pro, and Plus web subscribers, letting users organize conversations, notes, and project sources. The feature integrates with NotebookLM for seamless deep dives. #4 𝕏 Philipp Schmid rolled out Flex and Priority `service_tiers` for the Gemini API—Flex inference (`service_tier="flex"`) cuts costs by 50% on latency-tolerant workloads, while Priority (`service_tier="priority"`) guarantees low-latency with automatic fallback to Standard, all vi... #5 𝕏 AI at Meta unveiled Muse Spark, a multimodal model built from the ground up to integrate visual and textual data for richer AI understanding. #6 𝕏 Sundar Pichai announces that Gemma 4 has exceeded 10 million downloads in its first week, pushing total Gemma model downloads past 500 million, and shares excitement to see what users build next. Also covered by: @Santiago #7 ▶️ Google just casually disrupted the open-source AI narrative… Fireship Google’s Gemma 4 is a 31 billion-parameter, Apache 2.0-licensed open-source LLM that runs locally in 20 GB on an RTX 4090 by using TurboQuant and per-layer embeddings for compression. 
Gemma 4 big model (31 B parameters) downloads in 20 GB and delivers ~10 tokens/sec on a single RTX 4090, while its Edge variant can run on a phone or Raspberry Pi. TurboQuant compresses model weights by converting Cartesian data to polar coordinates and applying the Johnson–Lindenstrauss transform to quantize values to single sign bits while preserving distances. Models named E2B and E4B use “effective parameters” via per-layer embeddings, giving each transformer layer its own token embedding to introduce information exactly when needed. Also covered by: @Santiago ...
“Google's official iPhone app for running Gemma 4 models (E2B, E4B and some Gemma 3 family) works very well locally, with the E2B model a 2.54GB download; it supports image Q&A, short audio transcription and an interactive "skills" demo but conversations are ephemeral and the app lacks permanent logs.”
#2 📝 Simon Willison Google AI Edge Gallery - Google's official iPhone app for running Gemma 4 models (E2B, E4B and some Gemma 3 family) works very well locally, with the E2B model a 2.54GB download; it supports image Q&A, short audio transcription and an interactive "skills" demo but conversations are ephemeral and the app lacks permanent logs.
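The Fireship summary quoted above describes TurboQuant as using a Johnson–Lindenstrauss transform to quantize values to single sign bits while preserving distances. The sketch below is not TurboQuant. It illustrates the general idea with sign random projections (a SimHash-style technique): project vectors with a random Gaussian matrix, keep only the sign of each projection, and estimate the angle between two vectors from the fraction of differing bits. All function names and parameters are hypothetical and chosen for illustration.

```python
import math
import random


def random_projection(dim: int, n_bits: int, seed: int = 0) -> list[list[float]]:
    """Build an n_bits x dim matrix of independent standard Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sign_bits(vec: list[float], proj: list[list[float]]) -> list[int]:
    """Keep one bit per projection: 1 if the dot product is non-negative, else 0."""
    return [1 if sum(p * x for p, x in zip(row, vec)) >= 0 else 0 for row in proj]


def estimated_angle(bits_a: list[int], bits_b: list[int]) -> float:
    """Estimate the angle in radians: P(bits differ) = angle / pi for each projection."""
    mismatches = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * mismatches / len(bits_a)


def true_angle(a: list[float], b: list[float]) -> float:
    """Exact angle in radians between two non-zero vectors, for comparison."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


if __name__ == "__main__":
    rng = random.Random(42)
    u = [rng.gauss(0.0, 1.0) for _ in range(64)]
    v = [rng.gauss(0.0, 1.0) for _ in range(64)]
    proj = random_projection(dim=64, n_bits=2048)
    print("true angle:     ", round(true_angle(u, v), 3))
    print("estimated angle:", round(estimated_angle(sign_bits(u, proj), sign_bits(v, proj)), 3))
```

Each vector is stored as 2048 bits here instead of 64 floats. Using more bits tightens the angle estimate, which is the trade-off any sign-bit scheme makes between memory and fidelity.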