Gemma 4
An open model family from Google DeepMind, frequently referenced alongside Multi-Token Prediction and its E2B/E4B variants. Important for PMs interested in open models and inference optimization.
Key Highlights
- Gemma 4 is an Apache 2.0–licensed open model family from Google DeepMind aimed at flexible deployment across cloud, local, and edge environments.
- The E2B and E4B variants make Gemma 4 especially notable for practical on-device and mobile inference use cases.
- Coverage emphasized deployment efficiency features such as TurboQuant, per-layer embeddings, and later Multi-Token Prediction.
- Philipp Schmid’s Multi-Token Prediction release for Gemma 4 reportedly tripled inference speed without reducing output quality.
- For AI PMs, Gemma 4 is most relevant as a tool for balancing openness, performance, cost control, and edge product experiences.
Overview
Gemma 4 is Google DeepMind’s family of Apache 2.0–licensed open models, positioned as a foundation-model lineup for developers who want to run capable models on their own hardware or through Google tooling such as Vertex AI and GitHub starter resources. Newsletter mentions describe the family as spanning roughly 7B to 196B parameters, with multimodal capabilities and context windows up to 100K tokens. Variants such as E2B and E4B were repeatedly highlighted because they make Gemma 4 more practical for local and edge deployment.

For AI Product Managers, Gemma 4 matters because it sits at the intersection of open-model strategy, deployability, and inference efficiency. The coverage around Gemma 4 was not just about model quality; it emphasized product realities such as local inference on phones, compression techniques like TurboQuant and per-layer embeddings, and later Multi-Token Prediction, which reportedly tripled inference speed with no quality loss. That combination makes Gemma 4 especially relevant for PMs evaluating cost, latency, privacy, offline UX, and control over deployment.
Key Developments
- 2026-04-03: Google DeepMind introduced Gemma 4 as a family of Apache 2.0–licensed open models designed for advanced reasoning and agentic workflows on user-controlled hardware.
- 2026-04-06: Google AI Edge Gallery for iPhone showcased local Gemma 4 inference, including image question answering, short audio transcription, and a "skills" demo for tool-calling via HTML widgets.
- 2026-04-07: Simon Willison highlighted the iPhone experience further, noting that Gemma 4 E2B was a 2.54GB local download and worked well, though conversations were ephemeral and lacked permanent logs.
- 2026-04-09: Sundar Pichai said Gemma 4 exceeded 10 million downloads in its first week, pushing total Gemma downloads past 500 million.
- 2026-04-09: Fireship coverage framed Gemma 4 as a major open-source release, noting a 31B-parameter model that could run locally in about 20 GB on an RTX 4090 using TurboQuant and per-layer embeddings; Edge variants were described as lightweight enough for phones or Raspberry Pi–class devices.
- 2026-04-10: Google DeepMind formally launched Gemma 4 as a lineup of 7B–196B-parameter multimodal foundation models with up to 100K-token contexts, with access through Vertex AI and GitHub resources. Jeff Dean also amplified the release.
- 2026-04-11: Google AI highlighted builder projects using Gemma 4, including meeting summarizers, code-generation assistants, and multilingual chatbots, illustrating practical application patterns and integrations.
- 2026-04-12: Sebastian Raschka shared a from-scratch Jupyter Notebook implementation of Gemma 4 E2B on GitHub, demonstrating how per-layer embeddings are constructed.
- 2026-05-06: Philipp Schmid launched Multi-Token Prediction for Gemma 4, reporting 3x inference speed improvements with no quality loss, available for E2B and E4B under Apache 2.0.
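The mentions report the 3x speedup but not the mechanism. A common way multi-token prediction accelerates decoding is speculative: cheap extra heads draft several future tokens, and a single full forward pass verifies the whole draft in parallel, so several tokens are committed per expensive pass. The sketch below is a toy illustration of that verify-accept loop only, not Schmid's actual implementation; the model stand-ins and the 20% draft-error rate are invented for illustration.

```python
import random

random.seed(0)

def true_next(prefix):
    """Stand-in for the full model's greedy next token (deterministic toy rule)."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft(prefix, k):
    """Cheap multi-token proposal: agrees with the full model ~80% per token."""
    out, p = [], list(prefix)
    for _ in range(k):
        t = true_next(p)
        if random.random() < 0.2:       # assumed draft-error rate
            t = (t + 1) % 100
        out.append(t)
        p.append(t)
    return out

def generate(prefix, n, k=4):
    """Verify a whole k-token draft with one 'full-model pass'; keep the
    matching prefix plus one corrected (or bonus) token per pass."""
    toks, passes = list(prefix), 0
    while len(toks) - len(prefix) < n:
        d = draft(toks, k)
        passes += 1                     # one parallel pass verifies all k drafts
        for t in d:
            if t == true_next(toks):
                toks.append(t)                # draft token confirmed
            else:
                toks.append(true_next(toks))  # first mismatch: corrected token
                break
        else:
            toks.append(true_next(toks))      # whole draft accepted: bonus token
    return toks[len(prefix):][:n], passes

out, passes = generate([1, 2, 3], 40, k=4)
print(f"{len(out)} tokens in {passes} passes")
```

Because every accepted token is checked against the full model, the output is identical to plain greedy decoding; only the number of expensive passes changes, which is why this family of techniques can claim speedups "with no quality loss."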
Relevance to AI PMs
1. Useful for open-model product strategy: Gemma 4 gives PMs a credible open alternative for products where licensing flexibility, self-hosting, or hardware-level control matters. This is especially relevant for enterprise, regulated, or privacy-sensitive use cases.
2. Strong fit for latency and cost optimization: Mentions of TurboQuant, per-layer embeddings, and Multi-Token Prediction point to a practical optimization story. PMs can use Gemma 4 to test whether smaller or compressed variants can meet UX requirements at materially lower inference cost.
3. Enables on-device and edge experiences: The repeated coverage of E2B/E4B, iPhone inference, and edge demos suggests Gemma 4 is relevant for offline assistants, mobile copilots, or embedded AI products where cloud dependence is a product risk.
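Per-layer embeddings, cited above as one of the efficiency levers, give each transformer layer its own small token-embedding table so token identity can be re-injected at exactly the depth where it is needed; because those tables can be streamed from slower memory, the "effective" parameter count on the accelerator stays low (the apparent origin of the E2B/E4B naming). A minimal NumPy sketch of the idea, with all sizes and variable names invented for illustration and a single linear mix standing in for each real transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D, N_LAYERS, PLE_D = 100, 32, 4, 8   # toy sizes, not Gemma's real ones

# Shared input embedding plus one small embedding table per layer.
input_emb = rng.normal(size=(VOCAB, D))
per_layer_emb = rng.normal(size=(N_LAYERS, VOCAB, PLE_D))
proj_up = rng.normal(size=(N_LAYERS, PLE_D, D)) / np.sqrt(PLE_D)
mix = rng.normal(size=(N_LAYERS, D, D)) / np.sqrt(D)

def forward(token_ids):
    h = input_emb[token_ids]                    # (T, D)
    for layer in range(N_LAYERS):
        # Stand-in for the layer's attention/MLP work: a single linear mix.
        h = np.tanh(h @ mix[layer])
        # Per-layer embedding: re-inject token identity at this depth. These
        # small tables could live off-accelerator and be fetched per layer.
        ple = per_layer_emb[layer][token_ids]   # (T, PLE_D)
        h = h + ple @ proj_up[layer]
    return h

out = forward(np.array([3, 14, 15]))
print(out.shape)
```

The product-relevant point is the memory split: the big per-layer tables add capacity without adding to the weights that must be resident on a phone-class accelerator at once.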
Related
- Google DeepMind / Google / Sundar Pichai / Demis Hassabis / Jeff Dean: Core organizations and leaders associated with Gemma 4’s launch, promotion, and strategic positioning in the open-model ecosystem.
- Vertex AI: One of the main distribution and developer access points for Gemma 4, relevant for PMs considering managed deployment paths.
- GitHub: Key channel for code samples, notebooks, and open-source implementations, including educational demos around Gemma 4 internals.
- Philipp Schmid: Notable for shipping Multi-Token Prediction support for Gemma 4, making the model family more attractive for inference-sensitive applications.
- Sebastian Raschka: Shared a from-scratch Gemma 4 E2B implementation that helped explain architectural details like per-layer embeddings.
- Simon Willison: Covered the real-world mobile experience of running Gemma 4 locally through Google AI Edge Gallery.
- Google AI Edge Gallery: Demonstrated Gemma 4 in local mobile inference scenarios, including multimodal and tool-calling behaviors.
- Gemma 3: Prior Gemma family generation that provides context for product evolution and model positioning.
- Multi-Token Prediction: A major optimization linked to Gemma 4 that materially improves inference throughput.
- TurboQuant / RTX 4090: Frequently cited in discussions of Gemma 4’s compression and local deployment practicality.
Newsletter Mentions (10)
“Philipp Schmid launched Multi-Token Prediction for Gemma 4, tripling inference speed with zero quality loss—now available E2B/E4B under Apache 2.0.”
#13 𝕏 Philipp Schmid launched Multi-Token Prediction for Gemma 4, tripling inference speed with zero quality loss—now available E2B/E4B under Apache 2.0.
“Open-Source Gemma 4 Embedding Demo Available #1 𝕏 Sebastian Raschka shared a from-scratch Jupyter Notebook implementation of Gemma 4 E2B on GitHub, demonstrating how per-layer embeddings are built.”
“Google AI spotlights fun builder projects powered by last week’s open-source Gemma 4 models.”
#9 𝕏 Google AI spotlights fun builder projects powered by last week’s open-source Gemma 4 models. Examples include automated meeting summarizers, code-generation assistants, and multilingual chatbots, each detailed with tool integrations, performance stats, and user insights.
“Google DeepMind launched Gemma 4, a lineup of 7B–196B-parameter foundation models with up to 100K-token contexts and multimodal capabilities.”
#2 𝕏 Google DeepMind launched Gemma 4, a lineup of 7B–196B-parameter foundation models with up to 100K-token contexts and multimodal capabilities. Developers can now access open-source weights, code samples, and tutorials via Vertex AI and GitHub to jumpstart building AI apps. Also covered by: @Jeff Dean
“#6 𝕏 Sundar Pichai announces that Gemma 4 has exceeded 10 million downloads in its first week, pushing total Gemma model downloads past 500 million, and shares excitement to see what users build next. #7 ▶️ Google just casually disrupted the open-source AI narrative… Fireship Google’s Gemma 4 is a 31 billion-parameter, Apache 2.0-licensed open-source LLM that runs locally in 20 GB on an RTX 4090 by using TurboQuant and per-layer embeddings for compression.”
Gemma 4 big model (31 B parameters) downloads in 20 GB and delivers ~10 tokens/sec on a single RTX 4090, while its Edge variant can run on a phone or Raspberry Pi. TurboQuant compresses model weights by converting Cartesian data to polar coordinates and applying the Johnson–Lindenstrauss transform to quantize values to single sign bits while preserving distances. Models named E2B and E4B use “effective parameters” via per-layer embeddings, giving each transformer layer its own token embedding to introduce information exactly when needed. Also covered by: @Santiago ...
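The TurboQuant description above (a Johnson–Lindenstrauss transform followed by quantization to single sign bits while preserving distances) matches the classic random-hyperplane construction, in which the Hamming distance between sign patterns estimates the angle between the original vectors. The sketch below illustrates that textbook identity under that reading; it is an assumption about the mechanism, not TurboQuant's actual algorithm, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

D, M = 64, 1024               # original dimension, number of sign bits (illustrative)
R = rng.normal(size=(D, M))   # random JL-style projection: M hyperplane normals

def sign_bits(x):
    """Project onto random directions and keep only the sign: 1 bit per direction."""
    return (x @ R) > 0

def angle_estimate(bits_a, bits_b):
    """Random-hyperplane identity: P(sign differs) = angle / pi, so the
    fraction of differing bits times pi estimates the original angle."""
    return np.mean(bits_a != bits_b) * np.pi

a = rng.normal(size=D)
b = a + 0.5 * rng.normal(size=D)   # a nearby vector

true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
est = angle_estimate(sign_bits(a), sign_bits(b))
print(round(true_angle, 2), round(est, 2))
```

The practical takeaway for the compression claim: 64 floats (256 bytes) become 1024 bits (128 bytes) here, yet relative angular distances between vectors remain approximately recoverable, which is the sense in which sign-bit quantization can "preserve distances."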
“Google's official iPhone app for running Gemma 4 models (E2B, E4B and some Gemma 3 family) works very well locally, with the E2B model a 2.54GB download; it supports image Q&A, short audio transcription and an interactive "skills" demo but conversations are ephemeral and the app lacks permanent logs.”
#2 📝 Simon Willison Google AI Edge Gallery - Google's official iPhone app for running Gemma 4 models (E2B, E4B and some Gemma 3 family) works very well locally, with the E2B model a 2.54GB download; it supports image Q&A, short audio transcription and an interactive "skills" demo but conversations are ephemeral and the app lacks permanent logs.
“Google AI Edge Gallery - Google's official app for running Gemma 4 models on iPhone provides fast, useful local inference (notably the E2B model) plus image question answering, short audio transcription, and an interesting 'skills' demo showing tool-calling via HTML widgets.”
Google Launches AI Edge Gallery App for iPhone #1 📝 Simon Willison Google AI Edge Gallery - Google's official app for running Gemma 4 models on iPhone provides fast, useful local inference (notably the E2B model) plus image question answering, short audio transcription, and an interesting 'skills' demo showing tool-calling via HTML widgets. The app works well but conversations are ephemeral and it lacks permanent logs.
“Google DeepMind launched Gemma 4, a family of Apache 2.0–licensed open models you can run on your own hardware for advanced reasoning and agentic workflows.”
Google DeepMind Releases Gemma 4 Open Models #1 𝕏 Google DeepMind launched Gemma 4, a family of Apache 2.0–licensed open models you can run on your own hardware for advanced reasoning and agentic workflows. Also covered by: @Sebastian Raschka , @Simon Willison , @Philipp Schmid , @Jeff Dean , @Google DeepMind , @Demis Hassabis