GenAI PM
tool9 mentions· Updated May 19, 2026

Gemini Interactions API

The Gemini Interactions API is a Google Gemini interface for building streaming applications. The newsletter highlights a guide focused on making streaming easier for agents and developers.

Key Highlights

  • Gemini Interactions API is positioned as Google’s interface for building streaming, multimodal, agentic applications.
  • Recent updates introduced structured interaction steps, thought handling, encrypted signatures, and improved context management.
  • The API supports multimodal inputs and outputs, including images, documents, and long-form video understanding.
  • Its growing ecosystem includes TypeScript agent frameworks, CLI-installable skills, and practical implementation guides from Philipp Schmid.
  • For AI PMs, the API is especially relevant for real-time UX, multimodal roadmap planning, and reducing agent orchestration complexity.

Gemini Interactions API

Overview

The Gemini Interactions API is Google’s interface for building agentic and streaming applications on top of Gemini models. Across the newsletter coverage, it is positioned as a developer-facing API that supports multimodal inputs, tool and function calling, streaming responses, long-context understanding, and more structured interaction patterns for agents. It has also been referred to as the Gemini API, Deep Research API, and Interactions API in related coverage.

For AI Product Managers, this matters because the API appears to be evolving quickly toward production-grade agent workflows: multimodal input handling, image-aware function calling, structured step-based conversations, stateful and stateless operation modes, and better streaming ergonomics. Those capabilities directly affect product decisions around UX, latency, orchestration, cost controls, and how much complexity teams can offload from custom middleware into the model interface itself.

Key Developments

  • 2026-01-07: In beta, the Gemini Interactions API was highlighted as supporting multimodal inputs including images, PDFs, CSVs, and custom data via Deep Research.
  • 2026-02-14: Philipp Schmid shared that the API added multimodal function calling, allowing agents to see, process, and return real images alongside text outputs using Gemini 3 capabilities.
  • 2026-02-17: Philipp Schmid introduced a minimal TypeScript agent framework for the API, split into agents-core for the core loop, streaming events, and tool calling, and agent for higher-level tools, hooks, sessions, skills, and subagents.
  • 2026-03-05: A new Gemini Interactions API skill for building advanced agentic apps was launched, with installation via Vercel or Context7AI CLIs.
  • 2026-03-10: The API was showcased as being able to process minutes to hours of YouTube video content in seconds with a single API call, signaling strong long-video understanding for multimodal applications.
  • 2026-03-17: Philipp Schmid published a Nano Banana 2 guide using the Gemini Interactions API across use cases such as text-to-image poster generation, Web Search grounding, Image Search, and reference-based image workflows.
  • 2026-05-08: The interaction model was updated to replace fixed `user`/`model` roles with discrete steps such as `user_input`, `thought`, `function_call`, `tool_call`, and `model_output`, while also consolidating `response_format` controls.
  • 2026-05-13: New guidance covered `thought` steps and encrypted signatures, plus stateful vs. stateless modes, model switching, and context management for agent development.
  • 2026-05-19: Philipp Schmid published a guide focused on streaming in the Gemini Interactions API, emphasizing easier implementation of streaming applications for agents and developers.

Relevance to AI PMs

1. Designing better real-time UX: The API’s emphasis on streaming makes it relevant for products where perceived latency matters, such as copilots, chat interfaces, and agent dashboards. PMs can use these capabilities to specify partial responses, progressive rendering, and interruptible workflows instead of waiting for full completions.

2. Planning multimodal and agent roadmaps: Support for images, PDFs, CSVs, long video, and multimodal function calling means PMs can define broader workflows inside one interface instead of stitching together many specialized services. This is especially useful for research agents, content tools, customer support automation, and media-heavy workflows.

3. Reducing orchestration complexity: The move from simple roles to structured steps, plus support for stateful/stateless modes and context management, gives PMs clearer primitives for tool use, reasoning visibility, and model switching. That can simplify requirements for engineering teams building robust agent systems with auditability and extensibility.

Related

  • Google / Gemini: The API is part of the broader Google Gemini ecosystem and reflects Google’s push toward multimodal, agent-friendly model interfaces.
  • Philipp Schmid / Phil Schmid: The most frequent source of updates and guides in the newsletter, often demonstrating new capabilities and implementation patterns.
  • Logan Kilpatrick: Related Gemini API coverage included billing and spend-cap controls, relevant for production rollout and cost governance.
  • nano-banana-2: Demonstrated as a use case built with the Gemini Interactions API for image generation and grounded multimodal workflows.
  • agents-core / agent: Lightweight TypeScript frameworks built around the API, showing how developers can structure streaming loops, tools, sessions, and subagents.
  • gemini-3-deep-think / deep-research: Closely associated capabilities and naming in the ecosystem, especially around multimodal input handling and richer research-style workflows.

Newsletter Mentions (9)

2026-05-19
Philipp Schmid published a new guide for streaming in the Gemini Interactions API to make building streaming applications super easy.

#12 𝕏 Philipp Schmid published a new guide for streaming in the Gemini Interactions API to make building streaming applications super easy. Just point your agent to it and let it handle the rest.

2026-05-13
#7 𝕏 Philipp Schmid published a guide for Gemini Interactions API’s new `thought` steps and encrypted signatures, detailing stateful vs. stateless modes, seamless model switching, and effortless context management to supercharge agent development.

#7 𝕏 Philipp Schmid published a guide for Gemini Interactions API’s new `thought` steps and encrypted signatures, detailing stateful vs. stateless modes, seamless model switching, and effortless context management to supercharge agent development. Also covered by: @Sundar Pichai

2026-05-08
Philipp Schmid updated the Gemini Interactions API to replace rigid `user`/`model` roles with discrete “steps” (user_input, thought, function_call, tool_call, model_output, etc.), consolidated response_format controls, and added a toggle in the docs.

The newsletter highlights a structural change to Gemini’s interaction model for developers.

2026-03-17
#7 𝕏 Philipp Schmid wrote a developer guide for Nano Banana 2 with the Gemini Interactions API, walking through four use cases: text-to-image photorealistic Kyoto travel poster generation, Web Search grounding with real landmark facts, Image Search for accurate photos, and referen...

Today's top 25 insights for PM Builders, ranked by relevance from Blogs, X, YouTube, and LinkedIn. #7 𝕏 Philipp Schmid wrote a developer guide for Nano Banana 2 with the Gemini Interactions API, walking through four use cases: text-to-image photorealistic Kyoto travel poster generation, Web Search grounding with real landmark facts, Image Search for accurate photos, and referen... #8 𝕏 Logan Kilpatrick explains that the Gemini API offers two cost-control levers—global billing account caps to cap overall spend and user-set spend caps to limit individual usage—detailing how each works to manage billing.

2026-03-10
#10 𝕏 Philipp Schmid shows how the Gemini Interactions API can process minutes to hours of YouTube video content in seconds with a single API call, highlighting a major leap in video understanding.

The newsletter presents the API as a breakthrough for processing and understanding long video content. It is framed as a practical capability for builders working with multimodal data.

2026-03-05
Philipp Schmid launched a new Gemini Interactions API skill for building advanced agentic apps with Gemini models, installable globally via the Vercel or Context7AI CLIs.

#6 𝕏 Philipp Schmid launched a new Gemini Interactions API skill for building advanced agentic apps with Gemini models, installable globally via the Vercel or Context7AI CLIs.

2026-02-17
Philipp Schmid built a minimal TypeScript agent framework for the Gemini Interactions API, split into agents-core (~500 LOC for a clean loop, streaming events and tool calling) and agent (built-in tools, hooks, sessions, skills & subagents).

#3 𝕏 Philipp Schmid built a minimal TypeScript agent framework for the Gemini Interactions API, split into agents-core (~500 LOC for a clean loop, streaming events and tool calling) and agent (built-in tools, hooks, sessions, skills & subagents). #4 ▶️ Claude Code built me a $273/Day online directory Greg Isenberg Frey Chu uses Claude Code, Outscraper, Crawl for AI and Claude Vision to automate scraping, cleaning and enriching 71,000 Google Maps entries into a luxury restroom trailers directory of 725 listings in four days for under $250.

2026-02-14
Philipp Schmid announced the Gemini Interactions API now supports multimodal function calling, letting agents natively see, process, and return real images (not just text) with Gemini 3’s image processing and mixed text/image outputs.

#2 𝕏 Philipp Schmid announced the Gemini Interactions API now supports multimodal function calling, letting agents natively see, process, and return real images (not just text) with Gemini 3’s image processing and mixed text/image outputs. Also covered by: @Jeff Dean

2026-01-07
Phil Schmid @_philschmid shared that Gemini Interactions API (beta) now supports multimodal inputs like images, PDFs, CSVs, and custom data via Deep Research.

AI Tools & Applications Deep Research API : Phil Schmid @_philschmid shared that Gemini Interactions API (beta) now supports multimodal inputs like images, PDFs, CSVs, and custom data via Deep Research. v0 Prompt Directory : V0 @v0 highlighted a prompt directory by v0 Ambassador @rajoninternet as a quick start to ship AI apps. LlamaSheets : Llama Index @llama_index launched LlamaSheets to parse complex Excel files into AI-ready data while preserving semantic context and hierarchy.

Stay updated on Gemini Interactions API

Get curated AI PM insights delivered daily — covering this and 1,000+ other sources.

Subscribe Free