reinforcement learning
A training approach used here to teach Composer to self-summarize, reducing reliance on handcrafted prompts.
Key Highlights
- Reinforcement learning helps AI systems improve behavior based on outcome feedback rather than prompts alone.
- Cursor used RL to train Composer to self-summarize, reportedly cutting compaction errors by 50%.
- Kimi Moonshot, Cursor, and Chroma were highlighted as training vertical agentic models with RL.
- For AI PMs, RL is most valuable when tied to measurable product outcomes like task success and error reduction.
Reinforcement learning
Overview
Reinforcement learning (RL) is a training approach where a model learns to make better decisions by receiving feedback on the outcomes of its actions, rather than only imitating examples or following fixed prompts. In AI products, RL is especially useful when success depends on multi-step behavior, adaptation, and optimizing for real-world results such as task completion, correctness, or reduced error rates.

For AI Product Managers, RL matters because it offers a path to improving agent reliability and product performance beyond prompt engineering alone. The newsletter mentions show RL being used in two important ways: first, to train Composer to self-summarize instead of relying on handcrafted prompts, reportedly reducing compaction errors by 50%; and second, as a method used by companies like Kimi Moonshot, Cursor, and Chroma to train vertical agentic models on top of strong base models with production harnesses and outcome-based rewards. This makes RL highly relevant for products that need durable, measurable improvements in complex workflows.
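To make the outcome-feedback idea concrete, here is a minimal, self-contained policy-gradient (REINFORCE) sketch in Python. The three-strategy "task" and its success probabilities are invented for illustration and do not reflect Cursor's or Moonshot's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy setup: the agent picks one of three strategies per episode, and each
# strategy has a hidden probability of completing the task. These numbers
# are illustrative only.
TASK_SUCCESS_PROB = np.array([0.2, 0.5, 0.8])

theta = np.zeros(3)  # learnable preferences; the policy is softmax(theta)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

LEARNING_RATE = 0.1
for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    # Outcome-based reward: 1 if the task succeeded, 0 otherwise.
    # No per-step labels or demonstrations are involved.
    reward = float(rng.random() < TASK_SUCCESS_PROB[action])
    # REINFORCE gradient of log pi(action): one-hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += LEARNING_RATE * reward * grad_log_pi

# Probability mass shifts toward the highest-success strategy.
print("learned policy:", softmax(theta).round(3))
```

The only learning signal here is the binary task outcome, which is the property that distinguishes RL from imitating examples or following a fixed prompt.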
Key Developments
- 2026-03-18: Cursor trained Composer to self-summarize via reinforcement learning instead of relying on a prompt. According to the newsletter mention, this cut compaction errors by 50% and helped Composer handle coding tasks requiring hundreds of actions.
- 2026-03-29: Philipp Schmid highlighted that Kimi Moonshot, Cursor, and Chroma train vertical agentic models via RL by combining a strong base model, a production harness, and outcome-based rewards. This framed RL as a practical recipe for improving domain-specific AI agents; a sketch of that recipe follows below.
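The sketch below wires the three named ingredients together: a base model proposes actions, a production harness executes and scores them, and whole trajectories are reinforced by their outcome. All class and function names are hypothetical; the newsletter mention names only the ingredients, not an implementation.

```python
from dataclasses import dataclass

# Hypothetical interfaces: `policy`, `task`, and `optimizer` stand in for a
# base model, a production-harness task, and a gradient optimizer. None of
# these names come from the newsletter mention.

@dataclass
class Episode:
    prompt: str
    actions: list[str]
    reward: float  # outcome score from the harness, e.g. tests passed

def rollout(policy, task) -> Episode:
    """Run the agent inside the production harness and score the outcome."""
    state, actions = task.initial_state(), []
    while not task.done(state):
        action = policy.act(state)        # base model proposes the next step
        state = task.step(state, action)  # harness executes it (tools, sandbox)
        actions.append(action)
    return Episode(task.prompt, actions, reward=task.score(state))

def train_step(policy, tasks, optimizer):
    """One RL update: reinforce trajectories in proportion to their outcome."""
    episodes = [rollout(policy, t) for t in tasks]
    loss = -sum(ep.reward * policy.log_prob(ep.prompt, ep.actions)
                for ep in episodes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The design point is that the harness, not a human labeler, supplies the training signal, which is what lets the agent improve on real workflow data.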
Relevance to AI PMs
- Improve product behavior beyond prompt tuning: RL can replace brittle prompt-based control with learned behaviors, which may lead to more consistent performance in long-running or multi-step tasks.
- Tie model improvement to product metrics: Outcome-based rewards let PMs align training with business-relevant goals such as task success, reduced error rates, higher completion rates, or better user satisfaction (a reward sketch follows this list).
- Build defensible vertical agents: The combination of a capable base model, production environment data, and RL can help teams create specialized agents that perform better in real workflows than generic models alone.
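As one way to picture "outcome-based rewards tied to product metrics," the sketch below scores an episode with a weighted mix of task success, error count, and step budget. The `EpisodeResult` fields and the weights are hypothetical; the point is that the reward is computed from metrics the product already tracks.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    # Hypothetical per-episode metrics a product team might already log.
    task_completed: bool
    error_count: int
    steps_taken: int
    step_budget: int

def outcome_reward(result: EpisodeResult,
                   w_success: float = 1.0,
                   w_errors: float = 0.5,
                   w_steps: float = 0.25) -> float:
    """Score an episode by the product metrics it moved; weights are illustrative."""
    return (w_success * float(result.task_completed)
            - w_errors * result.error_count
            - w_steps * result.steps_taken / max(result.step_budget, 1))

# Example: a successful run with one recoverable error and half the budget used.
print(outcome_reward(EpisodeResult(True, 1, 50, 100)))  # 1.0 - 0.5 - 0.125 = 0.375
```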
Related
- Kimi Moonshot: Mentioned as one of the companies training vertical agentic models with RL, suggesting RL is part of its applied model improvement strategy.
- Cursor: Featured in both mentions, both as one of the companies training vertical agents with RL and as the team that trained Composer to self-summarize through RL.
- Chroma: Included alongside Kimi Moonshot and Cursor as an example of a company applying RL to vertical agentic systems.
- Composer: A concrete product example where RL was used to train self-summarization behavior, reducing errors and enabling longer coding workflows.
Newsletter Mentions (2)
“#2 𝕏 Philipp Schmid shows how @Kimi_Moonshot, @cursor_ai, and @trychroma all train vertical agentic models via RL using a strong base model, production harness, and outcome-based rewards.”
“Cursor trained Composer to self-summarize via reinforcement learning instead of relying on a prompt, cutting compaction errors by 50% and enabling it to tackle coding tasks requiring hundreds of actions.”