Claude Opus 4.6
A Claude model version referenced as part of a prompt-comparison analysis. It serves as one endpoint for examining changes in Anthropic’s system prompt evolution.
Key Highlights
- Claude Opus 4.6 was used as both a production model and a baseline for comparing later Anthropic releases such as Opus 4.7.
- Coverage linked the model to coding, long-document analysis, browser-security testing, and multi-agent design workflows.
- Discussions around BrowseComp and other evals made Opus 4.6 a useful example of how benchmark setup can materially affect perceived model quality.
- Comparisons with Opus 4.7 showed that newer models can improve on standard benchmarks while still regressing on specific tasks.
- Anthropic's published system prompts made Opus 4.6 especially valuable for studying system prompt evolution over time.
Overview
Claude Opus 4.6 is an Anthropic model version that appeared repeatedly across early 2026 product, engineering, and evaluation discussions. In the source material, it shows up both as a production-grade model used in coding, agentic workflows, security testing, and multi-agent design systems, and as a reference point for comparing later Anthropic releases such as Claude Opus 4.7. It is also cited in analyses of Anthropic's published system prompts, making it useful not just as a model endpoint but as a snapshot in the evolution of model behavior and instruction design.
For AI Product Managers, Claude Opus 4.6 matters because it sits at the intersection of model capability, benchmark interpretation, and product tradeoffs. Coverage around the model highlights practical deployment themes: stronger coding and long-context workflows, higher token usage and controllable effort settings, sensitivity to eval design, and meaningful differences between adjacent model versions. That makes it a useful case study for PMs evaluating model upgrades, benchmark claims, agent reliability, and vendor transparency.
Key Developments
- 2026-02-12: Head-to-head testing compared GPT-5.3 Codex in Codex with Anthropic Opus 4.6 and Opus 4.6 Fast in Cursor for a large redesign and refactor effort, with 93,000 lines of code reportedly shipped in five days.
- 2026-02-13: PromptLayer published a team review of Claude Opus 4.6 based on testing across coding workflows, long-document analysis, and agentic pipelines, emphasizing real-world engineering performance.
- 2026-02-18: Claire Vo compared GPT-5.3 Codex and Claude Opus 4.6 across code-generation benchmarks, product features, and practical API use cases.
- 2026-02-22: PromptLayer's review was highlighted again, noting that Opus 4.6 landed in February 2026 and was evaluated for coding, long-document analysis, and agentic use cases. Commentary also noted that Opus 4.6 and Sonnet 4.6 could deliver more intelligent outputs at the cost of higher token usage, with lower-effort settings available for cheaper runs.
- 2026-03-07: Anthropic and Mozilla reportedly tested Claude's Opus 4.6 agent on Firefox, uncovering 22 vulnerabilities in two weeks, including 14 high-severity issues.
- 2026-03-08: Pencil's swarm mode used six agents powered by Claude Opus 4.6 to collaboratively design a mobile travel log app, exporting a JSON-based design artifact that could be converted into a React + Tailwind + Next.js site.
- 2026-03-14: An article examined how eval-awareness affected Claude Opus 4.6's performance on the BrowseComp benchmark, underscoring how benchmark setup can influence observed results.
- 2026-04-07: Cursor's Composer 2 was reported to outperform Claude Opus 4.6 on "Trust Me Bro" benchmarks for intelligence, speed, and cost, while discussion also focused on the underlying model identity behind Composer 2.
- 2026-04-18: Claude Opus 4.7 was described as outperforming Opus 4.6 on most standard benchmarks through adaptive thinking, but regressing on trick questions (Simple Bench), web browsing (browse_comp), and OCR tests (compared with Gemini 3 Flash).
- 2026-04-19: Simon Willison analyzed changes between the published system prompts for Claude Opus 4.6 and 4.7, using Opus 4.6 as the baseline for understanding Anthropic's system prompt evolution.
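The prompt-comparison entry above describes reading two published system prompts side by side. As a rough illustration of that kind of comparison, and not a reconstruction of the method used in the cited analysis, the sketch below produces a unified diff of two prompt texts with Python's standard `difflib`. The `prompt_diff` helper, the labels, and both sample strings are hypothetical placeholders; they are not the actual Anthropic prompts.

```python
import difflib


def prompt_diff(old_prompt: str, new_prompt: str,
                old_label: str = "old", new_label: str = "new") -> list[str]:
    """Return unified-diff lines describing how new_prompt differs from old_prompt."""
    return list(
        difflib.unified_diff(
            old_prompt.splitlines(),
            new_prompt.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",  # inputs were split without newlines, so emit none
        )
    )


# Placeholder text only: NOT the published system prompts.
OLD = "You are a helpful assistant.\nKeep answers brief.\n"
NEW = "You are a helpful assistant.\nKeep answers brief unless detail is requested.\n"

changes = prompt_diff(OLD, NEW, "prompt-a", "prompt-b")
for line in changes:
    print(line)
```

Lines starting with `-` exist only in the older text and lines starting with `+` only in the newer one, which makes small wording shifts between versions easy to spot.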
Relevance to AI PMs
1. Model upgrade decisions: Claude Opus 4.6 is a strong example of why PMs should compare adjacent model versions beyond aggregate benchmark scores. The reporting around Opus 4.7 shows that improvements on standard evals can still come with regressions on browsing, OCR, or trick-question robustness.
2. Benchmark and eval design: Mentions tied to BrowseComp and other evaluations show that observed performance can depend heavily on setup, task framing, and model awareness of the eval. PMs should validate vendor claims with task-specific evals that resemble their own production workflows.
3. Agentic product planning: Opus 4.6 appears in coding copilots, browser-security testing, and multi-agent design tools, making it relevant for PMs building agentic experiences. The key lesson is to assess not only raw intelligence, but also token economics, controllable effort, reliability under orchestration, and fit for domain-specific tasks.
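The points above about looking past aggregate scores can be made concrete with a small per-task comparison. The sketch below is one possible way to do it, under stated assumptions: the `compare_models` helper, the task names, and every score are invented for illustration and are not measurements of any real model.

```python
from statistics import mean


def compare_models(baseline: dict[str, float],
                   candidate: dict[str, float],
                   tolerance: float = 0.0) -> dict:
    """Compare per-task scores and flag tasks where the candidate regressed."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no tasks in common between the two score sets")
    # Positive delta means the candidate scored higher on that task.
    deltas = {task: candidate[task] - baseline[task] for task in shared}
    regressions = [task for task in shared if deltas[task] < -tolerance]
    aggregate_delta = (mean(candidate[t] for t in shared)
                       - mean(baseline[t] for t in shared))
    return {
        "aggregate_delta": aggregate_delta,
        "deltas": deltas,
        "regressions": regressions,
    }


# Hypothetical scores, made up for this example only.
baseline_scores = {"coding": 0.70, "long_docs": 0.65, "browsing": 0.60, "ocr": 0.55}
candidate_scores = {"coding": 0.82, "long_docs": 0.75, "browsing": 0.52, "ocr": 0.50}

report = compare_models(baseline_scores, candidate_scores)
print(f"aggregate delta: {report['aggregate_delta']:+.4f}")
print("regressed tasks:", report["regressions"])
```

With these invented numbers the average improves while two individual tasks get worse, which is the pattern a single headline score would hide. In practice the task list would come from a team's own production workflows rather than from public benchmarks.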
Related
- Anthropic: The company behind Claude Opus 4.6 and the broader Claude model family.
- Claude / claude-code: Related product surfaces and developer workflows where Opus-class models may be used.
- Claude Opus 4.7 / sonnet-46: Nearby Anthropic model versions used for comparison on capability, prompting, and cost-performance tradeoffs.
- claude-system-prompts: Directly relevant because Opus 4.6 was analyzed as part of system prompt change tracking.
- Cursor / cursor-30 / composer-2: Coding and agent interfaces where Opus 4.6 appeared in competitive and workflow comparisons.
- PromptLayer: Published a review of Opus 4.6 based on practical engineering usage.
- Mozilla / Firefox: Connected through security-agent testing that reportedly surfaced major vulnerabilities.
- browsecomp / pencil: Examples of benchmark and product contexts in which Opus 4.6 was evaluated or deployed.
- gpt-5-3-codex / gpt-53-codex / gemini-3-flash: Competing models referenced in comparative analysis.
- Simon Willison / claire-vo: Commentators and analysts who helped contextualize Opus 4.6 through prompt analysis and model comparisons.
Newsletter Mentions (11)
“A detailed look at how Anthropic's Claude system prompt changed between Opus 4.6 and 4.7, using their published system prompts as the basis for analysis.”
#2 📝 Simon Willison Changes in the system prompt between Claude Opus 4.6 and 4.7 - A detailed look at how Anthropic's Claude system prompt changed between Opus 4.6 and 4.7, using their published system prompts as the basis for analysis. The post highlights the value of Anthropic publishing system prompts and links to deeper notes and artifacts used in the research.
“Claude Opus 4.7 uses adaptive thinking to allocate less inference time on perceived-easy tasks, which improves its performance over Opus 4.6 on most standard benchmarks but leads to regressions on trick questions (Simple Bench), web browsing (browse_comp), and OCR tests (vs. Gemini 3 Flash).”
#18 ▶️ Claude Opus 4.7 - A New Frontier, in Performance … and Drama AI Explained Claude Opus 4.7 uses adaptive thinking to allocate less inference time on perceived-easy tasks, which improves its performance over Opus 4.6 on most standard benchmarks but leads to regressions on trick questions (Simple Bench), web browsing (browse_comp), and OCR tests (vs. Gemini 3 Flash). On the Simple Bench trick-question benchmark, Claude Opus 4.7 scored lower than Opus 4.6 because it underestimates task difficulty and reduces inference compute.
“Composer 2 outscored Claude Opus 4.6 on “Trust Me Bro” benchmarks for intelligence, speed, and cost, but its metadata model ID revealed it is Moonshot’s Kimi K2 retrained with reinforcement learning.”
#14 ▶️ Cursor ditches VS Code, but not everyone is happy... Fireship Cursor 3.0, rewritten in Rust and TypeScript and powered by its in-house Composer 2 model (based on Moonshot’s Kimi K2), replaces the VS Code fork with an AI-agent orchestration interface across local repos, remote SSH sessions, and the cloud. Composer 2 outscored Claude Opus 4.6 on “Trust Me Bro” benchmarks for intelligence, speed, and cost, but its metadata model ID revealed it is Moonshot’s Kimi K2 retrained with reinforcement learning.
“This article discusses how eval-awareness affects Claude Opus 4.6’s performance on the BrowseComp benchmark, examining interactions between model behavior and evaluation setup.”
This article discusses how eval-awareness affects Claude Opus 4.6’s performance on the BrowseComp benchmark, examining interactions between model behavior and evaluation setup. It emphasizes the role of evaluation design in producing reliable performance measurements.
“Six AI agents powered by Claude Opus 4.6 in Pencil’s new swarm mode collaboratively design three screens of a mobile travel log app with Oceania imagery and export the result as a JSON “pen file” that is then converted into a React + Tailwind + Next.js website running on port 8080.”
Six AI agents powered by Claude Opus 4.6 in Pencil’s new swarm mode collaboratively design three screens of a mobile travel log app with Oceania imagery and export the result as a JSON “pen file” that is then converted into a React + Tailwind + Next.js website running on port 8080. Pencil’s swarm mode (released Tuesday) assigns six subagents to design three app screens in parallel, each subagent indicated by its own cursor on the canvas. The design is stored in a JSON-based “pen file” format that can be converted to Swift iOS, Kotlin, or React Native and has community plugins to export to Figma and Lovable.
“#10 𝕏 Anthropic partnered with Mozilla to test Claude’s Opus 4.6 agent on Firefox, uncovering 22 vulnerabilities in two weeks.”
GenAI PM Daily, March 07, 2026. Today's top 25 insights for PM Builders, ranked by relevance from LinkedIn, YouTube, X, and Blogs. #7 𝕏 Claude launched the Claude Marketplace in limited preview, offering enterprises a centralized platform to streamline and simplify procurement of AI tools. #10 𝕏 Anthropic partnered with Mozilla to test Claude’s Opus 4.6 agent on Firefox, uncovering 22 vulnerabilities in two weeks. Fourteen were high-severity, representing 20% of Mozilla’s 2025 critical fixes.
“#4 📝 PromptLayer Blog Opus 4.6 — PromptLayer Team Review - A team review of Claude Opus 4.6 which landed in February 2026, evaluating its performance across coding workflows, long-document analysis, and agentic pipelines.”
#4 📝 PromptLayer Blog Opus 4.6 — PromptLayer Team Review - A team review of Claude Opus 4.6 which landed in February 2026, evaluating its performance across coding workflows, long-document analysis, and agentic pipelines. #9 𝕏 Boris Cherny says Opus 4.6 and Sonnet 4.6 deliver more intelligent outputs at the cost of higher token usage, and you can use `/model` to set effort to low or medium for lighter, more economical runs.
“claire vo 🖤 breaks down GPT-5.3 Codex vs Claude Opus 4.6 in her latest video and blog post, comparing their code-generation benchmarks, feature sets, and real-world API use cases.”
GenAI PM Daily, February 18, 2026. Today's top 25 insights for PM Builders, ranked by relevance from X, Blogs, YouTube, and LinkedIn. Anthropic Launches Claude Sonnet 4.6 #19 𝕏 claire vo 🖤 breaks down GPT-5.3 Codex vs Claude Opus 4.6 in her latest video and blog post, comparing their code-generation benchmarks, feature sets, and real-world API use cases. #21 𝕏 DeepLearning.AI Andrew Ng urges Hollywood and AI developers to collaborate on shared guardrails around generative AI, based on conversations at Sundance. The Batch also highlights SpaceX’s acquisition of xAI for orbital AI data centers, Claude Opus 4.
“PromptLayer Blog Opus 4.6 — PromptLayer Team Review - PromptLayer's team reviewed Claude Opus 4.6 after extensive testing across coding workflows, long-document analysis, and agentic pipelines. The article shares the team's verdict and insights about how the release performs in real-world engineering scenarios.”
GenAI PM Daily, February 13, 2026. Today's top 25 insights for PM Builders, ranked by relevance from Blogs, X, YouTube, and LinkedIn. OpenAI Introduces GPT-5.3-Codex-Spark Model #1 📝 OpenAI News Introducing GPT-5.3-Codex-Spark - Announces the GPT-5.3-Codex-Spark product release, highlighting new Codex-powered capabilities for developers and product teams. The post introduces the model and its intended use cases and availability. Also covered by: @Simon Willison. #2 𝕏 Demis Hassabis rolled out Gemini 3’s new “Deep Think” mode for Google AI Ultra subscribers in the Gemini App, enabling more advanced reasoning and complex problem-solving capabilities. Also covered by: @Josh Woodward, @Demis Hassabis, @Google AI, @Sundar Pichai. #3 𝕏 Sam Altman launched GPT-5.3-Codex-Spark as a research preview for Pro today, delivering over 1,000 tokens per second with initial limitations that will be rapidly improved.
“Head-to-head testing of OpenAI GPT-5.3 Codex in Codex and Anthropic Opus 4.6 (plus Opus 4.6 Fast) in Cursor to redesign a PLG+enterprise marketing site and refactor core application components, resulting in 93,000 lines of code shipped in five days.”
#5 ▶️ Claude Opus 4.6 vs GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days How I AI Podcast Head-to-head testing of OpenAI GPT-5.3 Codex in Codex and Anthropic Opus 4.6 (plus Opus 4.6 Fast) in Cursor to redesign a PLG+enterprise marketing site and refactor core application components, resulting in 93,000 lines of code shipped in five days.
Related
A coding environment for Claude mentioned for its keyboard shortcut that opens a full-featured editor for prompt writing. It is highlighted as making long prompts far easier to manage.
The company behind Claude, mentioned as working with Peter Yang and Alex Albert on Claude's next iteration. It is referenced in the context of model design, harness design, and feedback evaluation.
Anthropic's AI assistant/model used here in multiple contexts: as the product being built next, as a system used to cluster feedback into synthetic evals, and as a tool that non-technical staff use.
An AI coding tool mentioned as part of the hidden setup tax for non-technical staff without proper enterprise scaffolding. It is referenced alongside Claude and ChatGPT in the context of adoption friction.
Developer and writer known for his AI tooling commentary and the `llm` project. He is credited here with the 0.32a2 release note.
A practitioner who used Claude and Cursor to generate a design system from GitHub repos. Relevant to PMs for rapid product and design-system iteration.
A platform and blog focused on LLM infrastructure and observability. It is relevant to PMs building AI features that need tracing, evaluation, and operational debugging.
A Claude model variant referenced as the basis for Cursor’s Fast mode. It is presented as a higher-cost, faster option for coding tasks.
Anthropic’s engineering group, credited here with a write-up on scaling managed agents. Useful as a source of architecture and design guidance for agent systems.
OpenAI’s coding-focused model/release highlighted for benchmark performance, steerability, and speed improvements. The newsletter frames it as a strong coding agent option with multiple benchmark scores.
An AI design/build tool that uses six agents to craft apps in real time. It is presented as part of the emerging agentic design workflow.
A Claude model version referenced for more intelligent outputs with higher token usage. It is discussed alongside Opus 4.6 and effort settings for economical runs.
A Gemini model used as a cheaper comparison point in benchmark and OCR evaluations. It is cited as outperforming Claude Opus 4.7 on OCR while costing far less per request.
A frontier model in Cursor with high usage limits, positioned for autonomous agent workflows.