Evaluation record · claude-opus-4-6

Claude Opus 4.6

v20260205

Anthropic

Modelcodingreasoningenterprisehipaa-eligible

Exceptional

About This Model

Anthropic's frontier Opus released February 2026 with 80.8% SWE-bench Verified, breakthrough 68.8% ARC-AGI-2 abstract reasoning, adaptive thinking, and a 1M token context window. Now two generations behind Opus 4.8 but still served.

Last Evaluated: July 9, 2026

Official Website

Trust Vector Analysis

Dimension Breakdown

🚀Performance & Reliability

Generational leap in abstract reasoning (68.8% ARC-AGI-2, ~2x Opus 4.5). 80.8% SWE-bench with 1M context and 128K output. Introduced adaptive thinking and GA effort parameter including 'max'. Now two generations behind Opus 4.8 but still fully served.

task accuracy code

Industry-standard coding and agentic benchmarks measuring real-world software engineering and computer-use tasks

Evidence

SWE-bench Verified — 80.8% resolution rate (frontier-class software engineering)

Terminal-Bench 2.0 — 65.4% on command-line tasks (up from Opus 4.5's 59.3%)

OSWorld — 72.7% on computer-use tasks (up from Opus 4.5's 66.3%)

highVerified: 2026-07-09

task accuracy reasoning

Abstract reasoning and multi-step problem solving benchmarks

Evidence

ARC-AGI-2 — 68.8% (up from Opus 4.5's 37.6% — a generational leap in abstract reasoning)

Anthropic Announcement — Adaptive thinking dynamically allocates reasoning depth per request

highVerified: 2026-07-09

task accuracy general

Comprehensive knowledge and multimodal testing across published benchmarks

Evidence

Anthropic Models Documentation — Frontier-class general knowledge and multimodal understanding at launch

highVerified: 2026-07-09

output consistency

Internal testing of output stability across effort levels and adaptive thinking

Evidence

Anthropic Announcement — Effort parameter GA (low/medium/high/max) enables consistent quality control; adaptive thinking replaces manual budgets

highVerified: 2026-07-09

latency p50

Median latency for API requests with standard prompt sizes

Evidence

Community benchmarking — Typical response time ~2.5s for standard prompts at default effort

mediumVerified: 2026-07-09

latency p95

95th percentile response time across diverse workloads

Evidence

Community benchmarking — p95 latency ~5.5s; higher at max effort

mediumVerified: 2026-07-09

context window

Official specification from provider

Evidence

Anthropic Models Documentation — 1M token context window (beta at launch, since standard); 128K max output tokens

highVerified: 2026-07-09

uptime

Historical uptime data from official status page

Evidence

Anthropic Status Page — Claude API uptime 99.57% (last 90 days)

highVerified: 2026-07-09

🛡️Security

Strong safety posture. Removal of last-assistant-turn prefills (400 error) eliminates a common response-manipulation pattern; structured outputs replace it.

prompt injection resistance

Testing against OWASP LLM01 prompt injection attacks

Evidence

Anthropic Safety Research — Improved resistance to prompt injection in agentic and computer-use settings

highVerified: 2026-07-09

jailbreak resistance

Testing against adversarial prompt datasets

Evidence

Anthropic Constitutional AI — Constitutional AI alignment carried forward with enhanced refusal calibration

highVerified: 2026-07-09

data leakage prevention

Analysis of privacy policies and data handling practices

Evidence

Anthropic Privacy Statement — No training on user data without explicit consent

mediumVerified: 2026-07-09

output safety

Comprehensive safety testing across harmful content categories

Evidence

Anthropic Safety Evaluations — Released with comprehensive safety evaluations under the Responsible Scaling Policy

highVerified: 2026-07-09

api security

Review of API security features and best practices

Evidence

Anthropic API Documentation — API key authentication, HTTPS only, rate limiting; removal of last-assistant-turn prefills closes a response-steering vector

highVerified: 2026-07-09

🔒Privacy & Compliance

Exceptional privacy posture with ephemeral data handling and strong compliance certifications. HIPAA eligible for healthcare.

data residency

Review of enterprise documentation and privacy policies

Evidence

Anthropic Enterprise Documentation — Data residency options for US and EU customers

highVerified: 2026-07-09

training data optout

Analysis of privacy policy and data usage terms

Evidence

Anthropic Privacy Policy — Opt-out available, no training on API data by default

highVerified: 2026-07-09

data retention

Review of terms of service and data retention policies

Evidence

Anthropic Terms of Service — API prompts and outputs not retained (except for trust & safety)

highVerified: 2026-07-09

pii handling

Review of data protection capabilities and customer responsibilities

Evidence

Anthropic Privacy Documentation — Customer responsible for PII redaction

mediumVerified: 2026-07-09

compliance certifications

Verification of compliance certifications and audit reports

Evidence

Anthropic Trust Center — SOC 2 Type II, GDPR compliant, HIPAA eligible

highVerified: 2026-07-09

zero data retention

Review of data handling practices

Evidence

Anthropic API Documentation — Ephemeral data processing, no storage of prompts/outputs

highVerified: 2026-07-09

👁️Trust & Transparency

Adaptive thinking improves transparency by making reasoning depth model-driven and observable. Strong instruction following reduces need for aggressive prompt engineering.

explainability

Evaluation of reasoning transparency and explanation capabilities

Evidence

Adaptive Thinking Feature — Adaptive thinking surfaces reasoning depth decisions; effort parameter provides explicit compute transparency

highVerified: 2026-07-09

hallucination rate

Testing on factual QA datasets and real-world usage

Evidence

Anthropic Testing — Improved factual calibration over Opus 4.5, especially at high and max effort

mediumVerified: 2026-07-09

bias fairness

Evaluation on bias benchmarks and diverse demographic testing

Evidence

Anthropic Responsible Scaling Policy — Regular bias testing and mitigation

mediumVerified: 2026-07-09

uncertainty quantification

Qualitative assessment of confidence expression in outputs

Evidence

Model Behavior — Model expresses uncertainty appropriately; adaptive thinking scales effort with problem difficulty

mediumVerified: 2026-07-09

model card quality

Review of documentation completeness and clarity

Evidence

Anthropic Model Documentation — Comprehensive model cards with capabilities, limitations, benchmarks

highVerified: 2026-07-09

training data transparency

Review of public disclosures about training data

Evidence

Anthropic Public Statements — General description provided, detailed sources not disclosed

mediumVerified: 2026-07-09

guardrails

Analysis of built-in safety mechanisms

Evidence

Constitutional AI — Constitutional AI safety guardrails with improved refusal calibration

highVerified: 2026-07-09

⚙️Operational Excellence

Mature operational profile with multi-cloud availability. Migration to 4.6 required removing assistant-turn prefills and moving to adaptive thinking — well-documented breaking changes.

api design quality

Review of API design, consistency, and feature completeness

Evidence

Anthropic API Documentation — Adaptive thinking, GA effort parameter (incl. max), structured outputs; prefills removed in favor of output_config.format

highVerified: 2026-07-09

sdk quality

Review of SDK quality, documentation, and maintenance

Evidence

Anthropic SDKs — Official SDKs for Python, TypeScript, Java, Go, Ruby, C#, PHP — actively maintained

highVerified: 2026-07-09

versioning policy

Review of versioning policy and historical practices

Evidence

Anthropic API Versioning — Clear versioning with advance deprecation notice; Opus 4.6 remains served two generations behind Opus 4.8

Anthropic Model Deprecations — Active; tentative retirement not sooner than February 5, 2027

highVerified: 2026-07-09

monitoring observability

Review of available monitoring tools and metrics

Evidence

Anthropic Console — Usage dashboard with metrics

mediumVerified: 2026-07-09

support quality

Assessment of documentation, community, and support responsiveness

Evidence

Anthropic Support — Email support, Discord community, comprehensive docs and migration guides

highVerified: 2026-07-09

ecosystem maturity

Analysis of third-party integrations and tools

Evidence

Cloud Providers — Available on AWS Bedrock, Google Vertex AI, Azure Foundry

highVerified: 2026-07-09

license terms

Review of licensing terms and restrictions

Evidence

Anthropic Terms of Service — Standard commercial terms, enterprise agreements available

highVerified: 2026-07-09

Strengths

+Breakthrough abstract reasoning: 68.8% ARC-AGI-2 (up from Opus 4.5's 37.6%)
+Elite coding: 80.8% SWE-bench Verified, 65.4% Terminal-Bench 2.0
+Best-in-class computer use at launch: 72.7% OSWorld
+1M token context window (beta at launch) with 128K max output
+Adaptive thinking replaces manual thinking budgets — no tuning required
+Effort parameter GA including new 'max' level for compute control
+Same $5/$25 pricing as Opus 4.5 despite major capability gains

Limitations

!Two generations behind current Opus 4.8 (still served, but no longer frontier)
!Removed last-assistant-turn prefills — code relying on prefills returns 400
!Higher latency than Sonnet models (~2.5s p50)
!Premium pricing relative to Sonnet 4.6 ($5/$25 vs $3/$15)
!No native audio capabilities
!Training data transparency limited (industry standard)

Metadata

pricing

input: $5.00 per 1M tokens

output: $25.00 per 1M tokens

notes: Same pricing as Opus 4.5. Batch API 50% discount. Prompt caching up to 90% savings. Confirmed unchanged at $5/$25 as of 2026-07-09.

last verified: 2026-07-09

context window: 1000000

max output: 128000

languages

0: English

1: Spanish

2: French

3: German

4: Italian

5: Portuguese

6: Japanese

7: Korean

8: Chinese

9: Arabic

10: Hindi

modalities

0: text

1: image (input)

2: document

3: computer-use

api endpoint: https://api.anthropic.com/v1/messages

open source: false

architecture: Transformer-based with Constitutional AI alignment, adaptive thinking, and effort parameter

parameters: Not disclosed

knowledge cutoff: May 2025 (reliable); training data through August 2025

Use Case Ratings

code generation

80.8% SWE-bench Verified and 65.4% Terminal-Bench 2.0. Excellent for complex software engineering, though Opus 4.7/4.8 now lead the family.

customer support

Strong empathy and natural conversation. Higher latency and cost than Sonnet for routine support volume.

content creation

Excellent long-form, nuanced content. Adaptive thinking allocates more reasoning to complex pieces automatically.

data analysis

Strong analytical capabilities with 1M context for large datasets. Effort 'max' useful for complex interpretation.

research assistant

1M context and 68.8% ARC-AGI-2 abstract reasoning make it exceptional for deep research and synthesis.

legal compliance

Strong privacy posture, HIPAA eligible. 1M context handles entire contract repositories in a single request.

healthcare

HIPAA eligible with strong privacy controls. Good for clinical documentation requiring high accuracy.

financial analysis

Excellent quantitative reasoning. Adaptive thinking scales analysis depth with problem complexity.

education

Excellent tutoring with patient explanations. Effort parameter lets platforms balance quality against cost.

creative writing

Strong creative capabilities with nuanced character development and narrative flow.

Similar Models

Claude Opus 4.8

Anthropic

Claude Opus 4.7

Anthropic

Claude Opus 4.5

Anthropic

Claude Sonnet 4.6

Anthropic