Evaluation record · gpt-5-5

GPT-5.5

vgpt-5-5-2026-04-23

OpenAI

Modelreasoningflagshipmillion-token-contextagentic

Exceptional

About This Model

OpenAI's flagship from April 2026 (codename 'Spud'), first fully retrained base model since GPT-4.5. Succeeded at the top of the lineup by the GPT-5.6 family (Sol/Terra/Luna, publicly released 2026-07-09) but fully supported with no announced shutdown, and still the designated migration target for most of the GPT-5.x line. ~1.05M context, 85.0% ARC-AGI-2, 93.6% GPQA Diamond, 58.6% SWE-Bench Pro.

Last Evaluated: July 9, 2026

Official Website

Trust Vector Analysis

Dimension Breakdown

🚀Performance & Reliability

First fully retrained base since GPT-4.5. State-of-the-art across reasoning (85.0% ARC-AGI-2, 93.6% GPQA) and agentic coding (82.7% Terminal-Bench 2.0). ~40% more token-efficient than GPT-5.4.

task accuracy code

Industry-standard coding and terminal benchmarks measuring real-world software engineering tasks

Evidence

SWE-Bench Pro — 58.6% on SWE-Bench Pro (state-of-the-art at release)

Terminal-Bench 2.0 — 82.7% on command-line agentic tasks

highVerified: 2026-07-09

task accuracy reasoning

PhD-level science, frontier mathematics, and abstract reasoning benchmarks

Evidence

GPQA Diamond — 93.6% (PhD-level science questions)

ARC-AGI-2 — 85.0% (large jump over GPT-5.2's 52.9%, industry-leading abstract reasoning)

FrontierMath Tier 1-3 — 51.7% (up from GPT-5.2's 40.3%)

highVerified: 2026-07-09

task accuracy general

Expert-comparison knowledge work and computer-use benchmarks

Evidence

GDPval — 84.9% on economically valuable knowledge-work tasks

OSWorld-Verified — 78.7% on computer-use tasks (improving on GPT-5.4's 75%)

highVerified: 2026-07-09

output consistency

Internal consistency testing reported by provider across reasoning effort levels

Evidence

OpenAI Announcement — First fully retrained base since GPT-4.5; ~40% fewer output tokens than GPT-5.4 for equivalent quality

highVerified: 2026-07-09

latency p50

Median latency for API requests with standard prompt sizes

Evidence

Community benchmarking — Token efficiency gains (~40% fewer output tokens) reduce end-to-end response times vs GPT-5.4

mediumVerified: 2026-07-09

latency p95

95th percentile response time across diverse workloads

Evidence

Community benchmarking — p95 latency varies significantly with reasoning effort setting

mediumVerified: 2026-07-09

context window

Official specification from provider

Evidence

OpenAI Documentation — ~1.05M token input context, 128K max output tokens

highVerified: 2026-07-09

uptime

Historical uptime data from official status page

Evidence

OpenAI Status — 99.9% uptime (last 90 days)

highVerified: 2026-07-09

🛡️Security

Mature multi-layer safety stack. Retrained base required full safety recalibration, which OpenAI reports as complete; long-tail agentic behaviors still being characterized by third parties.

prompt injection resistance

Testing against OWASP LLM01 prompt injection attacks

Evidence

OpenAI Safety Research — Strengthened injection defenses for agentic and computer-use workflows

mediumVerified: 2026-07-09

jailbreak resistance

Adversarial prompt testing against jailbreak datasets

Evidence

GPT-5.5 Announcement — Retrained base with updated safety training; improved refusal robustness over GPT-5.4

mediumVerified: 2026-07-09

data leakage prevention

Analysis of privacy policies and data handling practices

Evidence

OpenAI Privacy Policy — No training on API data by default

mediumVerified: 2026-07-09

output safety

Safety testing across harmful content categories

Evidence

OpenAI Safety — Multi-layer safety stack carried forward and recalibrated for the retrained base

highVerified: 2026-07-09

api security

Review of API security features and best practices

Evidence

OpenAI Platform Docs — API key + OAuth2 authentication, HTTPS only, rate limiting

highVerified: 2026-07-09

🔒Privacy & Compliance

Standard OpenAI enterprise posture: SOC 2, no API-data training by default, 30-day default retention with zero-data-retention options.

data residency

Review of enterprise documentation

Evidence

OpenAI Enterprise — Data residency options for enterprise customers

highVerified: 2026-07-09

training data optout

Policy review of data usage terms

Evidence

OpenAI Data Controls — API data not used for training by default

highVerified: 2026-07-09

data retention

Evidence

OpenAI Terms — 30-day default API log retention; zero-data-retention options for qualifying customers

highVerified: 2026-07-09

pii handling

Review of data protection capabilities

Evidence

OpenAI Safety Tools — Customer responsible for PII redaction; moderation API available

mediumVerified: 2026-07-09

compliance certifications

Verification of compliance certifications

Evidence

OpenAI Trust Center — SOC 2 Type II, ISO 27001, GDPR compliant

highVerified: 2026-07-09

zero data retention

Enterprise feature review

Evidence

OpenAI Enterprise — Zero-data-retention options available for enterprise and qualifying API customers

highVerified: 2026-07-09

👁️Trust & Transparency

Strong transparency with reasoning summaries and detailed release documentation. Training data disclosure remains at industry-standard (limited) level.

explainability

Evaluation of reasoning transparency and explanation capabilities

Evidence

GPT-5.5 Announcement — Adjustable reasoning effort with visible reasoning summaries

highVerified: 2026-07-09

hallucination rate

Factual accuracy testing on QA datasets

Evidence

OpenAI Announcement — Retrained base continues factuality gains; builds on GPT-5.4's ~33% factual-error reduction

mediumVerified: 2026-07-09

bias fairness

Bias benchmarks and demographic testing

Evidence

OpenAI Safety — Regular bias testing and red-teaming program

mediumVerified: 2026-07-09

uncertainty quantification

Qualitative assessment of confidence expression in outputs

Evidence

OpenAI Documentation — Improved calibrated uncertainty expression with reduced confident errors

mediumVerified: 2026-07-09

model card quality

Documentation completeness and clarity review

Evidence

GPT-5.5 Announcement and System Card — Detailed release documentation covering capabilities, benchmarks, and migration guidance

highVerified: 2026-07-09

training data transparency

Review of public disclosures about training data

Evidence

OpenAI Blog — General description of training approach; specific sources not disclosed

mediumVerified: 2026-07-09

guardrails

Analysis of built-in safety mechanisms

Evidence

OpenAI Safety Systems — Multi-layer safety guardrails with agentic-workflow protections

highVerified: 2026-07-09

⚙️Operational Excellence

Industry-leading operational maturity. As the designated GPT-5.x migration target, GPT-5.5 offers a long expected support horizon even after the GPT-5.6 family (Sol/Terra/Luna) launched 2026-07-09.

api design quality

Review of API design, consistency, and feature completeness

Evidence

OpenAI API Reference — Responses API with streaming, function calling, vision, computer use, reasoning effort control

highVerified: 2026-07-09

sdk quality

SDK quality, documentation, and maintenance review

Evidence

OpenAI SDKs — Official SDKs for Python, Node.js, Go, .NET, actively maintained

highVerified: 2026-07-09

versioning policy

Review of versioning policy and historical deprecation practices

Evidence

OpenAI Deprecations — Published deprecation schedule; GPT-5.5 is the designated migration target for most of the GPT-5.x line

OpenAI Deprecations — Re-verified: GPT-5.5 remains the named replacement for gpt-5, gpt-5.1, gpt-5.2-chat-latest, gpt-5.2-codex, and gpt-5.3-chat-latest deprecations; no GPT-5.5 deprecation announced

OpenAI: Previewing GPT-5.6 Sol — GPT-5.6 family (Sol/Terra/Luna) previewed 2026-06-26 under U.S. government-requested partner-only restrictions and publicly released 2026-07-09; Sol priced $5/$30, Terra $2.50/$15, Luna $1/$6 per 1M tokens

highVerified: 2026-07-09

monitoring observability

Review of available monitoring tools and metrics

Evidence

OpenAI Dashboard — Detailed usage dashboard with costs, tokens, rate limits

highVerified: 2026-07-09

support quality

Support and documentation assessment

Evidence

OpenAI Support — 24/7 support, comprehensive docs, active developer community

highVerified: 2026-07-09

ecosystem maturity

Ecosystem breadth and depth analysis

Evidence

OpenAI Platform — Largest AI ecosystem; available in ChatGPT (2026-04-23) and API (2026-04-24) with Batch and Flex tiers

highVerified: 2026-07-09

license terms

Review of licensing terms and restrictions

Evidence

OpenAI Terms — Standard commercial terms with usage policies

highVerified: 2026-07-09

Strengths

+Industry-leading abstract reasoning: 85.0% ARC-AGI-2, 93.6% GPQA Diamond
+State-of-the-art agentic coding: 58.6% SWE-Bench Pro, 82.7% Terminal-Bench 2.0
+~1.05M token input context with 128K output
+First fully retrained base since GPT-4.5 with ~40% fewer output tokens than GPT-5.4
+84.9% GDPval on economically valuable knowledge work
+Designated long-term migration target for the GPT-5.x line
+Batch/Flex tiers at 50% discount

Limitations

!Premium pricing: $5/$30 per 1M tokens (2x GPT-5.4's base rate)
!No longer the newest model: GPT-5.6 family released 2026-07-09; GPT-5.6 Terra is reported competitive with GPT-5.5 at half the price ($2.50/$15)
!Not HIPAA eligible
!30-day default API data retention (zero retention requires enterprise arrangement)
!GPT-5.5 Pro is very expensive ($30/$180 per 1M)
!Recently retrained base — long-tail behaviors less battle-tested than GPT-5.x predecessors
!Training data transparency limited (industry standard)

Metadata

pricing

input: $5.00 per 1M tokens

output: $30.00 per 1M tokens

notes: Confirmed on official pricing page 2026-07-09. Cached input $0.50 per 1M; Batch at 50% discount ($2.50/$15). GPT-5.5 Pro priced at $30/$180 per 1M tokens.

last verified: 2026-07-09

context window: 1050000

max output: 128000

languages

0: English

1: Spanish

2: French

3: German

4: Italian

5: Portuguese

6: Japanese

7: Korean

8: Chinese

9: Russian

10: Arabic

11: Hindi

12: 50+ languages

modalities

0: text

1: vision

2: computer-use

api endpoint: https://api.openai.com/v1/responses

open source: false

architecture: Transformer-based; first fully retrained base since GPT-4.5 ('Spud')

parameters: Not disclosed

knowledge cutoff: December 2025

Use Case Ratings

code generation

58.6% SWE-Bench Pro and 82.7% Terminal-Bench 2.0. ~1.05M context fits very large codebases. Codex variants remain preferable for dedicated agentic coding pipelines.

customer support

Token efficiency (~40% fewer output tokens) lowers cost per conversation. $5/$30 pricing is premium for high-volume support.

content creation

Retrained base produces concise, higher-quality drafts. Strong long-form coherence over very long contexts.

data analysis

93.6% GPQA and 51.7% FrontierMath T1-3 support rigorous quantitative work. ~1.05M context enables whole-dataset reasoning.

research assistant

84.9% GDPval on expert knowledge work with ~1.05M context for literature-scale inputs.

legal compliance

Strong document reasoning; SOC 2 and zero-data-retention options available, but not HIPAA eligible and 30-day default retention.

healthcare

Excellent clinical reasoning (93.6% GPQA) but not HIPAA eligible; privacy controls less strict than Anthropic's.

financial analysis

Frontier math performance (51.7% FrontierMath T1-3) and GDPval results support complex financial modeling.

education

Top-tier STEM reasoning with adjustable effort for tutoring at different depths. Reduced hallucinations vs prior generations.

creative writing

Strong narrative quality; conciseness bias from token-efficiency training can need prompting for expansive prose.

Similar Models

GPT-5.4

OpenAI

GPT-5.3-Codex

OpenAI

GPT-5.2

OpenAI

Claude Opus 4.8

Anthropic

Gemini 3.1 Pro

Google