Evaluation record · gpt-5-4

GPT-5.4

vgpt-5-4-2026-03-05

OpenAI

Modelcomputer-useprevious-flagshipmillion-token-contextfactuality

Exceptional

About This Model

OpenAI's previous-generation flagship (superseded by GPT-5.5 in April 2026 and the GPT-5.6 family in July 2026). Fully supported with no announced shutdown; gpt-5.4-mini and gpt-5.4-nano are OpenAI's designated replacements for retiring GPT-5 mini/nano and gpt-4.1-nano. Headline native computer use with 75% OSWorld-Verified, ~33% fewer factual errors than GPT-5.2, ~1.05M context. Thinking, Pro, mini, and nano variants.

Last Evaluated: July 9, 2026

Official Website

Trust Vector Analysis

Dimension Breakdown

🚀Performance & Reliability

Strong generation defined by native computer use (75% OSWorld-Verified, +27.7 points over GPT-5.2) and a ~33% factual-error reduction. Superseded by GPT-5.5 six weeks after release.

task accuracy code

Industry-standard coding benchmarks and provider-reported comparisons

Evidence

OpenAI Announcement — Incremental coding gains over GPT-5.2; release emphasis on computer use rather than coding (GPT-5.3-Codex remained the coding lead at launch)

mediumVerified: 2026-07-09

task accuracy reasoning

PhD-level and Olympiad-level reasoning benchmarks

Evidence

OpenAI Announcement — Reasoning improvements over GPT-5.2 across science and math suites; Thinking and Pro variants for deeper reasoning

mediumVerified: 2026-07-09

task accuracy general

Computer-use benchmarks and factual accuracy testing

Evidence

OSWorld-Verified — 75% on OSWorld-Verified with native computer use (vs GPT-5.2's 47.3%)

OpenAI Announcement — ~33% fewer factual errors than GPT-5.2

highVerified: 2026-07-09

output consistency

Internal testing across model variants

Evidence

OpenAI Announcement — ~33% reduction in factual errors vs GPT-5.2 improves run-to-run reliability

highVerified: 2026-07-09

latency p50

Median latency for API requests with standard prompt sizes

Evidence

Community benchmarking — mini and nano variants optimized for low latency

mediumVerified: 2026-07-09

latency p95

95th percentile response time across diverse workloads

Evidence

Community benchmarking — p95 latency varies by variant and reasoning depth

mediumVerified: 2026-07-09

context window

Official specification from provider

Evidence

OpenAI Documentation — ~1.05M token input context, 128K max output tokens

highVerified: 2026-07-09

uptime

Historical uptime data from official status page

Evidence

OpenAI Status — 99.9% uptime (last 90 days)

highVerified: 2026-07-09

🛡️Security

Solid security posture with computer-use-specific guardrails. Native computer use expands the attack surface (UI-based injection) relative to text-only models.

prompt injection resistance

Testing against OWASP LLM01 prompt injection attacks

Evidence

OpenAI Safety Research — Injection defenses extended to native computer-use surfaces (screenshots, UI text)

mediumVerified: 2026-07-09

jailbreak resistance

Adversarial prompt testing against jailbreak datasets

Evidence

OpenAI Announcement — Improved refusal robustness over GPT-5.2 per release documentation

mediumVerified: 2026-07-09

data leakage prevention

Analysis of privacy policies and data handling practices

Evidence

OpenAI Privacy Policy — No training on API data by default

mediumVerified: 2026-07-09

output safety

Safety testing across harmful content categories

Evidence

OpenAI Safety — Multi-layer safety with additional guardrails for autonomous computer-use actions

highVerified: 2026-07-09

api security

Review of API security features and best practices

Evidence

OpenAI Platform Docs — API key + OAuth2 authentication, HTTPS only, rate limiting

highVerified: 2026-07-09

🔒Privacy & Compliance

Standard OpenAI enterprise posture: SOC 2, no API-data training by default, zero-data-retention options. Not HIPAA eligible.

data residency

Review of enterprise documentation

Evidence

OpenAI Enterprise — Data residency options for enterprise customers

highVerified: 2026-07-09

training data optout

Policy review of data usage terms

Evidence

OpenAI Data Controls — API data not used for training by default

highVerified: 2026-07-09

data retention

Evidence

OpenAI Terms — 30-day default API log retention; zero-data-retention options for qualifying customers

highVerified: 2026-07-09

pii handling

Review of data protection capabilities

Evidence

OpenAI Safety Tools — Customer responsible for PII redaction; moderation API available

mediumVerified: 2026-07-09

compliance certifications

Verification of compliance certifications

Evidence

OpenAI Trust Center — SOC 2 Type II, ISO 27001, GDPR compliant

highVerified: 2026-07-09

zero data retention

Enterprise feature review

Evidence

OpenAI Enterprise — Zero-data-retention options available for enterprise and qualifying API customers

highVerified: 2026-07-09

👁️Trust & Transparency

Marked transparency improvement via the ~33% factual-error reduction. Computer-use action logging aids auditability of agentic runs.

explainability

Evaluation of reasoning transparency and explanation capabilities

Evidence

GPT-5.4 Thinking Variant — Thinking variant exposes reasoning summaries; computer-use actions are step-logged

highVerified: 2026-07-09

hallucination rate

Factual accuracy testing on QA datasets

Evidence

OpenAI Announcement — ~33% fewer factual errors than GPT-5.2

highVerified: 2026-07-09

bias fairness

Bias benchmarks and demographic testing

Evidence

OpenAI Safety — Regular bias testing and red-teaming program

mediumVerified: 2026-07-09

uncertainty quantification

Qualitative assessment of confidence expression in outputs

Evidence

OpenAI Documentation — Better uncertainty expression accompanying the factual-error reduction

mediumVerified: 2026-07-09

model card quality

Documentation completeness and clarity review

Evidence

GPT-5.4 Announcement and System Card — Detailed release documentation covering variants, computer use, and safety evaluations

highVerified: 2026-07-09

training data transparency

Review of public disclosures about training data

Evidence

OpenAI Blog — General description of training approach; specific sources not disclosed

mediumVerified: 2026-07-09

guardrails

Analysis of built-in safety mechanisms

Evidence

OpenAI Safety Systems — Multi-layer safety guardrails including computer-use action confirmation

highVerified: 2026-07-09

⚙️Operational Excellence

Excellent operational maturity with the full variant family. Procurement caveat: superseded by GPT-5.5 only six weeks after release; plan migrations accordingly.

api design quality

Review of API design, consistency, and feature completeness

Evidence

OpenAI API Reference — Responses API with streaming, function calling, vision, and native computer use

highVerified: 2026-07-09

sdk quality

SDK quality, documentation, and maintenance review

Evidence

OpenAI SDKs — Official SDKs for Python, Node.js, Go, .NET, actively maintained

highVerified: 2026-07-09

versioning policy

Review of versioning policy and historical deprecation practices

Evidence

OpenAI Deprecations — Clear deprecation schedule; GPT-5.4 remains supported but OpenAI designates GPT-5.5 as the migration target for the GPT-5.x line

OpenAI Deprecations — Re-verified: GPT-5.4 not on the deprecations list; gpt-5.4-mini and gpt-5.4-nano are the named replacements for gpt-5-mini/gpt-5-nano (shutdown 2026-12-11) and gpt-4.1-nano (shutdown 2026-10-23)

highVerified: 2026-07-09

monitoring observability

Review of available monitoring tools and metrics

Evidence

OpenAI Dashboard — Detailed usage dashboard with costs, tokens, rate limits

highVerified: 2026-07-09

support quality

Support and documentation assessment

Evidence

OpenAI Support — 24/7 support, comprehensive docs, active developer community

highVerified: 2026-07-09

ecosystem maturity

Ecosystem breadth and depth analysis

Evidence

OpenAI Platform — Largest AI ecosystem; full variant family (Thinking, Pro, mini, nano) for cost/latency tiers

highVerified: 2026-07-09

license terms

Review of licensing terms and restrictions

Evidence

OpenAI Terms — Standard commercial terms with usage policies

highVerified: 2026-07-09

Strengths

+Native computer use: 75% OSWorld-Verified (vs GPT-5.2's 47.3%)
+~33% fewer factual errors than GPT-5.2
+~1.05M token input context with 128K output
+Full variant family: Thinking, Pro, mini, nano for cost/latency tiers
+Better price-performance than GPT-5.5 ($2.50/$15 vs $5/$30)
+Mature OpenAI ecosystem and tooling

Limitations

!Superseded by GPT-5.5 six weeks after release — previous-generation flagship
!Input pricing doubles above 272K input tokens
!Not HIPAA eligible
!30-day default API data retention
!Behind GPT-5.5 on reasoning and coding benchmarks (e.g., 75% vs 78.7% OSWorld-Verified)
!Computer use expands prompt-injection attack surface

Metadata

pricing

input: $2.50 per 1M tokens

output: $15.00 per 1M tokens

notes: Confirmed on official pricing page 2026-07-09 (Batch 50% discount). Variants: gpt-5.4-mini $0.75/$4.50, gpt-5.4-nano $0.20/$1.25, gpt-5.4-pro $30/$180 per 1M. Base-rate applies up to 272K input tokens; pricing doubles above that threshold (per 2026-06-10 verification).

last verified: 2026-07-09

context window: 1050000

max output: 128000

languages

0: English

1: Spanish

2: French

3: German

4: Italian

5: Portuguese

6: Japanese

7: Korean

8: Chinese

9: Russian

10: Arabic

11: Hindi

12: 50+ languages

modalities

0: text

1: vision

2: computer-use

api endpoint: https://api.openai.com/v1/responses

open source: false

architecture: Transformer-based with native computer-use capability and variant family

parameters: Not disclosed

knowledge cutoff: Late 2025

Use Case Ratings

code generation

Capable generalist coder with ~1.05M context, but GPT-5.3-Codex and GPT-5.5 lead on agentic coding benchmarks.

customer support

mini and nano variants give strong cost/latency tiers; ~33% fewer factual errors improves answer reliability.

content creation

Strong general writing with long-context coherence. Good value at $2.50/$15 vs GPT-5.5's $5/$30.

data analysis

Solid analytical reasoning with ~1.05M context for whole-dataset work; GPT-5.5 leads on frontier math.

research assistant

Reduced factual errors and very long context suit research; native computer use can drive live source gathering.

legal compliance

Good document analysis but not HIPAA eligible; 30-day default retention may be a concern.

healthcare

Not HIPAA eligible. Improved factuality helps clinical documentation but privacy controls remain a gap.

financial analysis

Strong quantitative reasoning at lower cost than GPT-5.5; computer use enables workflow automation across financial tools.

education

Reliable explanations with fewer factual errors; nano/mini tiers economical for high-volume tutoring.

creative writing

Capable narrative writing and good instruction following; flagship successors edge it on nuance.

Similar Models

GPT-5.5

OpenAI

GPT-5.3-Codex

OpenAI

GPT-5.2

OpenAI

Gemini 3.1 Pro

Google