For the first time, an AI model is more reliable at using a computer than the human experts it was measured against. OpenAI's GPT-5.4 scored 75% on OSWorld, a benchmark that asks models to complete real desktop tasks in real operating environments. The human expert baseline on the same tasks is 72.4%. No model had ever crossed that line before; GPT-5.4 cleared it by 2.6 points.
This matters for a specific, narrow reason that is easy to miss in the headline. It isn't that GPT-5.4 is smarter than humans at everything; it isn't. It's that the specific combination of capabilities agentic computer use requires (vision, planning, tool invocation, error recovery) has quietly been getting better for months, and March was the month the line was crossed. If you build with AI agents, you need to understand what actually changed.
The 5-Second Version
- GPT-5.4 scored 75% on OSWorld — the first model to exceed the 72.4% human expert baseline.
- OSWorld tests desktop automation: clicking, forms, file systems, browsers across real applications.
- Progression is steep: GPT-5.2 scored 47.3%, GPT-5.3-Codex 64%, GPT-5.4 75%. That's 27.7 points in six months.
- Also shipped: 1M token context, unified architecture (reasoning + coding + agents), native tool search.
- Available through the OpenAI API at ~$2.50 per million input tokens.
- Real-world deployment still gated by safety, permissions, and error-recovery engineering.
Why OSWorld Is the Benchmark That Matters
Benchmarks age fast in AI. A year ago, "passing" MMLU was considered remarkable. Today most frontier models saturate it. OSWorld is different for one reason: it measures whether a model can actually do computer work, not just reason about it.
[Chart: OSWorld progression for OpenAI models, 2025–2026. GPT-5.2: 47.3%, GPT-5.3-Codex: 64%, GPT-5.4: 75%.]
The tasks on OSWorld aren't trivia or math problems. They're things like: "Open this spreadsheet, copy the total from row 14 into the email draft in the other window, attach the PDF from the downloads folder, and send it." Completing that reliably requires coordinating vision (find the cell), planning (sequence the steps), tool use (switch applications), and error recovery (what if the copy fails?).
That's why a 27.7-point jump from GPT-5.2 to GPT-5.4 in six months is the number to pay attention to. On MMLU that jump would mean nothing. On OSWorld it means AI agents went from unreliable to usable in a single product cycle.
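To see how those four capabilities interlock, here is a minimal sketch of the control loop behind a computer-use agent. Every helper in it (capture_screen, plan_next_action, execute, verify) is a hypothetical stub standing in for a model capability, not an OpenAI API; the shape of the loop is the point.

def capture_screen() -> str:
    # Vision (stubbed): return whatever the agent can currently see.
    return "screenshot-placeholder"

def plan_next_action(goal: str, screen: str, history: list[str]) -> str | None:
    # Planning (stubbed): decide the next step, or None when done.
    return None if history else "copy:row-14-total"

def execute(action: str) -> bool:
    # Tool use (stubbed): perform the step, e.g. a click or app switch.
    return True

def verify(action: str, ok: bool) -> bool:
    # Error recovery (stubbed): check the step actually took effect.
    return ok

def run_task(goal: str, max_steps: int = 50) -> bool:
    history: list[str] = []
    for _ in range(max_steps):
        screen = capture_screen()                         # vision
        action = plan_next_action(goal, screen, history)  # planning
        if action is None:
            return True                                   # goal judged complete
        ok = execute(action)                              # tool use
        if not verify(action, ok):
            continue                                      # recover: replan from the new state
        history.append(action)
    return False

print(run_task("copy the total from row 14 into the email draft"))

A failure at any one stage sinks the whole task, which is why the composite score moved so slowly for so long and then jumped.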
What Else Is New
OSWorld is the headline but GPT-5.4 shipped with three other upgrades that matter for builders.
1 Million Token Context
Full codebases, entire meeting transcripts, or an entire day's worth of operational logs can go into a single call. This collapses a lot of the complexity of RAG for medium-size retrieval problems.
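As a rough sketch of what that enables, using the same Responses API call shape as the main example later in this post (the log path and prompt here are made up): concatenate the corpus and send it in one request. At the ~$2.50-per-million-input-token price quoted above, even a full 1M-token window costs about $2.50 in input tokens per call.

from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Concatenate a day of operational logs; with a 1M-token window this
# fits in one request instead of a chunk-and-retrieve pipeline.
logs = "\n\n".join(
    p.read_text() for p in sorted(Path("logs/2026-03-14").glob("*.log"))
)

response = client.responses.create(
    model="gpt-5.4",
    input=(
        "Below is one day of operational logs.\n\n"
        f"{logs}\n\n"
        "List every deploy that was followed by an error-rate spike "
        "within ten minutes, with timestamps."
    ),
)
print(response.output_text)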
Unified Architecture
Previous GPT-5 generations had separate specialists: GPT-5 for chat, Codex for code, the o-series for reasoning. GPT-5.4 folds all of them into one model that selects the right mode at runtime. One endpoint, fewer decisions.
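In practice the win is deleting routing code. A sketch, with placeholder tasks, of what used to require a per-task model picker:

from openai import OpenAI

client = OpenAI()

# What used to be three model choices (chat, code, reasoning) is now
# one endpoint; no router needed.
tasks = [
    "Summarize this support thread in two sentences: ...",         # chat
    "Write a Python function that parses ISO-8601 durations.",     # coding
    "If a task takes 3 workers 8 hours, how long for 5 workers?",  # reasoning
]

for task in tasks:
    response = client.responses.create(model="gpt-5.4", input=task)
    print(response.output_text[:100])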
Native Tool Search
Instead of loading every tool definition into context up front, the model is given a lightweight list and looks up full definitions on demand. Tool-heavy agents with hundreds of tools become viable.
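The exact API surface isn't something this post spells out, so here is the pattern sketched at the application level: a registry that keeps full schemas out of context, plus a lookup that resolves a definition only when it's needed. All names here are illustrative.

# Illustrative registry: full schemas live outside the prompt.
TOOL_REGISTRY = {
    "create_invoice": {
        "description": "Create an invoice in the billing system.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["customer_id", "amount_cents"],
        },
    },
    # ... hundreds more ...
}

# What the model sees up front: names and one-line descriptions only.
tool_index = [
    {"name": name, "description": spec["description"]}
    for name, spec in TOOL_REGISTRY.items()
]

def lookup_tool(name: str) -> dict:
    # Resolved on demand, once the model commits to a tool.
    return TOOL_REGISTRY[name]

print(tool_index)
print(lookup_tool("create_invoice"))

The context cost now scales with the number of tools the model actually uses, not the number it could use.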
Computer Use Built In
Vision, UI reasoning, and action-taking are no longer a separate product. They're a mode of the base model. Turn it on, give it a sandbox, and GPT-5.4 can actually drive a browser or a desktop.
What It Looks Like in Code
If you've never wired up a computer-use agent, this is roughly the shape of the workflow: you give GPT-5.4 a scoped environment and a goal, and let it drive.
from openai import OpenAI
from sandbox import DesktopEnv

client = OpenAI()

# Boot a snapshotted desktop environment for the agent to drive.
env = DesktopEnv(snapshot="ubuntu-24.04-office")

response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use", "env": env.handle}],
    input=(
        "Open the spreadsheet on the desktop. Copy the "
        "total from row 14 and paste it into the draft email "
        "in Thunderbird. Attach the latest PDF from Downloads "
        "and send it to [email protected]."
    ),
)

# Each step records what the agent tried and how it turned out.
for step in response.steps:
    print(f"{step.action}: {step.target} → {step.result}")
The real work in production isn't the API call; it's the sandbox. Giving an AI model permission to use a computer is a security design problem, not a prompt engineering problem. Scope narrowly. Log everything. Never run it against your actual filesystem on the first attempt.
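One concrete shape for "scope narrowly, log everything": wrap the environment handle so every action passes an allowlist and an audit log before it executes. DesktopEnv and its act method here are the illustrative names from the example above, not a real SDK.

import logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("agent-audit")

ALLOWED_ACTIONS = {"click", "type", "scroll", "switch_app"}

class GuardedEnv:
    """Allowlist + audit log in front of a sandbox handle."""

    def __init__(self, env):
        self._env = env

    def act(self, action: str, target: str):
        if action not in ALLOWED_ACTIONS:
            audit.warning("BLOCKED %s on %s", action, target)
            raise PermissionError(f"action not allowed: {action}")
        audit.info("ALLOW %s on %s", action, target)
        return self._env.act(action, target)  # delegate to the sandbox

Run the agent against GuardedEnv(env) instead of env, and review the audit log before widening the allowlist.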
What Builders Should Actually Do This Week
Three concrete moves, in order of impact.
First, try it on a real workflow you have. Not a demo. A real internal process that eats an hour of someone's time every week. Something like "pull data from three SaaS tools and assemble a weekly report." Give GPT-5.4 computer-use access in a sandbox, give it the goal, and watch what happens. You'll learn more in an hour than you will from any benchmark.
Second, audit your tool-definition bloat. If you have an agent today that uses 20+ tools, your context is getting eaten by tool schemas. Native tool search lets you push that to 100+ without tanking latency or cost. Move old tool definitions out of the system prompt and into the tool-search registry.
Third, think about the sandbox. If computer-use agents become a real part of your stack, the bottleneck stops being model quality and starts being environment isolation. Start reading up on how to ship a secure sandbox now. The tooling is immature and it's where most production failures are going to happen in 2026.
The Bottom Line
Every major capability breakthrough in AI has the same shape: a line gets crossed quietly, then a year later people notice everything that was downstream of it has changed. GPT-5.4 on OSWorld is one of those lines. The builders who pick it up first are going to be the ones shipping genuinely useful agents in 2026 while everyone else is still reading recap articles.
Learn to Build Real AI Agents — Not Just Read About Them
The 2-day in-person Precision AI Academy bootcamp teaches agent engineering hands-on. 5 cities. $1,490. 40 seats max. June–October 2026 (Thu–Fri).
Reserve Your Seat