GPT-5.4 Just Beat Humans at Using a Computer

75% on OSWorld versus a 72.4% human baseline. For the first time in AI history, a model is more reliable than a human tester at general desktop work. What that actually means.

At a glance: OSWorld score 75.0% (vs. 72.4% human baseline; GPT-5.3: 64.0%, GPT-5.2: 47.3%) · 1M-token context · $2.50 per million input tokens

For the first time, an AI model is more reliable at using a computer than the human experts testing it. OpenAI's GPT-5.4 scored 75% on OSWorld — a benchmark that asks models to complete real desktop tasks in real operating environments. The human expert baseline on the same tasks is 72.4%. No model had ever crossed that line before. GPT-5.4 didn't just cross it — it cleared it by a comfortable margin.

This matters for a specific, narrow reason that is easy to miss in the headline. It isn't that GPT-5.4 is smarter than humans at everything — it isn't. It's that the specific combination of capabilities required for agentic computer use — vision, planning, tool invocation, error recovery — has quietly been getting better for months. March 2026 was the month the line was crossed. If you build with AI agents, you need to understand what actually changed.

01

Why OSWorld Is the Benchmark That Matters

Benchmarks age fast in AI. A year ago, "passing" MMLU was considered remarkable. Today most frontier models saturate it. OSWorld is different for one reason: it measures whether a model can actually do computer work, not just reason about it.

OSWorld Progression, OpenAI Models (2025–2026)
(Score = percentage of benchmark tasks completed correctly)

Human expert              72.4%
GPT-5.2 (Nov 2025)        47.3%
GPT-5.3-Codex (Jan 2026)  64.0%
GPT-5.4 (Mar 2026)        75.0%

Source: OpenAI, OSWorld official leaderboard, March 2026

The tasks on OSWorld aren't trivia or math problems. They're things like: "Open this spreadsheet, copy the total from row 14 into the email draft in the other window, attach the PDF from the downloads folder, and send it." Completing that reliably requires coordinating vision (find the cell), planning (sequence the steps), tool use (switch applications), and error recovery (what if the copy fails?).
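Error recovery is the piece most people underestimate in that list. Stripped of the model entirely, the control flow an agent needs can be sketched in a few lines of plain Python (the step functions here are hypothetical stand-ins for actions like "find the cell" or "switch applications"):

```python
def run_with_recovery(steps, max_retries=2):
    """Run agent steps in order; retry a failed step before aborting the task.

    `steps` is a list of callables returning True on success. Returns the
    attempt log and whether the whole task completed.
    """
    log = []
    for i, step in enumerate(steps):
        for attempt in range(max_retries + 1):
            if step():
                log.append((i, attempt, "ok"))
                break
            log.append((i, attempt, "failed"))
        else:
            # Step exhausted its retries; abort rather than plow ahead.
            return log, False
    return log, True
```

A task that fails on the first copy attempt but succeeds on the retry still completes — which is exactly the behavior that separates a 75% agent from a 47% one.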

That's why a nearly 28-point jump from GPT-5.2 to GPT-5.4 in four months is the number to pay attention to. On MMLU that jump would mean nothing. On OSWorld it means AI agents went from unreliable to usable in a single product cycle.

02

What Else Is New

OSWorld is the headline number, but GPT-5.4 shipped with four platform upgrades that matter for builders.

01

1 Million Token Context

Full codebases, entire meeting transcripts, or an entire day's worth of operational logs can go into a single call. This collapses a lot of the complexity of RAG for medium-size retrieval problems.

Think "fit it all in one call"
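Before collapsing a RAG pipeline into one call, it's worth a quick sanity check on whether your corpus actually fits. A rough back-of-envelope version, using the common ~4-characters-per-token heuristic for English text (the function and the reserve figure are my own sketch, not an OpenAI utility):

```python
def fits_in_context(docs, context_limit=1_000_000, chars_per_token=4, reserve=50_000):
    """Rough check: can these documents go into one call instead of a RAG pipeline?

    `reserve` leaves headroom for the prompt, tool schemas, and model output.
    Returns (estimated_tokens, fits).
    """
    est_tokens = sum(len(d) for d in docs) // chars_per_token
    return est_tokens, est_tokens <= context_limit - reserve

# A day of logs at ~2 MB of text:
tokens, ok = fits_in_context(["x" * 2_000_000])  # → (500000, True)
```

For anything close to the limit, use a real tokenizer instead of the heuristic — but for the "does this obviously fit?" question, this is enough.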
02

Unified Architecture

Previous GPT-5 generations shipped separate specialists: GPT-5 for chat, Codex for code, the o-series for reasoning. GPT-5.4 folds them into a single model that selects the right behavior at runtime. One endpoint, fewer decisions.

One model for everything
03

Native Tool Search

Instead of loading every tool definition into context up front, the model is given a lightweight list and looks up full definitions on demand. Tool-heavy agents with hundreds of tools become viable.

Scale past 100 tools without choking context
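The pattern behind native tool search is easy to sketch in plain Python (a hypothetical registry, not the OpenAI API): keep only names and one-line summaries in context, and resolve full schemas on demand.

```python
class ToolRegistry:
    """Lightweight tool index; full schemas are fetched only when needed."""

    def __init__(self):
        self._schemas = {}    # name -> full JSON-schema-style definition
        self._summaries = {}  # name -> one-line description kept in context

    def register(self, name, summary, schema):
        self._summaries[name] = summary
        self._schemas[name] = schema

    def index(self):
        # This is all the model sees up front: tiny compared to full schemas.
        return [{"name": n, "summary": s} for n, s in self._summaries.items()]

    def lookup(self, name):
        # Called only when the model decides it needs a specific tool.
        return self._schemas[name]
```

With 200 tools, the index costs a few hundred tokens instead of the tens of thousands that 200 full parameter schemas would.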
04

Computer Use Built In

Vision, UI reasoning, and action-taking are no longer a separate product. They're a mode of the base model. Turn it on, give it a sandbox, and GPT-5.4 can actually drive a browser or a desktop.

Agents are first-class, not an add-on
03

What It Looks Like in Code

If you've never wired up a computer-use agent, this is roughly the shape of the workflow. You give GPT-5.4 a scoped environment, a goal, and let it drive.

computer_use.py
Python
from openai import OpenAI
from sandbox import DesktopEnv  # your sandbox layer, not an OpenAI package

client = OpenAI()

# Boot an isolated VM snapshot; never point this at a real machine.
env = DesktopEnv(snapshot="ubuntu-24.04-office")

response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use", "env": env.handle}],
    input="Open the spreadsheet on the desktop. Copy the "
          "total from row 14 and paste it into the draft email "
          "in Thunderbird. Attach the latest PDF from Downloads "
          "and send it to [email protected]."
)

# Each step records what the model did, on what, and whether it worked.
for step in response.steps:
    print(f"{step.action}: {step.target} → {step.result}")

The real work in production isn't the API call — it's the sandbox. Giving an AI model permission to use a computer is a security design problem, not a prompt engineering problem. Scope narrowly. Log everything. Never run it against your actual filesystem on first attempt.
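"Scope narrowly, log everything" translates directly into code. A minimal sketch of a guard layer (all names here are hypothetical; `inner` stands for whatever your sandbox handle exposes): every action passes through an allowlist, and every attempt, blocked or not, lands in an append-only audit log.

```python
import json
import time


class GuardedEnv:
    """Wraps a sandbox handle: allowlist actions, log every attempt."""

    def __init__(self, inner, allowed_actions, audit_path="agent_audit.jsonl"):
        self.inner = inner                    # object exposing perform(action, target)
        self.allowed = set(allowed_actions)   # e.g. {"click", "type", "scroll"}
        self.audit_path = audit_path

    def perform(self, action, target):
        entry = {"ts": time.time(), "action": action, "target": target}
        if action not in self.allowed:
            entry["result"] = "BLOCKED"
            self._log(entry)
            raise PermissionError(f"action {action!r} not in allowlist")
        entry["result"] = self.inner.perform(action, target)
        self._log(entry)
        return entry["result"]

    def _log(self, entry):
        # Append-only JSONL: cheap to write, easy to replay after an incident.
        with open(self.audit_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
```

The important property is that blocked actions are logged too — when an agent starts probing outside its scope, you want to see it, not just stop it.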

04

What Builders Should Actually Do This Week

Three concrete moves, in order of impact.

First, try it on a real workflow you have. Not a demo. A real internal process that eats an hour of someone's time every week. Something like "pull data from three SaaS tools and assemble a weekly report." Give GPT-5.4 computer-use access in a sandbox, give it the goal, and watch what happens. You'll learn more in an hour than you will from any benchmark.

Second, audit your tool-definition bloat. If you have an agent today that uses 20+ tools, your context is getting eaten by tool schemas. Native tool search lets you push that to 100+ without tanking latency or cost. Move the full tool definitions out of the system prompt and into the tool-search registry.
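The audit itself is a ten-line script. This sketch estimates tokens per schema with the same rough ~4-characters-per-token heuristic used for English text (the function name and heuristic are mine, not a library API):

```python
import json


def schema_token_cost(tool_schemas, chars_per_token=4):
    """Estimate tokens consumed by each tool definition and the total.

    Rough heuristic, but plenty accurate for deciding which schemas
    to move out of the system prompt and into a tool-search registry.
    """
    per_tool = {
        t["name"]: len(json.dumps(t)) // chars_per_token for t in tool_schemas
    }
    return per_tool, sum(per_tool.values())
```

Run it over your current tool list, sort by cost, and move the heavy schemas out first.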

Third, think about the sandbox. If computer-use agents become a real part of your stack, the bottleneck stops being model quality and starts being environment isolation. Start reading up on how to ship a secure sandbox now. The tooling is immature and it's where most production failures are going to happen in 2026.

The Bottom Line

The Verdict
OSWorld is the benchmark that matters now, and GPT-5.4 just crossed the human line. Agents went from unreliable to usable in one release. Build something with it this week — the first movers are going to compound fast.

Every major capability breakthrough in AI has the same shape: a line gets crossed quietly, then a year later people notice everything that was downstream of it has changed. GPT-5.4 on OSWorld is one of those lines. The builders who pick it up first are going to be the ones shipping genuinely useful agents in 2026 while everyone else is still reading recap articles.

Learn to Build Real AI Agents — Not Just Read About Them

The 2-day in-person Precision AI Academy bootcamp teaches agent engineering hands-on. 5 cities. $1,490. 40 seats max. June–October 2026 (Thu–Fri).

Reserve Your Seat

Published By

Precision AI Academy

Practitioner-focused AI education · 2-day in-person bootcamp in 5 U.S. cities

Precision AI Academy publishes deep-dives on applied AI engineering for working professionals. Founded by Bo Peng (Kaggle Top 200) who leads the in-person bootcamp in Denver, NYC, Dallas, LA, and Chicago.

Kaggle Top 200 · Federal AI Practitioner · 5 U.S. Cities · Thu–Fri Cohorts