Where the pitch started
In March 2023, Toran Bruce Richards released AutoGPT on GitHub. The premise was compelling: wire GPT-4 into a planning loop, let it break goals into sub-tasks, give it tools, and you get an autonomous worker. The project hit the front page of Hacker News, then the top of GitHub trending, and now sits at around 183,000 stars (Significant-Gravitas/AutoGPT).
For a few weeks in spring 2023, "autonomous AI agent" looked like the shape the next wave of products would take. Give an AI a goal, come back to a finished result. The press ran with it. Venture capital chased it.
Three years later, the companies and projects that bet on the strong version of that pitch have mostly pivoted or gone quiet. The companies and projects that built the narrow version have become some of the fastest-growing software businesses in history by revenue. The gap is worth looking at carefully, because the evidence is clearer than the narrative.
What happened to the 2023 class
A short inventory of the agent projects from that first wave.
AutoGPT pivoted to AutoGPT Platform, a low-code workflow builder with Agent Builder, Forge, and agbenchmark (current repo). The original autonomous-loop architecture was replaced with configurable agents inside defined workflows, which is essentially the opposite of the original pitch. The pivot reads as an admission that the general case did not work at production scale.
BabyAGI was a few hundred lines of Python wiring GPT-4 to a task-queue loop, released by Yohei Nakajima. The original repo is now archived as babyagi_archive. Nakajima shipped BabyAGI 2 (functionz) in September 2024, but the project is effectively research-scale, not a production platform.
AgentGPT / reworkd pivoted hard. According to TechCrunch in July 2024, reworkd abandoned the general-agent product because AgentGPT was costing them around $2,000 a day in API calls without finding paying customers. They refocused on AI-powered structured web-scraping, raised $2.75M from Paul Graham, Nat Friedman and Daniel Gross's AI Grant, SV Angel and General Catalyst, and now sell a domain-specific product.
Devin, from Cognition, was the most ambitious 2024 attempt at a general-agent product. It launched in March 2024 with a flashy Upwork demo and framing as "the first AI software engineer". The demo was debunked publicly — independent analysis found the bugs Devin "fixed" in the demo were not real bugs in the repository. Cognition walked back the strongest claims and softened messaging. By early 2026, Devin had become a more bounded product, Cognition had acquired Windsurf, and company revenue was reported above $73 million ARR. That is a serious business, but it is not what the original pitch promised.
The pattern across all four: the general-agent premise did not survive contact with paying customers. The teams that survived either pivoted to a narrower product or accepted much more modest claims about what their general agent could do.
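For readers who have not seen one, the task-queue loop these projects were built around can be sketched in a few lines. This is a minimal illustration, not code from AutoGPT or BabyAGI: `fake_llm`, `run_agent`, and the `max_steps` cap are hypothetical names, and the model call is replaced by a deterministic stub so the sketch runs offline.

```python
from collections import deque
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns canned text so the sketch runs offline."""
    if prompt.startswith("PLAN:"):
        # A real model would decompose the goal; the stub returns two fixed sub-tasks.
        return "research the topic\nwrite a summary"
    return f"done: {prompt.removeprefix('EXECUTE:').strip()}"


def run_agent(goal: str, llm: Callable[[str], str], max_steps: int = 10) -> list[str]:
    """Plan sub-tasks for a goal, then execute them one at a time from a queue."""
    tasks = deque(llm(f"PLAN: {goal}").splitlines())
    results: list[str] = []
    steps = 0
    # The step cap is the only brake: nothing checks whether a result is correct.
    while tasks and steps < max_steps:
        task = tasks.popleft()
        results.append(llm(f"EXECUTE: {task}"))
        steps += 1
    return results


results = run_agent("explain GAIA", fake_llm)
print(results)
```

What the sketch leaves out is the point: there is no scope limit, no reviewable artifact between steps, and no human checkpoint, so an early mistake flows straight into every later step.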
What the benchmarks say
This is not just anecdote. The GAIA benchmark, introduced in the paper arXiv:2311.12983, measures agent performance on 466 questions across three difficulty levels, testing real-world problem solving with tool use. The humans-vs-frontier-agent gap is documented, not theoretical.
Current state on the Princeton HAL GAIA leaderboard: Claude Sonnet 4.5 leads at around 74.6%, with Anthropic models occupying the top six positions. Humans sit around 92%. The 74.6% figure is best-case performance on a static benchmark.
On Gaia2, a harder asynchronous variant introduced in 2025 (OpenReview), the current top score (GPT-5 at high reasoning) is around 42% pass@1. Frontier agents fail on time-sensitive tasks and on coordination across long timelines.
The useful way to read these numbers: on static, well-defined agent benchmarks the frontier has genuinely moved from 15% (GPT-4 with plugins, 2023) to about 75% (Claude Sonnet 4.5, 2026). On harder benchmarks that capture real-world messiness, the top score falls to about 42%, a bit more than half the GAIA figure. And the human baseline stays above 90%, which is where most production applications need to be.
This is progress, meaningful progress, but it is not "the agents can do your job" progress.
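The arithmetic behind that reading is easy to check. The snippet below uses only the scores quoted in this section; the variable names are mine, and comparing GAIA with Gaia2 directly is a loose comparison because they are different benchmarks.

```python
# Scores quoted in this section, in percent.
gaia_2023 = 15.0    # GPT-4 with plugins
gaia_best = 74.6    # Claude Sonnet 4.5 on the HAL GAIA leaderboard
gaia2_best = 42.0   # GPT-5 (high reasoning) pass@1 on Gaia2
human = 92.0        # human baseline on GAIA

gain_since_2023 = gaia_best - gaia_2023               # points gained on GAIA
gap_to_human = human - gaia_best                      # points still missing on GAIA
relative_drop = (gaia_best - gaia2_best) / gaia_best  # fraction lost moving to Gaia2

print(f"gain since 2023: {gain_since_2023:.1f} points")
print(f"gap to human:    {gap_to_human:.1f} points")
print(f"drop on Gaia2:   {relative_drop:.0%}")
```

With these inputs it reports a 59.6-point gain since 2023, a 17.4-point gap to the human baseline, and a 44% relative drop from the best GAIA score to the best Gaia2 score.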
What quietly became huge
While the general-agent projects were pivoting, narrow-scope agent products built real businesses.
Cursor surpassed $2 billion ARR by early 2026 (TechCrunch, March 2026), with over one million daily active users and around 360,000 paying customers. Cursor is an agent in the technical sense — it plans, calls tools, loops — but it is scoped to editing a codebase with a human reviewing every change.
Claude Code reached a run-rate of roughly $2.5 billion by early 2026, having hit $1 billion ARR six months after launch, according to industry coverage including uncoveralpha.com. Same pattern: narrow scope, artifact output, human supervision.
Replit grew from around $10 million ARR at the end of 2024 to $240 million over 2025, with a $9 billion valuation reported in 2026. Replit Agent is the growth engine. Again, narrow scope, bounded by what Replit can deploy.
GitHub Copilot reports roughly 20 million users and about $800 million ARR, with its coding-agent features growing alongside the base product.
Four businesses in the same shape: bounded domain, artifact-producing, human in the review loop. Collectively, going by the figures above, roughly $5.5 billion in annualised revenue in early 2026. None of them market themselves with the AutoGPT pitch. All of them are doing what AutoGPT tried to do, just with the ambition scaled to what actually works.
Computer Use as the transitional case
Anthropic's Computer Use is an interesting test case. It launched in October 2024 as a beta feature letting Claude take actions in a sandboxed virtual machine — click, type, navigate. The consumer rollout happened in March and April 2026, expanding to Mac and Windows within the Claude app for Pro and Max subscribers.
It is impressive, and it clearly points at the direction things are moving. It is also, in practice, a research preview with documented limitations. It is slow. It is vulnerable to prompt injection in browsing contexts. It requires user oversight. Anthropic's own documentation is explicit about the need to stop the model and verify at checkpoints. This is not the autonomous-worker product the 2023 pitch promised, and Anthropic is unusually open about the gap between what it does and what a fully general agent would require.
What works and why
Across the production winners (Cursor, Claude Code, Replit Agent, Copilot, and the narrow workflow tools like Lindy and Relevance AI), a consistent pattern emerges:
- Bounded domain. The agent operates inside a clearly specified area — code in a repo, actions in a workflow template, steps in a research document — rather than the open world.
- Artifact output. Each step produces something a human can read and accept or reject before the next step runs. Text. Code. A change proposal.
- Human review as a design assumption, not a failure mode. The products are built on the premise that the user is in the loop. When the user steps away, the agent stops.
- Failure is recoverable. Because the output is an artifact, not an action in the world, a mistake is a bad suggestion, not a wrong email sent or a wrong charge processed.
The 2023 class tried to escape all four of these. The results are documented in the pivots and the benchmark gaps.
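The four properties above fit in a small sketch. It is an illustration of the pattern under my own assumptions, not how any of the named products is implemented: `Proposal`, `in_scope`, `review_loop`, and the `repo` root are hypothetical names, and the workspace is an in-memory dict rather than real files.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Iterable


@dataclass(frozen=True)
class Proposal:
    """An artifact: a proposed change that has no effect until someone accepts it."""
    path: str
    new_text: str


def in_scope(path: str, root: str = "repo") -> bool:
    """Bounded domain: only paths inside the workspace root are allowed."""
    p = PurePosixPath(path)
    return ".." not in p.parts and PurePosixPath(root) in p.parents


def review_loop(
    proposals: Iterable[Proposal],
    approve: Callable[[Proposal], bool],
    root: str = "repo",
) -> dict[str, str]:
    """Apply only in-scope proposals that the reviewer accepts."""
    workspace: dict[str, str] = {}  # in-memory stand-in for the files being edited
    for proposal in proposals:
        if not in_scope(proposal.path, root):
            continue  # out of bounds: dropped before it reaches the reviewer
        if approve(proposal):
            workspace[proposal.path] = proposal.new_text
        # A rejection costs nothing: the proposal is simply discarded.
    return workspace


proposals = [
    Proposal("repo/app.py", "print('hello')"),
    Proposal("/etc/passwd", "oops"),
    Proposal("repo/notes.md", "draft"),
]
# Hypothetical reviewer that accepts Python files only.
accepted = review_loop(proposals, approve=lambda p: p.path.endswith(".py"))
print(accepted)
```

Here only `repo/app.py` ends up in the workspace: the `/etc/passwd` proposal is out of scope, and the notes file is rejected by the reviewer. Neither failure changes anything, which is the recoverability property in miniature.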
Where I think this goes next
Computer Use, both Anthropic's and the equivalents being worked on at OpenAI and Google, is where the experimentation is now. It is honest about being a preview, the limitations are documented, and the failure modes are being studied rather than glossed over. Whether it becomes a real product category depends on whether the prompt-injection and reliability problems get solved at a level that supports actions with consequences.
The Gaia2 numbers suggest the problem is still hard. A frontier reasoning model scoring 42% on a realistic benchmark is not production-ready. It is a direction.
In the meantime, the practical advice for anyone building or buying agent-shaped products in 2026 follows directly from the evidence:
- Prefer narrow scope. If a product pitch starts with "general-purpose", be skeptical.
- Prefer artifact output. Tools that produce something you can read and accept are safer than tools that act in the world.
- Prefer observability. You should be able to see every step the agent took and understand why.
- Distrust marketing that ignores the benchmark gap. The best GAIA scores are still about 17 points below the human baseline. Real production is a harder benchmark than GAIA.
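On the observability point, the minimum worth asking for is a step-by-step trace you can read afterwards. The sketch below shows one possible shape for such a trace; `Step`, `Trace`, and the tool names are hypothetical, and real products expose this in their own formats.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One recorded agent step: what it did, with what input, and why."""
    tool: str
    arguments: dict
    rationale: str
    timestamp: float = field(default_factory=time.time)


class Trace:
    """Append-only log of agent steps that can be dumped as JSON lines."""

    def __init__(self) -> None:
        self.steps: list[Step] = []

    def record(self, tool: str, arguments: dict, rationale: str) -> None:
        self.steps.append(Step(tool, arguments, rationale))

    def to_jsonl(self) -> str:
        # One JSON object per line, so the log can be grepped or diffed.
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


trace = Trace()
trace.record("read_file", {"path": "app.py"}, "need the current code before editing")
trace.record("propose_edit", {"path": "app.py"}, "fix the off-by-one in the loop")
print(trace.to_jsonl())
```

If a product cannot give you something equivalent to this, with the rationale attached to each action, you are trusting it rather than supervising it.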
The revolution did arrive. It just arrived in a different shape than the headlines promised.
Further reading
- Vibe Coding Is a Lie for related data on AI coding productivity.
- MCP Is More Important Than You Think for the infrastructure shift enabling the next wave.
- Cursor to Claude Code and Back for the head-to-head on the two winning agent-shaped coding products.
Sources
- AutoGPT repo: https://github.com/Significant-Gravitas/AutoGPT
- BabyAGI archive: https://github.com/yoheinakajima/babyagi_archive
- BabyAGI 2 (functionz): https://github.com/yoheinakajima/babyagi
- Reworkd pivot coverage, TechCrunch July 2024: https://techcrunch.com/2024/07/24/reworkd-paul-graham-nat-friedman-daniel-gross-scrape-ai-agents/
- Pragmatic Engineer on Devin walkback: https://newsletter.pragmaticengineer.com/p/the-pulse-90
- Cognition AI background: https://en.wikipedia.org/wiki/Cognition_AI
- GAIA paper: https://arxiv.org/abs/2311.12983
- Princeton HAL GAIA leaderboard: https://hal.cs.princeton.edu/gaia
- Gaia2 OpenReview: https://openreview.net/forum?id=9gw03JpKK4
- Cursor $2B ARR, TechCrunch: https://techcrunch.com/2026/03/02/cursor-has-reportedly-surpassed-2b-in-annualized-revenue/
- Claude Code growth coverage: https://www.uncoveralpha.com/p/anthropics-claude-code-is-having
- Replit valuation breakdown: https://www.buildmvpfast.com/blog/replit-9b-valuation-agentic-coding-vibe-coding-2026
