Cursor Ships Composer 2: Frontier-Level Coding Performance at a Fraction of the Cost

Cursor released Composer 2 on March 19, and the pitch is less about being the best model on every benchmark and more about hitting a cost-to-intelligence ratio that changes how teams think about AI coding budgets.

Composer 2 scores 61.7 on Terminal-Bench 2.0, 73.7 on SWE-bench Multilingual, and 61.3 on CursorBench. It beats Claude Opus 4.6 (58.0 on Terminal-Bench) but trails GPT-5.4 (75.1). The pricing, though, is where the math shifts: $0.50 per million input tokens and $2.50 per million output tokens. That’s an 86% price reduction from Composer 1.5, which launched just last month at $3.50/$17.50. It’s also roughly 10x cheaper than Opus 4.6 per input token.
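The claimed reduction holds on both input and output pricing. A quick check of the arithmetic (mine, not Cursor's published math):

```python
# Sanity-check the claimed ~86% price reduction from Composer 1.5 to Composer 2.
old_input, old_output = 3.50, 17.50   # Composer 1.5, $ per million tokens
new_input, new_output = 0.50, 2.50    # Composer 2 standard

input_cut = 1 - new_input / old_input      # ~0.857 cut on input pricing
output_cut = 1 - new_output / old_output   # ~0.857 cut on output pricing
print(f"input -{input_cut:.0%}, output -{output_cut:.0%}")  # both round to 86%
```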

A faster variant ships at $1.50/$7.50 with identical intelligence and lower latency. Cursor is making the fast variant the default — a signal that they’re optimizing for the daily coding experience rather than benchmark bragging rights.

What Makes Composer 2 Different

The performance jump from Composer 1.5 to Composer 2 is unusually large. CursorBench went from 44.2 to 61.3. Terminal-Bench from 47.9 to 61.7. SWE-bench Multilingual from 65.9 to 73.7. Those aren’t incremental gains — a 17-point improvement on CursorBench in a single generation is the kind of leap that typically takes multiple release cycles.

Cursor attributes the improvement to two technical decisions. First, a continued-pretraining run gives the model a stronger base before reinforcement learning begins. Second, reinforcement learning on long-horizon coding tasks teaches the model to handle work that requires hundreds of sequential actions.

The long-horizon capability matters for practical work. A model that can sustain coherent execution over hundreds of steps can handle project-scale refactors, multi-file feature implementations, and complex debugging sessions — tasks where earlier models would lose context and produce inconsistent results.

Cursor’s approach to long-horizon execution includes what they call self-summarization — a technique in which the model pauses during extended tasks to compress its context to roughly 1,000 tokens, then resumes. Because this compression occurs within the reinforcement learning training loop, the model learns which information to retain and which to discard. This directly addresses the context management problem that Random Labs’ Slate architecture also targets, though through a fundamentally different mechanism — model-level self-compression rather than architectural thread boundaries.
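Cursor hasn't published the mechanism, but the described behavior can be sketched roughly as an agent loop that pauses to compress its own history. Everything below is an assumption for illustration: `call_model`, `count_tokens`, the action object, and the context limit are hypothetical stand-ins, not Cursor APIs.

```python
# Hedged sketch of self-summarization in a long-horizon agent loop.
# All names here (call_model, count_tokens, the action object) are
# hypothetical stand-ins for illustration, not Cursor's actual APIs.

SUMMARY_BUDGET = 1_000    # target size of the compressed context, in tokens
CONTEXT_LIMIT = 120_000   # assumed working-context ceiling

def run_long_task(task, call_model, count_tokens, max_steps=500):
    """Run an agent loop, compressing context whenever it grows too large."""
    context = [f"TASK: {task}"]
    for step in range(max_steps):
        action = call_model(context)  # next edit / command / tool call
        context.append(f"STEP {step}: {action.result}")
        if action.done:
            return action.result
        # The model pauses and compresses its own history to roughly
        # SUMMARY_BUDGET tokens, choosing what to retain. Per Cursor,
        # this compression happens inside the RL training loop, so the
        # retention policy is learned rather than hand-written.
        if count_tokens(context) > CONTEXT_LIMIT:
            prompt = context + [f"Summarize progress in <= {SUMMARY_BUDGET} tokens."]
            summary = call_model(prompt)
            context = [f"TASK: {task}", f"SUMMARY: {summary.result}"]
    return None
```

The key design point, under this reading, is that the compression step is itself a model call made inside training, so the model is rewarded for keeping exactly the details that later steps turn out to need.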

The Competitive Picture

Composer 2 is the third Composer release since October 2025. The pace reflects Cursor’s position: The company hit a $2 billion annualized run rate in February 2026, has over 1 million daily users and 50,000 business customers, and is valued at $29.3 billion.

The AI coding model landscape has gotten crowded fast. GPT-5.4 launched earlier this month in GitHub Copilot. Anthropic’s Claude Opus 4.6 and Sonnet 4.6 power Claude Code. Google’s Gemini 3.1 Pro handles planning in Gemini CLI’s new plan mode. And now Cursor has its own frontier-class model trained specifically for coding.

What’s notable about Composer 2’s competitive positioning is what Cursor isn’t claiming. They’re not saying it’s the best model for everything. GPT-5.4 still leads on Terminal-Bench. The argument instead is about the Pareto frontier — the best combination of intelligence and cost for practical coding work.

For teams running AI coding agents at scale, token costs compound fast. An agent that generates 10 million output tokens per month costs $25 on Composer 2 standard versus roughly $150 on comparable frontier models. Multiply that across a 50-person engineering team, and the annual budget difference becomes significant. Lower token costs also mean teams can afford to let agents work longer on harder problems without watching the meter.
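Those figures imply an output rate of about $15 per million tokens for the comparable frontier model; that rate is my inference from the article's $150 figure, not a quoted vendor price. The arithmetic, sketched out:

```python
# Project agent token spend at the article's stated rates. The $15/M
# frontier output rate is inferred from the $150 figure, not a quoted
# vendor price; team size and usage are illustrative.
TOKENS_M_PER_MONTH = 10      # output tokens per agent per month, in millions
COMPOSER2_RATE = 2.50        # $ per million output tokens (standard variant)
FRONTIER_RATE = 15.00        # implied comparable frontier rate

composer = COMPOSER2_RATE * TOKENS_M_PER_MONTH   # $25 per month
frontier = FRONTIER_RATE * TOKENS_M_PER_MONTH    # $150 per month
annual_delta_50_devs = (frontier - composer) * 12 * 50
print(composer, frontier, annual_delta_50_devs)  # 25.0 150.0 75000.0
```

Under these assumptions, a 50-person team saves $75,000 a year on output tokens alone, before counting input tokens or heavier-than-assumed agent usage.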

“Composer 2 reframes how AI coding tools compete. Cursor’s move to build and own its own model reflects a market where competitive advantage is anchored in controlling the full stack from model to interface, with token economics increasingly determining which platforms engineering teams can afford to run at scale,” according to Mitch Ashley, VP and practice lead for software lifecycle engineering at The Futurum Group.

“Underestimate costs at your own peril. For organizations deploying agents continuously, costs compound fast. Platform selection will hinge on stack ownership and cost structure; teams that defer that analysis will hit budget ceilings before capability ones.”

Why This Matters for DevOps

Composer 2 signals a shift in how AI coding tools compete. The first wave was about capability — can the model write code at all? The second wave was about agent workflows — can it handle multi-step tasks, open PRs, review code? This wave is about economics. Can you afford to run these agents continuously across your engineering organization?

Cursor’s answer is to build its own model optimized for its product. Composer 2 isn’t available as a standalone API — it only runs inside Cursor. That’s a platform lock-in play, but it’s also how they achieve the cost structure. By controlling the full stack from model to interface, they can optimize token efficiency, caching, and inference in ways that aren’t possible when consuming third-party models through generic APIs.

The self-summarization technique is worth watching independently of Cursor. We’ve covered context management across multiple articles this month — Random Labs’ thread-based episodic memory in Slate, Gemini CLI’s plan mode for read-only exploration, VS Code’s context compaction with guided retention. Cursor’s approach of training the model itself to compress context during long tasks is a fundamentally different strategy. Rather than building architectural scaffolding around the model’s context limitations, they’re teaching the model to manage its own memory. If the technique scales, it could reduce the need for some of the architectural complexity that other agent frameworks are building.

The broader trend is clear: AI coding tools are becoming vertically integrated. Cursor builds its own models. GitHub trains for Copilot’s specific use cases. Anthropic optimizes Claude for Claude Code. Google routes between Gemini models based on the task phase. The era of “plug in any model and get the same results” is ending. The model, the harness, the interface, and the economics are increasingly designed as a single system.

Composer 2 is available now in Cursor and in the early alpha of Cursor Glass, their new interface.
