GitHub Copilot CLI Gets a Second Opinion — and It’s From a Different AI Family

Every developer knows the problem. You ask an AI coding agent to plan a solution, it looks reasonable, and you move forward. But somewhere in the execution, a flawed assumption gets baked in, and by the time you catch it, you’ve got a mess to unwind.

GitHub is taking a direct shot at that problem with a new experimental Copilot CLI feature called Rubber Duck.

What Rubber Duck Does

Rubber Duck uses a second model from a different AI family as an independent reviewer, assessing the agent's plans and work at the moments where feedback matters most.

The concept is straightforward. When a developer selects a Claude model as the primary orchestrator in the model picker, Rubber Duck runs on GPT-5.4. Different model families exhibit different training biases, so a review by a complementary family can surface errors that the primary model may consistently miss.

That’s the key insight here. It’s not just about adding a second set of eyes — it’s about adding a different kind of eyes. A model reviewing its own output can only catch what its training allows it to see. A model from a different family brings different assumptions, different blind spots, and different strengths.

The reviewer’s job is narrow: it produces a short list of concerns — assumptions the primary agent made without sufficient basis, edge cases that were overlooked, and implementation details that conflict with requirements elsewhere in the codebase.
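The pattern at work here can be sketched in a few lines. This is not GitHub's implementation — `primary_model` and `reviewer_model` are stand-in functions for calls to a Claude model and GPT-5.4, and the checkpoint logic is a hypothetical illustration of the plan-then-review loop described above:

```python
# Illustrative sketch of cross-family review (NOT GitHub's actual API).
# primary_model and reviewer_model are stubs standing in for two models
# from different AI families.

def primary_model(prompt: str) -> str:
    # Stand-in for the orchestrating model drafting a plan.
    return f"PLAN: {prompt}"

def reviewer_model(artifact: str) -> list[str]:
    # Stand-in for the cross-family reviewer. Its job is narrow:
    # return a short list of concerns (unfounded assumptions,
    # missed edge cases, conflicts with requirements).
    concerns = []
    if "error handling" not in artifact:
        concerns.append("Plan does not address error handling.")
    return concerns

def plan_with_review(task: str) -> tuple[str, list[str]]:
    plan = primary_model(task)
    # Checkpoint: review the draft plan before any code is written.
    concerns = reviewer_model(plan)
    if concerns:
        # Feed the concerns back so the primary model can revise
        # the plan before execution bakes in a flawed assumption.
        plan = primary_model(f"{task}; address: {'; '.join(concerns)}")
    return plan, concerns

plan, concerns = plan_with_review("add retry logic to the upload client")
```

The point of the shape is that the reviewer never writes code itself; it only emits concerns, and the primary model decides how to act on them.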

The Performance Numbers

GitHub tested Rubber Duck against SWE-Bench Pro, a benchmark built around complex, real-world coding problems pulled from open-source repositories.

Claude Sonnet 4.6, paired with Rubber Duck running GPT-5.4, achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.

That’s a meaningful result. Opus is a significantly more capable — and more expensive — model than Sonnet. Getting close to Opus-level performance by pairing Sonnet with a lightweight cross-family reviewer suggests that model collaboration may be a more cost-effective strategy than simply reaching for the most powerful single model every time.

The effect grows with problem difficulty — on the hardest problems, it delivers a +4.8% improvement. That’s exactly where developers need the help most.

When Rubber Duck Kicks In

Rubber Duck can be triggered automatically or on demand. GitHub Copilot invokes it automatically at three checkpoints: after the agent drafts a plan, after a complex implementation, and after writing tests but before running them.

Each checkpoint is deliberate. Catching a bad plan early is far less costly than refactoring code built on it. And reviewing tests before executing them — rather than after — gives the agent a chance to fix gaps before it convinces itself everything passes.

The agent can also invoke the review if it gets stuck, or users can request a critique directly. GitHub said the system is designed to be selective, focusing on moments when the signal is strongest without slowing the overall workflow.

According to Mitch Ashley, VP and practice lead for software lifecycle engineering at The Futurum Group, “GitHub’s Rubber Duck is a shift in how development teams should evaluate AI agent tooling. Cross-family model collaboration reflects recognition that model-family training bias is a systemic risk in agent-driven workflows, and that review by a complementary family surfaces errors no single model consistently catches.”

“For engineering teams, the cost implications are direct. Pairing Claude Sonnet 4.6 with Rubber Duck closes 74.7% of the performance gap with Opus running solo, at lower cost. Teams managing AI spend across large development organizations cannot defer that calculus.”

What This Means for Development Teams

This is more than a product feature update. It signals a shift in how teams should think about AI-assisted development. The question is no longer just "Which model should I use?" It may now be "Which two models work best together?"

The future of agent design may be less about picking a single best model and more about picking the right pair. That’s a meaningful reframe for engineering teams evaluating AI tooling at scale.

There’s also a practical cost angle. Running Sonnet with Rubber Duck costs less than running Opus solo — and the performance gap closes considerably. For teams managing AI spend across large development organizations, that math is worth paying attention to.

How to Try It

Rubber Duck is available now in experimental mode in GitHub Copilot CLI. Developers access it by running the /experimental slash command. From there, select any Claude model from the model picker — Opus, Sonnet, or Haiku — and Rubber Duck will automatically pair with GPT-5.4 as the reviewer. Access to GPT-5.4 is required.

GitHub has said it’s exploring additional model-family pairings, including configurations in which GPT-5.4 serves as the primary orchestrator, with Rubber Duck drawing from a different family.

The experimental label means the feature is still evolving. But the results from initial testing suggest that the underlying approach — cross-family model collaboration at critical checkpoints — is worth watching.
