Ornith Models Automate Agentic Coding With Self-Scaffolding

Ornith, a new family of open source large language models (LLMs) from the DeepReinforce research collective, takes a novel approach to coding and debugging tasks: it generates a scaffold, a structured instruction set that the user’s harness uses to create an agent to complete the job.

Available in four variants, the Ornith family was trained to work with complex software repositories and to take on complicated long-horizon jobs. Sure, LLMs can do these tasks now – until the job gets too complex. Ornith’s self-generated scaffolding is meant to ensure that it doesn’t lose the plot along the way.

“The model continuously improves not only its code generation abilities but also the orchestration strategy used to solve software engineering problems,” wrote AI tutorial engineer Mehul Gupta, in an introductory post.

Deep Reinforcement Expansion Pack

Ornith reads the user’s instruction, but instead of executing it directly, it builds a scaffold, a learnable object. The scaffold serves as a place where Ornith can design – and refine – the architecture for the job.

According to Gupta, the scaffold is where the LLM can detail the reasoning sequences, memory organization, debugging strategy, tool invocation order and execution planning. The user’s harness then interprets the scaffold to generate an agent to execute the task.

When the job is finished, the scaffold is deleted. When a new task comes up, Ornith builds a fresh scaffold to execute that job.
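
The lifecycle described above (build a scaffold, let the harness turn it into an agent, discard it when the job ends) can be sketched roughly as follows. This is a minimal illustration only, not Ornith's actual format or API: the `Scaffold` class, its field names, `build_scaffold`, `run_agent`, `handle` and the demo tools are all hypothetical, chosen to mirror the parts of the scaffold that Gupta lists.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scaffold:
    """Hypothetical per-task plan; fields mirror the parts named in the article."""
    task: str
    reasoning_steps: List[str]
    tool_order: List[str]
    debugging_strategy: str
    memory: Dict[str, str] = field(default_factory=dict)


def build_scaffold(instruction: str) -> Scaffold:
    """Stand-in for the model call that would emit a scaffold for one task."""
    return Scaffold(
        task=instruction,
        reasoning_steps=["locate failing test", "patch code", "re-run tests"],
        tool_order=["search", "edit", "test"],
        debugging_strategy="re-run tests after every edit",
    )


def run_agent(scaffold: Scaffold, tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Harness side: call the tools in the order the scaffold prescribes."""
    log = []
    for name in scaffold.tool_order:
        result = tools[name](scaffold.task)
        scaffold.memory[name] = result  # keep intermediate results with the plan
        log.append(f"{name}: {result}")
    return log


def handle(instruction: str, tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """One task, one scaffold: build it, run it, then throw it away."""
    scaffold = build_scaffold(instruction)
    try:
        return run_agent(scaffold, tools)
    finally:
        del scaffold  # the scaffold does not outlive the job


if __name__ == "__main__":
    # Dummy tools that return canned strings, purely for demonstration.
    demo_tools = {
        "search": lambda task: "found 2 candidate files",
        "edit": lambda task: "applied patch",
        "test": lambda task: "all tests pass",
    }
    for line in handle("fix the failing login test", demo_tools):
        print(line)
```

The point of the sketch is the separation of roles: the model only describes the plan, while the harness owns execution, so a different plan can be swapped in for each task without changing the harness.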

“By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions,” the researchers state in a post.

Ornith builds the scaffolding from a set of rules the model developed during training. The models were produced by an exhaustive self-learning process that used deep reinforcement learning techniques to computationally rotate through the possible ways of addressing an issue.

Four Models

Ornith’s four variants are 9B Dense, 31B Dense, 35B MoE and 397B MoE. The “B” in each name gives the parameter count in billions. The “Dense” models activate every parameter for each token, whereas the MoE (Mixture of Experts) models activate only the parameters most relevant to the input at hand, though they carry additional specialized components (the “experts”) for particular functions.
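
To make the Dense versus MoE distinction concrete, here is a generic sketch of top-k expert routing, the usual mechanism behind Mixture of Experts layers. It is not Ornith's implementation: the router scores, the expert count, `k` and the parameter sizes are all invented for illustration, and the article does not say how many experts or active parameters the Ornith MoE models use.

```python
from typing import List


def top_k_experts(router_scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def active_params(shared: float, per_expert: float, k: int) -> float:
    """Parameters used per token: shared layers plus the k selected experts."""
    return shared + k * per_expert


if __name__ == "__main__":
    # Hypothetical router output for one token across 8 experts.
    scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
    print("experts used for this token:", top_k_experts(scores, k=2))

    # Hypothetical sizes in billions: 2B shared, 8 experts of 1B each.
    total = 2.0 + 8 * 1.0
    active = active_params(shared=2.0, per_expert=1.0, k=2)
    print(f"total {total:.0f}B parameters, {active:.0f}B active per token")
```

A dense model corresponds to the case where every parameter is used for every token, which is why a dense and an MoE model with similar names can have very different per-token compute costs.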

The variants are built atop the open source Gemma 4 and Qwen 3.5 models, allowing the researchers to layer coding-specific deep RL rules over those models’ inherent fluency in language and world knowledge.

The dense models are best suited for running on local hardware. Ideal for a laptop, 9B Dense can write small scripts and execute various single-file cleanup tasks, whereas the 31B Dense requires a full workstation with up to 48GB of VRAM, but can internalize a full view of a complicated multi-file repository for tougher problems.
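
As a rough back-of-the-envelope check on these hardware tiers, the memory needed just to hold a model's weights is the parameter count times the bytes per parameter. The sketch below assumes common 16-bit, 8-bit and 4-bit weight formats; the article does not say which precision the Ornith models ship in, and real usage is higher once activations and the context cache are included. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for name, size in [("9B Dense", 9), ("31B Dense", 31)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size, bits)
            print(f"{name} at {bits}-bit: ~{gb:.1f} GB for weights")
```

Under these assumptions, a 31-billion-parameter model needs about 62 GB for weights at 16-bit precision but about 31 GB at 8-bit, which is one way such a model could fit on a 48GB workstation card.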

The MoE variants are best run in the cloud. The 35B MoE is perhaps best suited for quick continuous integration patching and code review. The 397B MoE is the flagship model, a competitor to Opus 4.7, in the organization’s estimation. This behemoth requires a cluster of GPUs to run smoothly, and can tackle the hardest coding problems.

Killer Performance

With this diversity of models, Ornith’s performance metrics are “just killing it all over the place,” with impressive marks across small, middle and large LLM categories, observed the Data Science in Your Pocket YouTube channel, which is based in Hyderabad, Telangana. This is “a breakthrough … one of a kind model,” they noted.

In the organization’s own tests, Ornith-1.0-397B outperformed Claude Opus 4.7 on Terminal-Bench 2.1, a benchmark for LLMs in terminal environments, scoring 77.5 to Claude’s 70.3.

Likewise, Ornith-1.0-35B significantly outperformed similar mid-sized models, including Qwen 3.5 (9 billion parameters) and Gemma 4 (12 billion parameters). It even rivaled the 31-billion-parameter Gemma 4 model.