Automating Benchmark Analysis with Agent-Driven Development

In software engineering, automation often comes full circle: you build a tool to eliminate drudgery, then find yourself maintaining that very tool. For one AI researcher on the Copilot Applied Science team, this cycle took a new turn. Faced with analyzing thousands of agent performance trajectories from benchmarks like TerminalBench2, the researcher created eval-agents, a system that automates the intellectual toil of pattern discovery. Below, we walk through the details in a series of questions.

What problem did the author face when analyzing coding agent benchmarks?

The author’s job revolves around measuring coding agent performance against standardized evaluation benchmarks such as TerminalBench2 or SWEBench-Pro. Each benchmark consists of many tasks, and every task produces a trajectory: a detailed record of the agent’s thought processes and actions. These trajectories are stored as JSON files, often containing hundreds of lines each. Multiply that over dozens of tasks and multiple benchmark runs per day, and you’re looking at hundreds of thousands of lines of JSON to review by hand. No single person can efficiently identify important patterns across such a massive dataset. The researcher needed a way to surface meaningful insights without drowning in raw data.

Source: github.blog

How did GitHub Copilot initially help with trajectory analysis?

Before building a dedicated tool, the researcher leaned on GitHub Copilot to sift through the noise. Copilot proved invaluable for spotting recurring patterns within the trajectories. Instead of reading hundreds of thousands of lines, the researcher could use Copilot to highlight promising areas, cutting the manual workload to a few hundred lines worth investigating. This created a repetitive loop: use Copilot to find patterns, then investigate them. The loop worked, but it was still manual and time-consuming. The engineer in them saw an opportunity to automate the entire process, which led to a new tool.

What are agent trajectories and why are they difficult to analyze?

Agent trajectories are essentially logs that capture every step a coding agent takes while solving a task. They list the agent’s thought process, actions taken, and results, much like a transcript of its reasoning. In a typical benchmark evaluation, each task produces its own trajectory file (usually in JSON format). These files can be hundreds of lines long, and with dozens of tasks per benchmark and many runs per day, the total volume can reach hundreds of thousands of lines. Manually combing through this data to find successes, failures, or performance trends is impractical. The difficulty lies not just in the volume, but in the need to correlate patterns across multiple trajectories to draw meaningful conclusions.
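To make the cross-trajectory problem concrete, here is a minimal Python sketch. It is not taken from eval-agents: the schema (task_id, passed, and steps with thought, action, and result fields) and the helper names are assumptions invented for illustration, since the source does not describe the real trajectory format. The sketch counts, for each action, how many failed trajectories used it at least once.

```python
import json
from collections import Counter

# Hypothetical trajectory schema, invented for illustration only:
# {"task_id": str, "passed": bool,
#  "steps": [{"thought": str, "action": str, "result": str}, ...]}
RAW_SAMPLES = [
    json.dumps({
        "task_id": "t1", "passed": False,
        "steps": [
            {"thought": "inspect repo", "action": "ls", "result": "ok"},
            {"thought": "run tests", "action": "pytest", "result": "timeout"},
        ],
    }),
    json.dumps({
        "task_id": "t2", "passed": True,
        "steps": [
            {"thought": "inspect repo", "action": "ls", "result": "ok"},
            {"thought": "run tests", "action": "pytest", "result": "ok"},
        ],
    }),
    json.dumps({
        "task_id": "t3", "passed": False,
        "steps": [
            {"thought": "run tests", "action": "pytest", "result": "timeout"},
            {"thought": "retry tests", "action": "pytest", "result": "timeout"},
        ],
    }),
]


def load_trajectory(raw: str) -> dict:
    """Parse one trajectory from its JSON text."""
    return json.loads(raw)


def failing_action_counts(trajectories: list[dict]) -> Counter:
    """Count, per action, how many failed trajectories used it at least once."""
    counts: Counter = Counter()
    for trajectory in trajectories:
        if trajectory["passed"]:
            continue
        # A set ensures an action repeated inside one trajectory counts once.
        counts.update({step["action"] for step in trajectory["steps"]})
    return counts


trajectories = [load_trajectory(raw) for raw in RAW_SAMPLES]
failure_counts = failing_action_counts(trajectories)
print(failure_counts.most_common())
```

Even in this toy form, the answer depends on reading every failed trajectory together. That is the part that stops scaling when the inputs are real files with hundreds of lines each.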

What is eval-agents and why was it created?

eval-agents is a custom tool built by the researcher to automate the intellectual labor of analyzing benchmark trajectories. The name reflects its purpose: agents that evaluate other agents. The tool is designed to automatically scan, summarize, and highlight key patterns across hundreds of thousands of lines of trajectory data. It was born from the frustration of repeating the same analysis cycle — using Copilot to surface patterns and then investigating them manually. By automating this loop, the researcher freed themselves and their team to focus on deeper insights and creative problem-solving, rather than rote data mining.

What were the three main goals for the eval-agents project?

The project was guided by three goals: easy sharing and usage, easy authoring of new agents, and making coding agents the primary vehicle for contributions. The first two goals align with GitHub’s collaborative DNA and the researcher’s experience as an open-source maintainer of the GitHub CLI. They wanted anyone on the team to be able to run the analysis without a steep learning curve. The third goal ensures that contributions happen through agents themselves: new insights, fixes, or improvements can be coded as new agents, scaling the team’s ability to iterate on evaluation workflows. This framework turns the tool into a platform for collective intelligence.
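One way to picture "easy authoring of new agents" is a small registry that lets a contributor add an analysis by writing one function. The Python sketch below is purely hypothetical: the source does not describe eval-agents' actual authoring interface, so the names AGENTS, register, and run_agent are invented here, and plain functions stand in for what would really be coding agents.

```python
from typing import Callable

# Hypothetical authoring interface; not the real eval-agents API.
# An "agent" here is just a function from trajectories to a text summary.
AnalysisAgent = Callable[[list[dict]], str]
AGENTS: dict[str, AnalysisAgent] = {}


def register(name: str) -> Callable[[AnalysisAgent], AnalysisAgent]:
    """Return a decorator that adds an analysis function to the shared registry."""
    def decorator(fn: AnalysisAgent) -> AnalysisAgent:
        AGENTS[name] = fn
        return fn
    return decorator


@register("failure-rate")
def failure_rate(trajectories: list[dict]) -> str:
    """Summarize how many trajectories failed (assumes a boolean 'passed' field)."""
    failed = sum(1 for t in trajectories if not t["passed"])
    return f"{failed}/{len(trajectories)} trajectories failed"


def run_agent(name: str, trajectories: list[dict]) -> str:
    """Look up a registered analysis by name and run it."""
    return AGENTS[name](trajectories)


summary = run_agent("failure-rate", [{"passed": True}, {"passed": False}])
print(summary)
```

The design point is that sharing and authoring use the same mechanism: anything registered by one teammate can be run by name by another.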

How does eval-agents enable collaboration between engineering and science teams?

Engineering and science teams often speak different languages. Engineers build tools, while scientists analyze results. eval-agents bridges this gap by allowing both groups to contribute through a shared medium: coding agents. Engineers can craft agents that automate specific analyses, and scientists can use those agents without writing complex scripts. Moreover, the platform is designed so that new agents are easy to author and share, fostering a culture of contribution. This means that when a scientist discovers a new pattern worth tracking, they can turn that insight into an agent — or ask an engineer to do so. The result is a virtuous cycle where analysis becomes both automated and collaborative.

What does this approach mean for the future of development?

The creation of eval-agents represents a shift from automating manual, repetitive tasks to automating intellectual toil — the kind of thoughtful, pattern-recognition work that typically requires human expertise. By enabling agents to perform analysis that once demanded a researcher’s full attention, the team can accelerate their feedback loops and focus on higher-level questions. This approach hints at a future where developers and AI researchers routinely create specialized agents that collaborate on complex tasks, from debugging to performance tuning. Rather than replacing jobs, these tools redefine them—turning engineers into architects of automated intelligence, and scientists into directors of a digital research workforce.
