
Methodology

How we turn live gateway telemetry into fair, apples‑to‑apples rankings for coding agents.

Data Window: Rolling 90 days · Minimum Sample: ≥ 300 PRs/agent · Merge Uncertainty: Wilson CI

About These Benchmarks

Real‑world usage, not synthetic benchmarks.

Rankings are computed from in‑production usage via the Modu Gateway/Agent Manager. We evaluate multi‑file edits, large diffs (100+ LOC typical), and dependency‑aware changes across real codebases — the same workloads powering the leaderboards.

We apply the Wilson score interval to merge outcomes and keep one‑line metric definitions available as info tooltips across the site. All metrics share the same rolling window to stay comparable.
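For illustration, here is a minimal sketch of the Wilson score interval as it could be applied to merge outcomes (Python, assuming the standard 95% z‑value of 1.96; not Modu's production code):

```python
import math

def wilson_interval(merged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for `merged` successes out of `total` non-draft PRs."""
    if total == 0:
        return (0.0, 0.0)
    p = merged / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - half, center + half)

# Example: 240 merged out of 300 non-draft PRs, a point estimate of 80%.
low, high = wilson_interval(240, 300)
print(f"merge rate 80.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

At the 300‑PR minimum sample and an 80% merge rate, the interval spans roughly four to five percentage points on either side, which is why lower‑volume agents display visibly wider CIs.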

Our Mission

Creating transparency in AI coding through real‑world insights and community‑driven evaluations

Real‑World Insights

Our rankings showcase authentic data from how coding agents are actually used in development workflows — not synthetic benchmarks. This reveals performance in real engineering contexts.

Best Agents for Everyone

We're committed to bringing the best coding agents to everyone and improving them through real‑world community evaluations.

Open Community Space

We create an open platform to try the best agents and shape their future through collective feedback and insights.

How Rankings Work

From gateway telemetry to normalized, comparable metrics

Ranking Metrics
  • Merge Rate — share of non‑draft PRs that merge. Shown with 95% Wilson CI.
  • Usage Share — market share by total PRs created within the window (proxy for adoption/throughput).
  • Avg Cost per Merged PR — total model spend / merged PRs. Prices normalized to provider public on‑demand at event time.
  • Success Data — counts of PRs created, merged, and ready‑for‑review. Drafts excluded for apples‑to‑apples comparison.
Standardization & Fairness
  • Created PRs = Non‑draft PRs. We evaluate only review‑ready PRs to focus on mergeable output.
  • Workflow‑agnostic. Both private‑iteration and draft‑first workflows are normalized by measuring on non‑draft PRs.
  • Wilson CI tooltips appear on merge metrics to communicate the statistical uncertainty implied by each agent's sample size.
  • CI/CD context: we include PRs with at least one successful CI run and dedupe retriggers & bot forks so each change is counted once (see the sketch below).
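
To make the definitions above concrete, here is a simplified sketch of how the topline metrics could be derived from deduplicated, non‑draft PR records. The PRRecord shape and its field names are hypothetical illustrations, not Modu's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PRRecord:              # hypothetical shape of one gateway telemetry record
    pr_key: str              # stable identifier used to dedupe retriggers and bot forks
    agent: str
    is_draft: bool
    merged: bool
    ci_passed: bool          # at least one successful CI run
    model_cost_usd: float    # spend at the provider's public on-demand price at event time

def topline_metrics(records: list[PRRecord]) -> dict[str, dict[str, float]]:
    # Eligibility: non-draft, CI-verified, and counted only once per change.
    seen: set[str] = set()
    eligible = []
    for r in records:
        if r.is_draft or not r.ci_passed or r.pr_key in seen:
            continue
        seen.add(r.pr_key)
        eligible.append(r)
    if not eligible:
        return {}

    total_prs = len(eligible)
    metrics: dict[str, dict[str, float]] = {}
    for agent in {r.agent for r in eligible}:
        prs = [r for r in eligible if r.agent == agent]
        merged = [r for r in prs if r.merged]
        spend = sum(r.model_cost_usd for r in prs)
        metrics[agent] = {
            "merge_rate": len(merged) / len(prs),          # Merge Rate
            "usage_share": len(prs) / total_prs,           # Usage Share
            "avg_cost_per_merged_pr": spend / len(merged) if merged else float("nan"),
        }
    return metrics
```

Every topline card is computed from the same eligible set and the same rolling window, which is what keeps Merge Rate, Usage Share, and Avg Cost per Merged PR comparable across agents.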

Who Contributes Data

Signals drawn from professional software teams using Modu in production

  • Business usage: Rankings are built from organizations using Modu in production or pre‑production environments. We filter out personal playgrounds, demo repos, and synthetic tests.
  • Professional teams: Data reflects workflows of real engineering teams shipping software (code review, CI, and merge policies in place).

Agent Workflow Differences

Why we exclude drafts from topline metrics

Private Iteration Agents

Iterate privately and open non‑draft PRs directly, which typically means fewer drafts and higher observed merge rates.

Public Iteration Agents

Start with draft PRs for public iteration, then mark them ready for review. Our standardization keeps the comparison fair.

Note: Draft‑only activity is excluded from counts. All topline cards measure on non‑draft PRs.
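
One way to read that rule (our reading, not a statement of Modu's internals): a PR that starts as a draft counts once it is marked ready for review, while a PR that never leaves draft stays out of the counts entirely. A tiny sketch with hypothetical fields:

```python
# Hypothetical records: one final state per PR at the end of the window.
prs = [
    {"agent": "A", "opened_as_draft": False, "ready_for_review": True},   # direct non-draft PR
    {"agent": "B", "opened_as_draft": True,  "ready_for_review": True},   # draft later marked ready: counted
    {"agent": "B", "opened_as_draft": True,  "ready_for_review": False},  # draft-only iteration: excluded
]

# Topline counts only ever see review-ready PRs, regardless of how the agent got there.
countable = [p for p in prs if p["ready_for_review"]]
```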

Privacy & Data Collection

What we keep (and what we never store)

What We Store
  • Metadata only: timestamps, agent identity, token counts, PR creation/merge status, provider, and cost signals (see the sketch below).
  • No code storage: ephemeral sandboxes; VMs destroyed after each session.
  • Anonymous analytics: de‑identified and never linked to a user ID.
What We Don't Store
  • Your prompts (unless you explicitly opt in)
  • Your code or model responses (unless you explicitly opt in)
  • Any personally identifiable information
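
As a rough illustration of the policy above (every field name here is ours, not Modu's schema), a stored event would carry metadata along these lines and nothing resembling prompts, code, or model output:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TelemetryEvent:        # hypothetical metadata-only record; names are illustrative
    timestamp: datetime
    agent: str               # agent identity
    provider: str            # model provider that served the request
    prompt_tokens: int       # token counts only, never token contents
    completion_tokens: int
    pr_created: bool         # PR creation/merge status
    pr_merged: bool
    cost_usd: float          # cost signal
    org_hash: str            # de-identified; never linked to a user ID
    # No prompt text, no code, no model responses, no PII.
```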

Why Opt In to Data Sharing?

Opting in unlocks better community rankings and richer analytics for you

Support Fair Rankings

Help create transparent, public evaluations that shape the development of AI models and coding agents.

Detailed Analytics

Access comprehensive analytics showing your agent usage, token consumption, and cost insights over time.

AI Provider Policies

Training controls and transparent retention

Our Approach

Modu proxies requests to providers and honors your training controls. Providers with unclear policies aren't used unless you enable model training.

Provider Independence

Each provider has distinct retention rules. We surface these so teams can choose options that fit compliance needs.

Learn More

Read the details on our security practices, data handling, and privacy controls.