Agent Eval Harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

⭐ Star this repo to bookmark — fresh data every 15 minutes

English · 中文 · 日本語 · 한국어 · Español · Português

💡 What is this?

A standardized benchmark suite that runs coding agents against live, real-world GitHub issues that include reproduction steps. Unlike static academic benchmarks, it publishes a public leaderboard, updated weekly, so developers can compare agents such as OpenCode, Codex, and Claude Code in realistic scenarios.

This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so what you see is current.

📋 Current Items

⏰ Last updated: 2026-06-25 02:08 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick.

#	Name	⭐	Lang	Updated	Description
1	…

score = signal / reach

signal = 0.25·commit_velocity     // commits in last 90 days (cap 30)
       + 0.20·contributor_work    // unique authors × velocity (cap 100)
       + 0.20·issue_resolution    // closed ÷ total issues
       + 0.20·fork_ratio          // forks ÷ stars (proxy for real usage)
       + 0.10·release_cadence     // releases in 90 days (cap 3)
       + age_bonus                // +0 to +0.30 after 6 months
       + homepage_bonus           // +0.05 if homepage is set

reach  = log₁₀(stars + watchers + 10)
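The formula above can be sketched in Python. This is an illustration under stated assumptions, not the harness's actual code: the function and parameter names are hypothetical, each capped term is assumed to be scaled to [0, 1] by dividing by its cap, "velocity" in `contributor_work` is assumed to mean the raw 90-day commit count, `fork_ratio` is assumed to be clipped at 1, and `age_bonus` is passed in directly because the source does not say how it ramps from 0 to 0.30.

```python
import math


def capped(value, cap):
    """Clip value to [0, cap] and scale to [0, 1] (assumed normalization)."""
    return min(max(value, 0), cap) / cap


def signal(commits_90d, unique_authors, closed_issues, total_issues,
           forks, stars, releases_90d, age_bonus=0.0, has_homepage=False):
    commit_velocity = capped(commits_90d, 30)
    # Assumption: "unique authors × velocity" uses the raw commit count.
    contributor_work = capped(unique_authors * commits_90d, 100)
    issue_resolution = closed_issues / total_issues if total_issues else 0.0
    # Assumption: the ratio is clipped at 1 so it stays comparable to other terms.
    fork_ratio = min(forks / stars, 1.0) if stars else 0.0
    release_cadence = capped(releases_90d, 3)
    return (0.25 * commit_velocity
            + 0.20 * contributor_work
            + 0.20 * issue_resolution
            + 0.20 * fork_ratio
            + 0.10 * release_cadence
            + min(max(age_bonus, 0.0), 0.30)   # caller supplies the ramp
            + (0.05 if has_homepage else 0.0))


def reach(stars, watchers):
    # The +10 keeps the logarithm at or above 1, so score never divides by zero.
    return math.log10(stars + watchers + 10)


def score(stars, watchers, **signal_inputs):
    return signal(stars=stars, **signal_inputs) / reach(stars, watchers)
```

Dividing by `reach` means that, for the same activity signal, a small repository scores higher than a heavily starred one, which appears to be the intent of the formula.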

health = 0.35·recency         // days since last commit (90d decay)
       + 0.25·cadence         // commit rhythm consistency
       + 0.20·issue_health    // closed ÷ total issues
       + 0.20·pr_health       // merged ÷ total PRs
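A minimal Python sketch of the health formula follows. Several details are assumptions rather than facts from this README: the names are hypothetical, the "90d decay" is modeled as a linear drop from 1 to 0 over 90 days, `cadence` is taken as a precomputed value in [0, 1] because its definition is not given, and the weighted sum is multiplied by 100 to land on a 0 to 100 scale.

```python
def recency(days_since_last_commit, window=90):
    """Assumed linear decay: 1.0 for a commit today, 0.0 at `window` days or older."""
    return max(0.0, 1.0 - days_since_last_commit / window)


def ratio(part, total):
    """Safe division; a repo with no issues or PRs contributes 0 for that term."""
    return part / total if total else 0.0


def health(days_since_last_commit, cadence,
           closed_issues, total_issues, merged_prs, total_prs):
    raw = (0.35 * recency(days_since_last_commit)
           + 0.25 * cadence                        # assumed to be in [0, 1]
           + 0.20 * ratio(closed_issues, total_issues)
           + 0.20 * ratio(merged_prs, total_prs))
    return round(100 * raw, 1)                     # assumed 0-100 scale
```

The four weights sum to 1.0, so a repository that maxes out every component reaches exactly 100.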

Healthy	80 – 100	active, responsive, regular releases
Stable	60 – 79	maintained, steady, no alarms
Quiet	40 – 59	slowing down; watch this one
At Risk	0 – 39	going dark · candidate for rescue
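Mapping a health value to one of the four tiers above is a simple threshold lookup. The sketch below uses a hypothetical `tier` function and assumes fractional values fall into the band whose lower bound they meet (so 79.9 counts as Stable).

```python
# Lower bound of each band, checked from highest to lowest.
TIERS = [(80, "Healthy"), (60, "Stable"), (40, "Quiet"), (0, "At Risk")]


def tier(health_score):
    for lower_bound, name in TIERS:
        if health_score >= lower_bound:
            return name
    return "At Risk"  # defensive fallback for negative input
```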

solo_builder	one person holds > 80% of commits (last 180d)
needs_contributors	has open "help wanted" or "good first issue" labels
hidden_gem	< 100 stars · active in last 3 months · documented
legacy_hero	repo > 5 years old · committed this year
fork_magnet	forks/stars > 0.5 · used as template or dependency
release_machine	5+ releases in the last 90 days
under_pressure	> 10 open issues · ≤ 2 contributors · health ≥ 60
community_watch	watchers > stars · devs tracking before the public
community_hub	GitHub Discussions enabled · > 20 discussions
funded	maintainer has active funding channel
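The badge criteria above read naturally as a table of predicates. The following sketch shows one way to express them; `RepoStats` and all of its field names are hypothetical, "active in last 3 months" is assumed to mean a commit within 90 days, and how "documented" is detected is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical container; field names are illustrative only."""
    top_author_commit_share_180d: float = 0.0
    has_help_wanted_labels: bool = False
    stars: int = 0
    forks: int = 0
    watchers: int = 0
    days_since_last_commit: int = 0
    is_documented: bool = False
    age_years: float = 0.0
    committed_this_year: bool = False
    releases_90d: int = 0
    open_issues: int = 0
    contributors: int = 0
    health: float = 0.0
    discussions_enabled: bool = False
    discussion_count: int = 0
    has_funding: bool = False


# One predicate per badge, in the order the badges are listed.
RULES = {
    "solo_builder": lambda r: r.top_author_commit_share_180d > 0.80,
    "needs_contributors": lambda r: r.has_help_wanted_labels,
    "hidden_gem": lambda r: (r.stars < 100
                             and r.days_since_last_commit <= 90
                             and r.is_documented),
    "legacy_hero": lambda r: r.age_years > 5 and r.committed_this_year,
    "fork_magnet": lambda r: r.stars > 0 and r.forks / r.stars > 0.5,
    "release_machine": lambda r: r.releases_90d >= 5,
    "under_pressure": lambda r: (r.open_issues > 10
                                 and r.contributors <= 2
                                 and r.health >= 60),
    "community_watch": lambda r: r.watchers > r.stars,
    "community_hub": lambda r: r.discussions_enabled and r.discussion_count > 20,
    "funded": lambda r: r.has_funding,
}


def badges(repo):
    """Return the names of all badges whose rule matches, in table order."""
    return [name for name, rule in RULES.items() if rule(repo)]
```

For example, a repository with 50 stars, 30 forks, a commit 10 days ago, documentation, five recent releases, and one author holding 90% of commits would collect `solo_builder`, `hidden_gem`, `fork_magnet`, and `release_machine`.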