Thriving · Python
51
agent-eval-harness linny006/agent-eval-harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

HEALTH 60 / Stable
// solo builder // help wanted
agent-evaluation · benchmark · dev-tools · ai-coding-agent-benchmark · codex-vs-opencode · agent-eval · ai-benchmarks · ai-engineering · ai-evaluation · ai-research

// readme

Agent Eval Harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

⭐ Star this repo to bookmark — fresh data every 15 minutes



💡 What is this?

A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.

This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source — new items added, expired items removed — so you can rely on what you see being current.


📋 Current Items

⏰ Last updated: 2026-06-25 02:08 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick.

#   Name   Lang   Updated   Description
1
The Undervalued Score

How much attention a project earns through its work versus how much it actually gets. Above 50 means the work is outrunning its audience. Recomputed nightly from commit velocity, contributor effort, issue resolution, fork utility, release cadence, and project maturity, with the combined signal divided by a logarithmic reach factor.

score  = signal / reach

signal = 0.25·commit_velocity   // commits in last 90 days (cap 30)
       + 0.20·contributor_work  // unique authors × velocity (cap 100)
       + 0.20·issue_resolution  // closed ÷ total issues
       + 0.20·fork_ratio        // forks ÷ stars (proxy for real usage)
       + 0.10·release_cadence   // releases in 90 days (cap 3)
       + age_bonus              // +0 to +0.30 after 6 months
       + homepage_bonus         // +0.05 if homepage is set

reach  = log₁₀(stars + watchers + 10)
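
The formula above can be sketched in Python as follows. This is a minimal illustration, not the site's actual implementation: the function name `undervalued_score`, its parameter names, the way capped inputs are normalised to 0..1, and the handling of empty denominators are all assumptions. The formula as written yields a small ratio, so the step that maps it onto the scale where "above 50" is meaningful is not shown in the source and is left out here.

```python
import math


def undervalued_score(commits_90d, unique_authors, closed_issues, total_issues,
                      forks, stars, watchers, releases_90d,
                      age_bonus=0.0, has_homepage=False):
    """Return the raw signal / reach ratio (no rescaling applied)."""
    # Assumption: each capped input is normalised to 0..1 by dividing by its cap.
    commit_velocity = min(commits_90d, 30) / 30
    contributor_work = min(unique_authors * commits_90d, 100) / 100
    # Assumption: a repo with no issues or no stars scores 0 on that term.
    issue_resolution = closed_issues / total_issues if total_issues else 0.0
    fork_ratio = forks / stars if stars else 0.0
    release_cadence = min(releases_90d, 3) / 3
    homepage_bonus = 0.05 if has_homepage else 0.0

    signal = (0.25 * commit_velocity
              + 0.20 * contributor_work
              + 0.20 * issue_resolution
              + 0.20 * fork_ratio
              + 0.10 * release_cadence
              + age_bonus        # caller supplies a value between 0 and 0.30
              + homepage_bonus)
    # The +10 keeps reach at 1 or above, so the division is always safe.
    reach = math.log10(stars + watchers + 10)
    return signal / reach
```

Under these assumptions, a repo with 30 commits, 4 authors, 8 of 10 issues closed, 2 forks, 8 stars, 2 watchers, and 3 releases gets a signal of 0.76 and a reach of log₁₀(20), for a raw ratio of roughly 0.58.
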
The Health Score

Is the project alive and maintained right now? A 0–100 pulse recomputed nightly from commit recency, rhythm, how fast issues close, and how quickly PRs get merged.

health = 0.35·recency       // days since last commit (90d decay)
       + 0.25·cadence       // commit rhythm consistency
       + 0.20·issue_health  // closed ÷ total issues
       + 0.20·pr_health     // merged ÷ total PRs
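
As a rough sketch, the weighted sum above might be computed like this in Python. The function name `health_score` and its parameters are hypothetical. The source does not say how the 90-day decay or the cadence measure is computed, so the code assumes a linear decay and takes cadence as a caller-supplied value between 0 and 1. Multiplying by 100 to reach the 0–100 range is also an assumption, based on the weights summing to 1.

```python
def health_score(days_since_commit, cadence, closed_issues, total_issues,
                 merged_prs, total_prs):
    """Return a 0-100 health value from the four weighted components."""
    # Assumption: recency falls linearly from 1 (commit today) to 0 at 90 days.
    recency = min(1.0, max(0.0, 1.0 - days_since_commit / 90))
    # Assumption: cadence is precomputed elsewhere; clamp it to 0..1.
    cadence = min(1.0, max(0.0, cadence))
    # Assumption: no issues or no PRs at all counts as 0 for that component.
    issue_health = closed_issues / total_issues if total_issues else 0.0
    pr_health = merged_prs / total_prs if total_prs else 0.0

    return 100 * (0.35 * recency
                  + 0.25 * cadence
                  + 0.20 * issue_health
                  + 0.20 * pr_health)
```
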
Health bands

The colour and label on every card come straight from the health score.

Healthy   80 – 100   active, responsive, regular releases
Stable    60 – 79    maintained, steady, no alarms
Quiet     40 – 59    slowing down — watch this one
At Risk    0 – 39    going dark · candidate for rescue
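
The band lookup is a simple threshold check. A minimal sketch, with the hypothetical function name `health_band`; how fractional scores between bands (for example 79.5) are treated is an assumption, handled here by comparing against each band's lower bound.

```python
def health_band(score):
    """Map a 0-100 health score to its band label."""
    if score >= 80:
        return "Healthy"
    if score >= 60:
        return "Stable"
    if score >= 40:
        return "Quiet"
    return "At Risk"
```

With this mapping, the card's health score of 60 falls in the Stable band, matching the "HEALTH 60 / Stable" label at the top of the page.
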
// Tags: what each label means

Tags are independent behavioral signals computed nightly. A project can hold multiple at once. They drive the home page sections.

solo_builder      one person holds > 80% of commits (last 180d)
needs_contributors has open "help wanted" or "good first issue" labels
hidden_gem        < 100 stars · active in last 3 months · documented
legacy_hero       repo > 5 years old · committed this year
fork_magnet       forks/stars > 0.5 · used as template or dependency
release_machine   5+ releases in the last 90 days
under_pressure    > 10 open issues · ≤ 2 contributors · health ≥ 60
community_watch   watchers > stars · devs tracking before the public
community_hub     GitHub Discussions enabled · > 20 discussions
funded            maintainer has active funding channel
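
Because the tags are independent, each rule can be evaluated on its own and the matches collected. The sketch below illustrates that idea. The `RepoStats` dataclass, its field names, and `compute_tags` are hypothetical. "Active in last 3 months" is read as a commit within 90 days and "documented" as having a README, both of which are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    # Hypothetical input record; field names are illustrative only.
    top_author_commit_share: float = 0.0  # top author's commit share, last 180d
    has_help_wanted_issues: bool = False  # open "help wanted" / "good first issue"
    stars: int = 0
    forks: int = 0
    watchers: int = 0
    days_since_last_commit: int = 0
    has_readme: bool = False
    age_years: float = 0.0
    committed_this_year: bool = False
    releases_90d: int = 0
    open_issues: int = 0
    contributors: int = 0
    health: float = 0.0
    discussions_enabled: bool = False
    discussion_count: int = 0
    has_funding: bool = False


def compute_tags(r):
    """Return every tag whose rule matches; a repo can hold several at once."""
    rules = {
        "solo_builder": r.top_author_commit_share > 0.80,
        "needs_contributors": r.has_help_wanted_issues,
        "hidden_gem": (r.stars < 100 and r.days_since_last_commit <= 90
                       and r.has_readme),
        "legacy_hero": r.age_years > 5 and r.committed_this_year,
        "fork_magnet": r.stars > 0 and r.forks / r.stars > 0.5,
        "release_machine": r.releases_90d >= 5,
        "under_pressure": (r.open_issues > 10 and r.contributors <= 2
                           and r.health >= 60),
        "community_watch": r.watchers > r.stars,
        "community_hub": r.discussions_enabled and r.discussion_count > 20,
        "funded": r.has_funding,
    }
    # Dicts preserve insertion order, so tags come back in the order listed above.
    return [name for name, matched in rules.items() if matched]
```

Keeping the rules in a single mapping makes it straightforward to add or retire a tag without touching the others.
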
Why rank against stars at all?

Stars are an outcome, not effort. A project with 8 stars and daily commits is doing more interesting work than one coasting on 8k. We measure the building, then divide by the attention already received — so the genuinely undervalued rise to the top.

// stars   = lagging indicator
// commits = leading indicator
// we rank by the leading one