Live, open-source benchmark for comparing AI coding agents on real GitHub issues
Agent Eval Harness
⭐ Star this repo to bookmark — fresh data every 15 minutes
English · 中文 · 日本語 · 한국어 · Español · Português
💡 What is this?
A standardized benchmark suite that runs coding agents against live, real-world GitHub issues that include reproduction steps. Unlike static academic benchmarks, it publishes a public leaderboard that is updated weekly, so developers can compare agents such as OpenCode, Codex, and Claude Code in realistic scenarios.
This list is auto-updated every 15 minutes by a GitHub Actions cron job. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so what you see here is current.
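To illustrate the "commit only on real change" behavior described above, here is a minimal sketch in Python. It is not this repository's actual update script: the function names `diff_items` and `should_commit` and the example item names are hypothetical, and it assumes each item is identified by a unique string such as a repository's full name.

```python
from typing import Iterable, Set, Tuple


def diff_items(previous: Iterable[str], current: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) item names between two snapshots.

    Assumption: each item is keyed by a unique string identifier.
    """
    prev, curr = set(previous), set(current)
    added = curr - prev      # present now, absent in the last snapshot
    removed = prev - curr    # present in the last snapshot, gone now
    return added, removed


def should_commit(previous: Iterable[str], current: Iterable[str]) -> bool:
    """A cron tick would only commit when something was added or removed."""
    added, removed = diff_items(previous, current)
    return bool(added or removed)


if __name__ == "__main__":
    # Hypothetical snapshots from two consecutive cron ticks.
    last_tick = ["example-org/agent-a", "example-org/agent-b"]
    this_tick = ["example-org/agent-b", "example-org/agent-c"]

    added, removed = diff_items(last_tick, this_tick)
    print("added:", sorted(added))
    print("removed:", sorted(removed))
    print("commit needed:", should_commit(last_tick, this_tick))
```

Comparing sets rather than rendered text means that reordering alone would not trigger a commit under this sketch; whether the real workflow behaves that way is not stated in this README.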
📋 Current Items
⏰ Last updated: 2026-06-25 02:08 UTC
Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.
| # | Name | ⭐ | Lang | Updated | Description |
|---|---|---|---|---|---|
| 1 | … |