HOW TO READ THE LEADERBOARD

v2 evaluates capability across five difficulty bands: TRIVIAL → EASY → MEDIUM → HARD → EXPERT. The headline score is a weighted composite across the bands; bands are not designed to saturate at 1.0 (EXPERT cases discriminate between frontier models). Each score carries a 95% bootstrap confidence interval. Judging uses a council of 7 diverse LLM judges, aggregated with a trimmed mean plus same-family weighting. Subset and smoke runs are visible on /runs but excluded from the leaderboard.
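
To make the scoring terms above concrete, here is a minimal Python sketch of a trimmed mean over judge scores, a weighted composite across bands, and a percentile bootstrap CI. It is an illustration under stated assumptions, not the leaderboard's actual implementation: the band weights, the trim width, the resample count, and all function names are hypothetical, and same-family weighting is left out because the text above does not define it.

```python
import random
from statistics import mean

# Hypothetical band weights: the page only says the headline is "weighted".
BAND_WEIGHTS = {"TRIVIAL": 1, "EASY": 2, "MEDIUM": 3, "HARD": 4, "EXPERT": 5}


def trimmed_mean(scores, trim=1):
    """Drop the `trim` lowest and `trim` highest scores, then average the rest."""
    if len(scores) <= 2 * trim:
        raise ValueError("not enough scores to trim")
    kept = sorted(scores)[trim:len(scores) - trim]
    return mean(kept)


def weighted_composite(band_scores, weights=BAND_WEIGHTS):
    """Weighted average of per-band scores, normalised by the weights in use."""
    total = sum(weights[band] for band in band_scores)
    return sum(weights[band] * s for band, s in band_scores.items()) / total


def bootstrap_ci(case_scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(case_scores)
    # Resample the cases with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(case_scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Seven made-up judge scores for one case, on an assumed 0..1 scale.
judges = [0.1, 0.5, 0.6, 0.7, 0.7, 0.8, 1.0]
case_score = trimmed_mean(judges)  # averages the middle five scores
```

Trimming one score from each end keeps a single outlier judge from moving the case score, and the percentile bootstrap resamples whole cases, so the interval reflects case-to-case variation rather than judge disagreement.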

01 LEADERBOARD

RANK  MODEL  OVERALL  SAMPLE
01           76.7     n=90 · all bands
02           67.0     n=90 · all bands
03           49.8     n=90 · all bands
3 MODELS · SNAPSHOT v2-latest-overall · GENERATED Sat, 02 May 2026 21:03:45 GMT