v2 evaluates capability across five difficulty bands: TRIVIAL → EASY → MEDIUM → HARD → EXPERT. The headline score is a weighted composite of the band scores; bands are not designed to saturate at 1.0 (EXPERT cases exist to discriminate between frontier models). Each score carries a 95% bootstrap CI, and judging uses a council of 7 diverse LLM judges, aggregated by trimmed mean with same-family weighting. Subset and smoke runs are visible on /runs but excluded from the leaderboard.
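The scoring pipeline described above can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual implementation: the band weights, the trim depth, the resample count, and all function names are assumptions made for the example, scores are assumed to lie in [0, 1], and the same-family weighting step is omitted because the page does not say how it is defined.

```python
import random

# Hypothetical band weights; the actual values are not given on this page.
BAND_WEIGHTS = {"TRIVIAL": 1, "EASY": 2, "MEDIUM": 3, "HARD": 4, "EXPERT": 5}


def trimmed_mean(scores, trim=1):
    """Drop the `trim` lowest and `trim` highest scores, then average the rest."""
    if len(scores) <= 2 * trim:
        raise ValueError("not enough scores to trim")
    kept = sorted(scores)[trim:len(scores) - trim]
    return sum(kept) / len(kept)


def weighted_composite(band_scores, weights=BAND_WEIGHTS):
    """Weighted average of per-band scores (assumed composite formula)."""
    total_weight = sum(weights[band] for band in band_scores)
    weighted_sum = sum(weights[band] * s for band, s in band_scores.items())
    return weighted_sum / total_weight


def bootstrap_ci(case_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-case score."""
    rng = random.Random(seed)
    n = len(case_scores)
    # Resample the cases with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(case_scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With seven judge scores for one case, `trimmed_mean(judge_scores)` discards the single highest and lowest vote before averaging, which limits the influence of one outlying judge. Passing the 90 per-case scores of a run to `bootstrap_ci` would give a 95% interval under the default `alpha=0.05`.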
01 LEADERBOARD
| RANK | MODEL | OVERALL | | | | |
|---|---|---|---|---|---|---|
| 01 | Deepseek V4 Flash · deepseek · n=90 · all bands | 76.7 | 2.70¢ | 935ms | 155ms | 2026-05-02 21:03:45Z |
| 02 | Gpt 5 4 Nano · openai · n=90 · all bands | 67.0 | 3.10¢ | 1699ms | 319ms | 2026-05-02 21:03:45Z |
| 03 | Mistral Nemo · mistral · n=90 · all bands | 49.8 | 4.70¢ | 1515ms | 135ms | 2026-05-02 21:03:45Z |
3 MODELS · SNAPSHOT v2-latest-overall · GENERATED Sat, 02 May 2026 21:03:45 GMT
Q 0.976 · LAT p50 1.2s · p95 2.7s · p99 5.1s · JUDGE gpt-4.1-mini · Q-DEPTH 0 · $/EVAL $0.00014