Golden-Eval

HOW TO READ THE LEADERBOARD

v2 evaluates capability across five difficulty bands: TRIVIAL → EASY → MEDIUM → HARD → EXPERT. The headline score is a weighted composite of the bands; bands are not designed to saturate at 1.0 (EXPERT cases discriminate between frontier models). Each score carries a 95% bootstrap confidence interval, and judging uses a council of 7 diverse LLM judges, aggregated with a trimmed mean plus same-family weighting. Subset and smoke runs are visible on /runs but excluded from the leaderboard.
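As a rough illustration of the two statistics named above, the sketch below computes a trimmed mean over one case's judge scores and a percentile-bootstrap 95% CI over per-case scores. It is not the leaderboard's actual implementation: the function names, the trim depth of one score per side, the resample count, and the sample scores are all assumptions, and the same-family weighting step is omitted because the page does not define it.

```python
import random
import statistics


def trimmed_mean(scores, trim=1):
    """Drop the `trim` lowest and `trim` highest scores, then average the rest.

    trim=1 is an assumed setting; the page does not state the trim depth.
    """
    if len(scores) <= 2 * trim:
        raise ValueError("not enough scores to trim")
    kept = sorted(scores)[trim:len(scores) - trim]
    return statistics.fmean(kept)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of `values` (95% when alpha=0.05)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical scores from a 7-judge council for a single case.
judge_scores = [0.1, 0.5, 0.6, 0.7, 0.7, 0.8, 1.0]
case_score = trimmed_mean(judge_scores)  # averages the middle five scores

# Hypothetical per-case scores for one model.
per_case = [0.9, 0.8, 0.4, 0.7, 1.0, 0.6, 0.5, 0.8, 0.3, 0.9]
low, high = bootstrap_ci(per_case)
print(f"case score {case_score:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Trimming discards the single most lenient and most severe judge per case, so one outlier judge cannot move the case score; the bootstrap interval then reflects sampling variation across cases rather than across judges.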

§01. LEADERBOARD

RANK	MODEL	OVERALL	COST/MTOK	p95 LATENCY	TTFT	LAST SEEN
01	Deepseek V4 Flash · deepseek · n=90 · all bands	76.7	2.70¢	935ms	155ms	2026-05-02 21:03:45Z
02	Gpt 5 4 Nano · openai · n=90 · all bands	67.0	3.10¢	1699ms	319ms	2026-05-02 21:03:45Z
03	Mistral Nemo · mistral · n=90 · all bands	49.8	4.70¢	1515ms	135ms	2026-05-02 21:03:45Z

3 MODELS · SNAPSHOT v2-latest-overall · GENERATED Sat, 02 May 2026 21:03:45 GMT

Q 0.976 | LAT p50 1.2s · p95 2.7s · p99 5.1s | JUDGE gpt-4.1-mini | Q-DEPTH 0 | $/EVAL $0.00014