03 RUNS

11 persisted evaluation runs across all models.

| Model | Coverage | Golden set | Status | Overall | Pass | p95 latency | Cost | Completed |
|---|---|---|---|---|---|---|---|---|
| deepseek/deepseek-v4-flash | full-set (n=42 · all bands) | v2-eval | completed | 76.7% | 33/42 | 7.06s | $5.025045 | 5/2/2026, 9:03:45 PM |
| openai/gpt-5.4-nano | full-set (n=42 · all bands) | v2-eval | completed | 67.0% | 27/42 | 6.28s | $4.629051 | 5/2/2026, 7:05:25 PM |
| mistralai/mistral-nemo | full-set (n=42 · all bands) | v2-eval | completed | 49.8% | 17/42 | 5.81s | $4.679502 | 5/2/2026, 5:51:24 PM |
| mistralai/mistral-nemo | subset (n=12 · missing TRIVIAL/EASY/MEDIUM) | v2-eval | completed | 64.7% | 5/12 | 10.13s | $1.440984 | 4/25/2026, 11:56:08 PM |
| mistralai/mistral-nemo | full-set (n=42 · all bands) | v2-eval | completed | 48.3% | 16/42 | 9.53s | $7.006257 | 4/25/2026, 9:55:03 PM |
| mistralai/mistral-nemo | subset (n=12 · missing TRIVIAL/EASY/MEDIUM) | v2-eval | completed | 71.2% | 7/12 | | $0.000000 | 4/25/2026, 5:52:01 PM |
| openai/gpt-5-nano | subset (n=12 · missing TRIVIAL/EASY/MEDIUM) | v2-eval | completed | 67.3% | 8/12 | | $0.000000 | 4/25/2026, 5:04:33 PM |
| openai/gpt-5-nano | subset (n=12 · missing HARD) | v2-eval | completed | 90.8% | 11/12 | | $0.000000 | 4/25/2026, 4:08:55 PM |
| openai/gpt-5-nano | subset (n=6 · missing HARD/EXPERT) | v2-eval | completed | 100.0% | 6/6 | | $0.000000 | 4/25/2026, 3:58:14 PM |
| google/gemini-2.0-flash-001 | subset (n=6 · missing HARD/EXPERT) | v2-eval | completed | 100.0% | 6/6 | | $0.000000 | 4/25/2026, 3:52:08 PM |
| openai/gpt-5 | subset (n=6 · missing HARD/EXPERT) | v2-eval | completed | 100.0% | 6/6 | | $0.000000 | 4/25/2026, 3:45:53 PM |
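The Pass and Cost columns can be turned into a raw pass rate and a cost per case for the four full-set runs listed above, which makes them easier to compare. The following is a minimal sketch: the names `FULL_SET_RUNS`, `pass_rate`, and `cost_per_case` are illustrative and not part of the dashboard. It also assumes the Cost column is in US dollars for the whole run. The source does not say how the Overall score is computed, so the sketch does not try to reproduce it.

```python
# Full-set runs copied from the listing above:
# (model, completed date, cases passed, cases total, run cost in USD)
FULL_SET_RUNS = [
    ("deepseek/deepseek-v4-flash", "5/2/2026", 33, 42, 5.025045),
    ("openai/gpt-5.4-nano", "5/2/2026", 27, 42, 4.629051),
    ("mistralai/mistral-nemo", "5/2/2026", 17, 42, 4.679502),
    ("mistralai/mistral-nemo", "4/25/2026", 16, 42, 7.006257),
]


def pass_rate(passed: int, total: int) -> float:
    """Percentage of cases that passed (hypothetical helper)."""
    return 100.0 * passed / total


def cost_per_case(cost_usd: float, total: int) -> float:
    """Average cost of one evaluated case (hypothetical helper)."""
    return cost_usd / total


if __name__ == "__main__":
    for model, completed, passed, total, cost in FULL_SET_RUNS:
        print(
            f"{model} ({completed}): "
            f"pass rate {pass_rate(passed, total):.1f}%, "
            f"${cost_per_case(cost, total):.4f} per case"
        )
```

For the deepseek/deepseek-v4-flash run this gives a raw pass rate of about 78.6%, compared with the reported Overall of 76.7%, so Overall appears to be a different metric from the simple pass fraction.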