03 RUNS
11 persisted evaluation runs across all models.
| Model | Coverage | Golden set | Status | Overall | Pass | p95 latency | Cost | Completed |
|---|---|---|---|---|---|---|---|---|
| deepseek/deepseek-v4-flash n=42 · all bands | full-set | v2-eval | completed | 76.7% | 33/42 | 7.06s | $5.025045 | 5/2/2026, 9:03:45 PM |
| openai/gpt-5.4-nano n=42 · all bands | full-set | v2-eval | completed | 67.0% | 27/42 | 6.28s | $4.629051 | 5/2/2026, 7:05:25 PM |
| mistralai/mistral-nemo n=42 · all bands | full-set | v2-eval | completed | 49.8% | 17/42 | 5.81s | $4.679502 | 5/2/2026, 5:51:24 PM |
| mistralai/mistral-nemo n=12 · missing TRIVIAL/EASY/MEDIUM | subset | v2-eval | completed | 64.7% | 5/12 | 10.13s | $1.440984 | 4/25/2026, 11:56:08 PM |
| mistralai/mistral-nemo n=42 · all bands | full-set | v2-eval | completed | 48.3% | 16/42 | 9.53s | $7.006257 | 4/25/2026, 9:55:03 PM |
| mistralai/mistral-nemo n=12 · missing TRIVIAL/EASY/MEDIUM | subset | v2-eval | completed | 71.2% | 7/12 | — | $0.000000 | 4/25/2026, 5:52:01 PM |
| openai/gpt-5-nano n=12 · missing TRIVIAL/EASY/MEDIUM | subset | v2-eval | completed | 67.3% | 8/12 | — | $0.000000 | 4/25/2026, 5:04:33 PM |
| openai/gpt-5-nano n=12 · missing HARD | subset | v2-eval | completed | 90.8% | 11/12 | — | $0.000000 | 4/25/2026, 4:08:55 PM |
| openai/gpt-5-nano n=6 · missing HARD/EXPERT | subset | v2-eval | completed | 100.0% | 6/6 | — | $0.000000 | 4/25/2026, 3:58:14 PM |
| google/gemini-2.0-flash-001 n=6 · missing HARD/EXPERT | subset | v2-eval | completed | 100.0% | 6/6 | — | $0.000000 | 4/25/2026, 3:52:08 PM |
| openai/gpt-5 n=6 · missing HARD/EXPERT | subset | v2-eval | completed | 100.0% | 6/6 | — | $0.000000 | 4/25/2026, 3:45:53 PM |
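
The Pass column is a raw count, so the pass rate has to be derived from it. The sketch below does that for the three full-set runs dated 5/2/2026, using the Pass cells copied from the table. The helper name `pass_rate` is hypothetical and not part of any tool referenced here. The derived rates (for example 33/42, about 78.6%) do not match the Overall column (76.7%), which suggests Overall is a separate score and not the pass rate. That reading is an assumption; the table does not say how Overall is computed.

```python
def pass_rate(pass_cell: str) -> float:
    """Convert a Pass cell such as '33/42' into a percentage."""
    passed, total = (int(part) for part in pass_cell.split("/"))
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total

# Pass cells copied from the three full-set (n=42) runs in the table.
full_set_runs = {
    "deepseek/deepseek-v4-flash": "33/42",
    "openai/gpt-5.4-nano": "27/42",
    "mistralai/mistral-nemo": "17/42",
}

# Round to one decimal place to match the precision of the Overall column.
rates = {model: round(pass_rate(cell), 1) for model, cell in full_set_runs.items()}

for model, rate in rates.items():
    print(f"{model}: {rate}%")
```
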
Q 0.976 · LAT p50 1.2s · p95 2.7s · p99 5.1s · JUDGE gpt-4.1-mini · Q-DEPTH 0 · $/EVAL $0.00014