03 RUNS

2 persisted evaluation runs across all models.

Model | Golden set | Status | Overall | Pass | p95 latency | Cost | Completed
deepseek/deepseek-chat-v3.1 | starter-v1 | completed | 85.1% | 28/30 | 60.65s | $0.006990 | 4/24/2026, 5:15:54 PM
openai/gpt-5.4-nano | starter-v1 | completed | 97.6% | 29/30 | 2.70s | $0.001287 | 4/24/2026, 5:04:03 PM