Eval models like it actually matters.

Golden-Eval is a NIST-aligned evaluation framework for LLMs. Score across 34 dimensions, find Pareto-optimal models, gate releases automatically. Cross-family across OpenAI, Anthropic, and 100+ models on OpenRouter.

Open the dashboard Read the methodology →

LIVE FROM PROD● ONLINE
TOP MODELopenai/gpt-5.4-nanoOVERALL97.6%COST$0.04/1k casesp95 LATENCY2.69sEVALUATIONS30Last run: 2026-04-24 17:04 GMT

HOW IT WORKS

Golden Set

Frozen, versioned test cases. Happy-path, adversarial, RAG, edge — every release tested against the same fixed bar.

Rubric + Judge

34 scoring dimensions. Tiered judges: deterministic rules first, LLM-as-judge for nuance, human escalation for the rest.

Release Gate

Hard, soft, and trend thresholds. Pass or fail, with a CI-ready exit code. No gut calls, no ship-on-vibes.

CAPABILITIES

Built for LLM solutions engineers shipping cross-family. Plug in any model on OpenRouter, run a versioned golden set, score it across the dimensions that match your product, gate the release. Adversarial sweeps included.

MULTI-PROVIDER

OpenAI, Anthropic, OpenRouter (100+ available). One harness, one rubric, every family.

34 DIMENSIONS

Agentic, reliability, engineering, fairness, AILuminate harms — score what your product actually depends on.

PARETO REPORTS

Quality × cost × latency frontier. Find the dominated models and stop paying to lose.

ADVERSARIAL

Jailbreak, prompt injection, tool abuse, harm replays. Red-team on every release.