GOLDEN-EVALREV 1618.04

Eval models like it actually matters.

Golden-Eval is a NIST-aligned evaluation framework for LLMs. Score across 34 dimensions, find Pareto-optimal models, gate releases automatically. Cross-family across OpenAI, Anthropic, and 100+ models on OpenRouter.

LIVE FROM PROD● ONLINE
TOP MODEL
openai/gpt-5.4-nano
OVERALL
97.6%
COST
$0.04/1k cases
p95 LATENCY
2.69s
EVALUATIONS
30
Last run: 2026-04-24 17:04 GMT
HOW IT WORKS
01

Golden Set

Frozen, versioned test cases. Happy-path, adversarial, RAG, edge — every release tested against the same fixed bar.

02

Rubric + Judge

34 scoring dimensions. Tiered judges: deterministic rules first, LLM-as-judge for nuance, human escalation for the rest.

03

Release Gate

Hard, soft, and trend thresholds. Pass or fail, with a CI-ready exit code. No gut calls, no ship-on-vibes.

CAPABILITIES

Built for LLM solutions engineers shipping cross-family. Plug in any model on OpenRouter, run a versioned golden set, score it across the dimensions that match your product, gate the release. Adversarial sweeps included.

MULTI-PROVIDER

OpenAI, Anthropic, OpenRouter (100+ available). One harness, one rubric, every family.

34 DIMENSIONS

Agentic, reliability, engineering, fairness, AILuminate harms — score what your product actually depends on.

PARETO REPORTS

Quality × cost × latency frontier. Find the dominated models and stop paying to lose.

ADVERSARIAL

Jailbreak, prompt injection, tool abuse, harm replays. Red-team on every release.