Golden Set
Frozen, versioned test cases. Happy-path, adversarial, RAG, edge — every release tested against the same fixed bar.
Golden-Eval is a NIST-aligned evaluation framework for LLMs. Score across 34 dimensions, find Pareto-optimal models, gate releases automatically. Cross-family across OpenAI, Anthropic, and 100+ models on OpenRouter.
Frozen, versioned test cases. Happy-path, adversarial, RAG, edge — every release tested against the same fixed bar.
34 scoring dimensions. Tiered judges: deterministic rules first, LLM-as-judge for nuance, human escalation for the rest.
Hard, soft, and trend thresholds. Pass or fail, with a CI-ready exit code. No gut calls, no ship-on-vibes.
Built for LLM solutions engineers shipping cross-family. Plug in any model on OpenRouter, run a versioned golden set, score it across the dimensions that match your product, gate the release. Adversarial sweeps included.
OpenAI, Anthropic, OpenRouter (100+ available). One harness, one rubric, every family.
Agentic, reliability, engineering, fairness, AILuminate harms — score what your product actually depends on.
Quality × cost × latency frontier. Find the dominated models and stop paying to lose.
Jailbreak, prompt injection, tool abuse, harm replays. Red-team on every release.