EvalDog grades your prompt & RAG outputs against real assertions, scores every case, and barks the moment a model update breaks something. Hosted dashboard + a zero-token CLI for CI and AI agents.
$ npx evaldog run shopbot.csv --min 80
✓ Greeting & intent
✓ Product search
✓ Add to cart
✗ Order status contains "delivered"
✓ Refund & escalation
80% 4/5 passed (gate 80%) exit 1
HOW IT WORKS
Drop in a CSV, JSON, or YAML file of test cases: the outputs you already have, plus what to assert.
Every case is checked — contains, equals, regex, valid-JSON, not-empty — and scored pass/fail.
Re-run on every model update. EvalDog flags the moment your score drops (rolling out).
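To make the idea concrete, here is a minimal Python sketch of what checking and scoring a handful of cases could look like. It is an illustration only, not EvalDog's actual code: the field names (`name`, `output`, `assert`, `expected`), the `check` and `score` helpers, and the sample outputs are all assumptions made up for this example.

```python
import json
import re

def check(kind: str, output: str, expected: str = "") -> bool:
    """Apply one assertion to one output (hypothetical helper, not EvalDog's API)."""
    if kind == "contains":
        return expected in output
    if kind == "equals":
        return output == expected
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "valid-json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    if kind == "not-empty":
        return output.strip() != ""
    raise ValueError(f"unknown assertion kind: {kind}")

def score(cases):
    """Return (passed, total, percent) for a list of case dicts."""
    passed = sum(check(c["assert"], c["output"], c.get("expected", "")) for c in cases)
    return passed, len(cases), 100 * passed / len(cases)

# Made-up sample cases; the field names are assumptions for illustration.
cases = [
    {"name": "Greeting & intent", "output": "Hi! How can I help you today?", "assert": "not-empty"},
    {"name": "Product search", "output": '{"results": ["red sneakers"]}', "assert": "valid-json"},
    {"name": "Add to cart", "output": "Added 1 item to your cart.", "assert": "contains", "expected": "Added"},
    {"name": "Order status", "output": "Your order has shipped.", "assert": "contains", "expected": "delivered"},
    {"name": "Refund & escalation", "output": "I'll connect you with a support agent.", "assert": "regex", "expected": r"support|agent"},
]

passed, total, pct = score(cases)
print(f"{pct:.0f}% {passed}/{total} passed")  # prints "80% 4/5 passed"
```

Every check here is a plain string, regex, or JSON operation, which is why grading like this needs no model call.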
FOR CI & AGENTS
The evaldog CLI grades locally with no model calls — so an agent can check 200 outputs with a single shell command instead of streaming every case through the LLM.
# fail the build if quality drops
$ npx evaldog run evals/*.csv --min 90 --json
…
✓ 47 passed
✗ 3 failed
94% 213/226 (gate 90% → exit 1)
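For an agent or a CI step written in Python, a wrapper along these lines could turn the gate into a simple yes/no. This is a sketch under stated assumptions: it assumes the CLI exits with status 0 when the gate is met and non-zero otherwise, and the helper names `evaldog_command` and `run_gate` are hypothetical. Only the `run` subcommand and the `--min` flag shown above are used.

```python
import glob
import subprocess
import sys

def evaldog_command(pattern: str = "evals/*.csv", minimum: int = 90) -> list[str]:
    """Build the command line; the glob is expanded here because no shell is involved."""
    return ["npx", "evaldog", "run", *sorted(glob.glob(pattern)), "--min", str(minimum)]

def run_gate(cmd: list[str]) -> bool:
    """Run the command and report whether it exited with status 0.

    Assumption: exit status 0 means the gate was met, anything else means it was not.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

# Example use in a CI script (commented out so the sketch has no side effects):
# if not run_gate(evaldog_command()):
#     sys.exit("eval gate failed")
```

Because the result is a single exit status, the caller never has to read or parse individual case outputs.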
TRY IT NOW
Five ready-made evals — greeting, search, cart, order status, refund. One click each in the dashboard.
Greeting & Intent
Product Search
Add to Cart
Order Status
Refund & Escalation
Grade your prompts before they ship. Free to try — no card, no setup.