DeepEval brings pytest-style unit testing to LLM applications. It ships 14+ metrics, including hallucination, bias, toxicity, answer relevancy, and faithfulness. The killer feature is CI/CD integration — you can run LLM tests in your GitHub Actions workflow alongside regular unit tests.
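As a rough sketch of the CI/CD angle, a GitHub Actions job can install DeepEval and run its test command like any other pytest suite. The workflow layout, Python version, test file path, and secret name below are assumptions, not prescribed by DeepEval:

```yaml
# Sketch of a GitHub Actions job running DeepEval tests.
# Judge-LLM metrics need an API key, supplied here as a repo secret.
name: llm-tests
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -U deepeval
      - run: deepeval test run tests/test_llm_app.py  # assumed test path
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}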
Write pytest-style test functions, wrap your app's inputs and outputs in test cases, and assert against metrics. Most metrics — HallucinationMetric and AnswerRelevancyMetric among them — use a judge LLM internally to score outputs. DeepEval integrates with Confident AI (their commercial platform) for dataset management and dashboards, but works standalone.
DeepEval: free and open source. Confident AI (dashboard + managed datasets): free tier, Pro $99/month, Enterprise custom.
Engineering teams bringing LLM apps into CI/CD. Popular with teams transitioning from 'eyeballing outputs' to systematic testing.