DeepEval brings pytest-style unit testing to LLM applications. It ships 14+ metrics, including hallucination, bias, toxicity, answer relevancy, and faithfulness. The killer feature is CI/CD integration — you can run LLM tests in your GitHub Actions workflow alongside regular unit tests.
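As a rough sketch of the CI/CD angle, a GitHub Actions job can install DeepEval and run its test command like any other pytest suite. The workflow layout, Python version, test file path, and secret name below are assumptions, not prescribed by DeepEval:

```yaml
# Sketch of a GitHub Actions job running DeepEval tests.
# Judge-LLM metrics need an API key, supplied here as a repo secret.
name: llm-tests
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -U deepeval
      - run: deepeval test run tests/test_llm_app.py  # assumed test path
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}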
Write pytest-style test functions, wrap your app's inputs and outputs in test cases, and assert against metrics. Most metrics — HallucinationMetric and AnswerRelevancyMetric among them — use a judge LLM internally to score outputs. DeepEval integrates with Confident AI (their commercial platform) for dataset management and dashboards, but works standalone.
DeepEval: free and open source. Confident AI (dashboard + managed datasets): free tier, Pro $99/month, Enterprise custom.
Engineering teams bringing LLM apps into CI/CD. Popular with teams transitioning from 'eyeballing outputs' to systematic testing.