DeepEval is an open-source framework that brings unit testing to LLM applications. It provides metrics like faithfulness, answer relevancy, hallucination detection, contextual recall, and toxicity that integrate directly with pytest for CI/CD pipeline testing.
Key use cases include:
DeepEval is used by developers who want to test LLM applications with the same rigor as traditional software testing. The pytest integration means LLM evaluation slots naturally into existing testing workflows and CI/CD pipelines.
All metrics provide explanations for their scores, making it easy to understand why a test passed or failed.
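As a rough illustration of how a score and its explanation can be read back, the sketch below formats a one-line verdict from a metric that has already been measured. The helper names `describe` and `measure_faithfulness` are hypothetical, and the comparison assumes a higher-is-better metric such as faithfulness. The second function calls DeepEval and only works with the package installed and an API key set.

```python
from types import SimpleNamespace


def describe(metric) -> str:
    """Format a verdict from an already-measured metric.

    Assumes the object exposes `score`, `threshold`, and `reason`,
    and that a higher score is better (true for faithfulness).
    """
    verdict = "PASS" if metric.score >= metric.threshold else "FAIL"
    return (
        f"{verdict} score={metric.score:.2f} "
        f"threshold={metric.threshold:.2f} reason={metric.reason}"
    )


def measure_faithfulness() -> str:
    """Score one test case with DeepEval (makes LLM API calls)."""
    # Imported here so the helper above can be used without DeepEval installed.
    from deepeval.metrics import FaithfulnessMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        retrieval_context=["France is a country in Europe. Its capital is Paris."],
    )
    metric = FaithfulnessMetric(threshold=0.7, include_reason=True)
    metric.measure(test_case)  # populates metric.score and metric.reason
    return describe(metric)


# Stand-in object so the formatting can be tried without any API calls.
stub = SimpleNamespace(score=0.92, threshold=0.7, reason="All claims are supported.")
print(describe(stub))
```

Calling `measure_faithfulness()` instead of using the stub would print the explanation generated by the evaluating LLM.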
pip install deepeval
export OPENAI_API_KEY='sk-...'

# test_llm.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric


def test_faithfulness():
    # retrieval_context is the source text the answer must stay faithful to
    test_case = LLMTestCase(
        input='What is the capital of France?',
        actual_output='Paris is the capital of France.',
        retrieval_context=['France is a country in Europe. Its capital is Paris.']
    )
    # The test fails if the faithfulness score falls below 0.7
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])

deepeval test run test_llm.py

Pricing: DeepEval is free and open source. Confident AI (the hosted platform) offers free and paid tiers for dashboards, collaboration, and dataset management. Computing metrics makes LLM API calls, which cost roughly $0.01-0.05 per test case.
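The per-test-case figure can be turned into a rough budget for a full run. The sketch below is plain arithmetic. The function name `estimate_run_cost` is hypothetical, and the default rates are simply the bounds quoted above, so the result is an estimate, not a quote.

```python
def estimate_run_cost(num_cases: int, low: float = 0.01, high: float = 0.05):
    """Return (low, high) estimated API cost in dollars for one eval run.

    Defaults are the approximate per-test-case bounds quoted in the text.
    """
    if num_cases < 0:
        raise ValueError("num_cases must be non-negative")
    return round(num_cases * low, 2), round(num_cases * high, 2)


low, high = estimate_run_cost(800)
print(f"800 cases: about ${low:.2f} to ${high:.2f} per run")
```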
Case studies
200k-user AI coding tool
A popular AI coding tool was shipping regressions to 200k users because their LLM quality checks were ad-hoc and manual. User-reported quality issues were their top support category.
Built an 800-case DeepEval test suite covering correctness, coherence, toxicity, PII leakage, and brand voice. Integrated into CI/CD with automatic blocking on quality regression. Added weekly eval reports to engineering standup.
Critical bugs caught before release increased 4x. User-reported quality issues per release dropped 67%. The engineering team now ships with confidence: one eval run replaced three days of manual QA.
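One way to picture "automatic blocking on quality regression" is a gate that compares per-category pass rates against a stored baseline and returns a non-zero exit code when any category drops. This is a minimal sketch, not the team's actual pipeline: the function names, the 2-point tolerance, and the pass rates are all made up for illustration.

```python
import sys


def find_regressions(baseline, current, tolerance=0.02):
    """Return categories whose pass rate fell more than `tolerance` below baseline."""
    regressions = {}
    for category, base_rate in baseline.items():
        # A category missing from the current run is treated as a 0% pass rate.
        new_rate = current.get(category, 0.0)
        if base_rate - new_rate > tolerance:
            regressions[category] = (base_rate, new_rate)
    return regressions


def gate(baseline, current, tolerance=0.02) -> int:
    """Exit code for CI: 0 when nothing regressed, 1 otherwise."""
    regressions = find_regressions(baseline, current, tolerance)
    for category, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {category}: {old:.0%} -> {new:.0%}", file=sys.stderr)
    return 1 if regressions else 0


# Illustrative numbers only.
BASELINE = {"correctness": 0.98, "toxicity": 0.97, "pii_leakage": 0.99}
CURRENT = {"correctness": 0.90, "toxicity": 0.96, "pii_leakage": 0.99}
exit_code = gate(BASELINE, CURRENT)
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a regression fails the build.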
AI startup, pre-Series A
A pre-Series A AI startup needed to demonstrate production-ready quality to investors. Their informal evaluation process hadn't caught two critical safety and accuracy failures that would have been discovered in diligence.
Built a 500-case DeepEval harness covering factuality (measured vs ground truth), safety (toxic outputs, jailbreak attempts), calibration (confidence vs accuracy), and instruction following. Ran against all 3 candidate models.
Caught 2 critical failures: a safety bypass in Model A and systematic hallucination on out-of-distribution inputs in Model B. Fixed before investor demo. Eval suite adopted as the company's permanent quality gate. Series A closed at $8M.
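Running one set of cases against several candidate models can be sketched as a loop that scores each model on the same inputs. Everything here is an assumption for illustration: the `Model` interface, the toy models, and the substring check, which is a deliberately crude stand-in for a real factuality metric.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model takes a prompt and returns an answer.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected ground-truth answer)


def accuracy(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose expected answer appears in the model output."""
    if not cases:
        return 0.0
    hits = sum(
        1 for prompt, expected in cases
        if expected.lower() in model(prompt).lower()
    )
    return hits / len(cases)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Score every candidate model on the same cases."""
    return {name: accuracy(model, cases) for name, model in models.items()}


# Toy stand-ins for real model clients.
def model_a(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


def model_b(prompt: str) -> str:
    return "I am not sure."


CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
scores = compare({"model_a": model_a, "model_b": model_b}, CASES)
print(scores)
```

Keeping the cases fixed across models is what makes the scores comparable. In a real harness the substring check would be replaced by proper metrics.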
Submit a brief and we'll match you with vetted specialists who have proven DeepEval experience.