What it's used for

DeepEval is an open-source framework that brings unit testing to LLM applications. It provides metrics like faithfulness, answer relevancy, hallucination detection, contextual recall, and toxicity that integrate directly with pytest for CI/CD pipeline testing.

Key use cases include:

RAG evaluation — measuring faithfulness, context precision, and answer relevancy
Hallucination detection — verifying LLM outputs are grounded in provided context
Toxicity testing — screening outputs for harmful or inappropriate content
Regression testing — catching quality degradation when prompts or models change
CI/CD integration — running LLM tests automatically in your deployment pipeline with pytest
Benchmarking — comparing models and prompts with standardized metrics

DeepEval is used by developers who want to test LLM applications with the same rigor as traditional software testing. The pytest integration means LLM evaluation slots naturally into existing testing workflows and CI/CD pipelines.

All metrics provide explanations for their scores, making it easy to understand why a test passed or failed.

Getting started

Install DeepEval:
```
pip install deepeval
```
Set your OpenAI API key (DeepEval's metrics use an LLM as the judge, and OpenAI is the default provider):
```
export OPENAI_API_KEY='sk-...'
```

Write a test and save it as test_llm.py:
```
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric


def test_faithfulness():
    # retrieval_context holds the passages the answer must be grounded in
    test_case = LLMTestCase(
        input='What is the capital of France?',
        actual_output='Paris is the capital of France.',
        retrieval_context=['France is a country in Europe. Its capital is Paris.']
    )
    # The test passes when the faithfulness score is at least 0.7
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Run it with the DeepEval CLI, which wraps pytest:
```
deepeval test run test_llm.py
```
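
The `threshold=0.7` in the test above is a minimum score. As a rough illustration of what that means, the sketch below assumes a faithfulness score is the fraction of claims in the output that the retrieval context supports. In DeepEval an LLM judge extracts and checks the claims; here the verdicts are hand-labeled, and `faithfulness_score` and `passes` are hypothetical helper names, not part of the library.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of claims judged as supported by the retrieval context."""
    if not claim_verdicts:
        # Assumption for this sketch: an output with no claims cannot contradict the context.
        return 1.0
    return sum(claim_verdicts) / len(claim_verdicts)


def passes(score: float, threshold: float = 0.7) -> bool:
    """A test case passes when its score meets or exceeds the threshold."""
    return score >= threshold


# Hand-labeled verdicts for four claims (illustrative only): three supported, one not.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(score, passes(score))  # 0.75 True
```

Raising the threshold makes the test stricter: with the same verdicts, `passes(score, threshold=0.8)` would fail.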

Pricing: DeepEval is free and open source. Confident AI (the hosted platform) offers free and paid tiers for dashboards, collaboration, and dataset management. Metric computation makes LLM API calls, which cost roughly $0.01-0.05 per test case and are billed by your model provider.
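
To budget for a test suite, multiply the per-case range by the number of test cases. The sketch below does that arithmetic in integer cents to avoid floating-point rounding; `estimate_run_cost` is a hypothetical helper, and the 800-case suite size is only an example.

```python
def estimate_run_cost(num_cases: int, low_cents: int = 1, high_cents: int = 5) -> tuple[float, float]:
    """Return the (low, high) estimated cost in dollars for one evaluation run.

    The defaults mirror the rough $0.01-0.05 per-test-case range quoted above.
    """
    if num_cases < 0:
        raise ValueError("num_cases must be non-negative")
    return (num_cases * low_cents / 100, num_cases * high_cents / 100)


low, high = estimate_run_cost(800)
print(f"One run of 800 cases: ${low:.2f} to ${high:.2f}")
# One run of 800 cases: $8.00 to $40.00
```

Actual spend depends on the judge model and how many metrics run per case, so treat the result as an order-of-magnitude estimate.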

Case studies

Real DeepEval projects

Submitted by verified specialists

67% fewer user issues | Developer Tools

800-Test Eval Suite — 67% Fewer User-Reported Issues

200k-user AI coding tool

Challenge

A popular AI coding tool was shipping regressions to 200k users because its LLM quality checks were ad hoc and manual. User-reported quality issues were its top support category.

Solution

Built an 800-case DeepEval test suite covering correctness, coherence, toxicity, PII leakage, and brand voice. Integrated it into CI/CD with automatic blocking on quality regression. Added weekly eval reports to engineering standup.
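
One way to implement automatic blocking on quality regression is to compare each metric's pass rate against a stored baseline and fail the pipeline when any rate drops. The sketch below is a hypothetical illustration, not the team's actual gate: the function names, the 0.02 tolerance, and the pass-rate figures are all invented for the example.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return metrics whose pass rate fell more than `tolerance` below baseline."""
    regressions = {}
    for metric, base_rate in baseline.items():
        # A metric missing from the current run is treated as a 0% pass rate.
        cur_rate = current.get(metric, 0.0)
        if base_rate - cur_rate > tolerance:
            regressions[metric] = (base_rate, cur_rate)
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.02) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = block)."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, (base, cur) in regressions.items():
        print(f"REGRESSION {metric}: {base:.2f} -> {cur:.2f}")
    return 1 if regressions else 0


# Illustrative pass rates per metric (made-up numbers).
baseline = {"correctness": 0.94, "coherence": 0.97, "toxicity": 1.00}
current = {"correctness": 0.88, "coherence": 0.97, "toxicity": 1.00}
exit_code = gate(baseline, current)
print("exit code:", exit_code)
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a non-zero code blocks the merge or deploy.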

Results

Critical bugs caught before release increased 4x. User-reported quality issues per release dropped 67%. The engineering team now ships with confidence — one eval run replaced three days of manual QA.

Tools

DeepEval, Braintrust, Langfuse, Arize Phoenix

2 critical failures caught pre-launch | AI Startup

Series A Technical Diligence — 2 Critical Failures Caught

AI startup, pre-Series A

Challenge

A pre-Series A AI startup needed to demonstrate production-ready quality to investors. Their informal evaluation process hadn't caught two critical safety and accuracy failures that would have been discovered in diligence.

Solution

Built a 500-case DeepEval harness covering factuality (measured against ground truth), safety (toxic outputs, jailbreak attempts), calibration (confidence vs accuracy), and instruction following. Ran it against all 3 candidate models.
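
The case study does not say how calibration was measured. One common choice is expected calibration error (ECE): bucket answers by the model's stated confidence, then take the weighted average gap between mean confidence and accuracy in each bucket. The sketch below is a plain-Python illustration under that assumption; `expected_calibration_error` is a hypothetical helper, not a DeepEval API, and the sample data is made up.

```python
def expected_calibration_error(
    confidences: list[float],
    correct: list[bool],
    num_bins: int = 5,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident model: it claims 90% confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, True])
print(f"ECE: {ece:.2f}")  # ECE: 0.40
```

A well-calibrated model scores near 0; larger values mean stated confidence is a poor guide to accuracy.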

Results

Caught 2 critical failures: a safety bypass in Model A and systematic hallucination on out-of-distribution inputs in Model B. Both were fixed before the investor demo. The eval suite was adopted as the company's permanent quality gate. The Series A closed at $8M.

Tools

DeepEval, DSPy, Hugging Face, Arize Phoenix
