Ragas is the de facto open-source framework for evaluating retrieval-augmented generation (RAG) pipelines. It implements six canonical metrics — faithfulness, answer relevance, context precision, context recall, answer correctness, and answer semantic similarity — that measure different dimensions of RAG quality.
Ragas uses an LLM-as-judge approach: for each metric, it prompts a judge LLM (typically GPT-4 or Claude) to evaluate your generated answer against the question, retrieved contexts, and ground truth. Results are aggregated into a score per metric and per example. It integrates with LangChain, LlamaIndex, and plain Python pipelines.
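To make the evaluation loop concrete, here is a minimal sketch of a single evaluation record and, commented out, the call that would score it. The field names (question/answer/contexts/ground_truth) are an assumption based on common Ragas versions; check the docs for your installed release.

```python
# One evaluation record: the judge LLM sees all four fields.
# Field names are an assumption (typical ragas schema); verify against
# your installed version.
sample = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "contexts": ["France's capital city is Paris."],  # retrieved chunks
    "ground_truth": "Paris",
}

# With ragas installed and a judge-model API key configured, a run looks
# roughly like this (not executed here, since it makes paid LLM calls):
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy
# scores = evaluate(Dataset.from_list([sample]),
#                   metrics=[faithfulness, answer_relevancy])
print(sorted(sample))
```

Metrics that need no ground truth (e.g. faithfulness) can run on records without the `ground_truth` field.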
Ragas itself is free and open source (Apache 2.0). You pay for the judge LLM calls (typically $0.10-$1 per example, depending on the judge model).
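Since judge-LLM calls are the only cost, budgeting an eval run is simple arithmetic; a quick sketch using the per-example range quoted above (the mid-range figure chosen here is illustrative, not a measured number):

```python
# Rough judge-LLM cost estimate for an eval run.
# cost_per_example is an assumed mid-range value from the $0.10-$1 figure;
# actual cost depends on the judge model and prompt sizes.
n_examples = 200
cost_per_example = 0.25  # USD, illustrative
total = n_examples * cost_per_example
print(f"~${total:.2f} for {n_examples} examples")  # ~$50.00 for 200 examples
```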
Most teams building production RAG; it serves as the standard baseline for RAG evaluation in both research and industry.