Metrics Reference

LLMOps Eval includes 20+ evaluation metrics across four categories.

Traditional NLP

| Metric | Description | Use Case |
| --- | --- | --- |
| BLEU | N-gram overlap between output and reference | Translation, text generation |
| ROUGE-L | Longest common subsequence with reference | Summarization |
| Exact Match | Binary match: output exactly equals reference | Q&A, classification |
| Levenshtein | Normalized edit-distance similarity | Fuzzy matching |
| BERTScore | Contextual embedding similarity using BERT | Semantic quality |
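Two of the simpler metrics above can be computed without any model. The sketch below is illustrative only; the function names are not part of the LLMOps Eval API.

```python
def exact_match(output: str, reference: str) -> float:
    """Binary match: 1.0 if the output exactly equals the reference."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity: 1 - distance / max(len)."""
    if not a and not b:
        return 1.0
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

For example, `levenshtein_similarity("kitten", "sitting")` yields an edit distance of 3 over a maximum length of 7, so the similarity is about 0.57.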

RAG-Specific

| Metric | Description |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? Detects hallucinations. |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Is the retrieved context useful for answering? |
| Context Recall | Was all relevant information retrieved? |
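To make faithfulness concrete: a naive proxy is the fraction of the answer's tokens that appear in the retrieved context. Production implementations typically use an LLM to verify each claim; this sketch only illustrates the grounding idea and is not how LLMOps Eval computes the metric.

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the context.

    A crude grounding signal: 1.0 means every answer token is
    present in the context; values near 0 suggest hallucination.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 1.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```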

LLM-as-Judge

Use an LLM to evaluate another LLM's output against defined criteria.

| Metric | Description |
| --- | --- |
| Relevance | How relevant is the response to the query? |
| Coherence | Is the response logically structured? |
| Fluency | Grammar, readability, and natural language quality |
| Toxicity | Harmful, offensive, or biased content detection |
| Custom | Define your own criteria with a custom prompt |
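A custom judge boils down to a prompt template that embeds your criteria, the query, and the response, then asks the judging LLM for a score. The template below is an illustrative sketch, not LLMOps Eval's built-in prompt; send the resulting string to whatever LLM client you use.

```python
# Hypothetical judge prompt template; adapt the rubric to your needs.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Criteria: {criteria}
Query: {query}
Response: {response}
Rate the response from 1 (poor) to 5 (excellent) on the criteria above.
Reply with only the number."""

def build_judge_prompt(query: str, response: str, criteria: str) -> str:
    """Fill the template with one evaluation instance."""
    return JUDGE_TEMPLATE.format(criteria=criteria, query=query, response=response)
```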

Performance

| Metric | Description |
| --- | --- |
| Latency | End-to-end response time in milliseconds |
| Token Count | Input and output token usage |
| Cost | Estimated API cost in USD |
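Cost is derived from token counts and per-token pricing. The prices below are placeholders for illustration; look up your provider's current pricing rather than relying on these numbers.

```python
# Hypothetical USD prices per 1K tokens; not real provider pricing.
PRICES_PER_1K = {
    "example-model": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of one call from its token counts."""
    p = PRICES_PER_1K[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1000
```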

Example Output

```json
{
  "evaluation_id": "eval_abc123",
  "overall_score": 0.84,
  "metrics": {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.79,
    "bleu": 0.65,
    "bertscore": 0.91,
    "latency_ms": 1240,
    "token_count": 312,
    "cost_usd": 0.0004
  }
}
```
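One way to consume a result in this shape is to parse the JSON and flag evaluations whose faithfulness falls below a threshold. The 0.8 cutoff here is an arbitrary illustration, not a recommended value.

```python
import json

# Abbreviated result in the same shape as the example output above.
result_json = """{
  "evaluation_id": "eval_abc123",
  "overall_score": 0.84,
  "metrics": {"faithfulness": 0.92, "latency_ms": 1240}
}"""

result = json.loads(result_json)
# Flag the run if faithfulness is below an (illustrative) 0.8 cutoff.
flagged = result["metrics"]["faithfulness"] < 0.8
```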

Choosing Metrics

| Use Case | Recommended Metrics |
| --- | --- |
| Q&A system | Exact Match, BLEU, Answer Relevancy |
| RAG application | Faithfulness, Context Precision, Context Recall, Answer Relevancy |
| Summarization | ROUGE-L, BERTScore, Coherence |
| Open-ended generation | LLM-as-Judge (Relevance, Coherence, Fluency) |
| Production monitoring | Latency, Token Count, Cost |
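The recommendations above can be expressed as a lookup so a pipeline can select metrics by use case. The keys and metric identifiers below mirror the table but are illustrative; they are not part of the LLMOps Eval API.

```python
# Use-case -> recommended metric identifiers, mirroring the table above.
RECOMMENDED_METRICS = {
    "qa": ["exact_match", "bleu", "answer_relevancy"],
    "rag": ["faithfulness", "context_precision", "context_recall", "answer_relevancy"],
    "summarization": ["rouge_l", "bertscore", "coherence"],
    "open_ended": ["relevance", "coherence", "fluency"],
    "monitoring": ["latency", "token_count", "cost"],
}

def metrics_for(use_case: str) -> list[str]:
    """Return the recommended metrics, or an empty list if unknown."""
    return RECOMMENDED_METRICS.get(use_case, [])
```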