# Metrics Reference

LLMOps Eval includes 20+ evaluation metrics across four categories.

## Traditional NLP
| Metric | Description | Use Case |
|---|---|---|
| BLEU | N-gram overlap between output and reference | Translation, text generation |
| ROUGE-L | Longest common subsequence with reference | Summarization |
| Exact Match | Binary match — output exactly equals reference | Q&A, classification |
| Levenshtein | Normalized edit distance similarity | Fuzzy matching |
| BERTScore | Contextual embedding similarity using BERT | Semantic quality |
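To make the simpler formulas above concrete, here is an illustrative sketch of Exact Match and normalized Levenshtein similarity. This is not the library's own code, just the standard definitions in plain Python:

```python
# Illustrative implementations of two of the simpler traditional metrics.

def exact_match(output: str, reference: str) -> float:
    """1.0 if the output exactly equals the reference, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity: 1 - distance / max(len(a), len(b))."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

For example, `levenshtein_similarity("kitten", "sitting")` has an edit distance of 3 over a maximum length of 7, giving roughly 0.57.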
## RAG-Specific
| Metric | Description |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? Detects hallucinations. |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | What fraction of the retrieved context is actually relevant to the question? |
| Context Recall | Was all relevant information retrieved? |
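The two retrieval-side metrics can be illustrated at the chunk level, assuming you have relevance labels for each chunk. That labeling step is an assumption for this sketch; in practice these judgments are often derived with an LLM rather than supplied by hand:

```python
# Chunk-level illustration of Context Precision and Context Recall.

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return sum(chunk in set(retrieved) for chunk in relevant) / len(relevant)
```

Note the trade-off the two metrics capture: retrieving more chunks can only raise recall, but every irrelevant chunk lowers precision.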
## LLM-as-Judge
Use an LLM to evaluate another LLM's output against defined criteria.
| Metric | Description |
|---|---|
| Relevance | How relevant is the response to the query? |
| Coherence | Is the response logically structured? |
| Fluency | Grammar, readability, and natural language quality |
| Toxicity | Harmful, offensive, or biased content detection |
| Custom | Define your own criteria with a custom prompt |
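A judge metric boils down to a prompt template plus score parsing. The sketch below shows that shape with the model call left out; the prompt wording and the 1-5 scale are assumptions, not the library's actual template:

```python
import re

# Hedged sketch of an LLM-as-judge metric: build a grading prompt,
# then parse the judge model's reply into a normalized score.

JUDGE_PROMPT = """\
You are an impartial evaluator. Rate the RESPONSE for {criterion}
on a scale of 1 (poor) to 5 (excellent).

QUERY: {query}
RESPONSE: {response}

Reply in the form: Score: <number>"""

def parse_score(judge_reply: str, scale: int = 5) -> float:
    """Extract 'Score: N' from the judge's reply and normalize to [0, 1]."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1)) / scale
```

For a Custom metric, the same structure applies: swap in your own criteria text and keep the machine-parseable score format at the end.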
## Performance
| Metric | Description |
|---|---|
| Latency | End-to-end response time in milliseconds |
| Token Count | Input and output token usage |
| Cost | Estimated API cost in USD |
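Cost is typically derived from token counts and a per-token price table. The prices below are placeholders for illustration, not real provider pricing:

```python
# Token-based cost estimation sketch. The per-1K-token prices are
# hypothetical; substitute your provider's actual rates.

PRICES_PER_1K = {"input": 0.0005, "output": 0.0015}  # USD, placeholder values

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in USD for a single request."""
    return (input_tokens / 1000 * PRICES_PER_1K["input"]
            + output_tokens / 1000 * PRICES_PER_1K["output"])
```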
## Example Output

```json
{
  "evaluation_id": "eval_abc123",
  "overall_score": 0.84,
  "metrics": {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.79,
    "bleu": 0.65,
    "bertscore": 0.91,
    "latency_ms": 1240,
    "token_count": 312,
    "cost_usd": 0.0004
  }
}
```
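A common way to consume a result like the one above is to flag quality metrics that fall below a threshold. The 0.8 threshold and the choice of which keys count as quality scores (as opposed to performance counters) are assumptions for this sketch:

```python
import json

# Parse an evaluation result and flag quality metrics below a threshold.
# Performance metrics (latency, tokens, cost) are excluded since they
# are not 0-1 scores.

result = json.loads("""
{
  "evaluation_id": "eval_abc123",
  "overall_score": 0.84,
  "metrics": {
    "faithfulness": 0.92, "answer_relevancy": 0.88,
    "context_precision": 0.79, "bleu": 0.65, "bertscore": 0.91,
    "latency_ms": 1240, "token_count": 312, "cost_usd": 0.0004
  }
}
""")

QUALITY_METRICS = {"faithfulness", "answer_relevancy", "context_precision",
                   "bleu", "bertscore"}

def failing_metrics(result: dict, threshold: float = 0.8) -> list[str]:
    """Return the quality metrics scoring below the threshold, sorted by name."""
    return sorted(name for name, score in result["metrics"].items()
                  if name in QUALITY_METRICS and score < threshold)
```

Running this against the example result would surface `bleu` and `context_precision` as the metrics needing attention.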
## Choosing Metrics
| Use Case | Recommended Metrics |
|---|---|
| Q&A system | Exact Match, BLEU, Answer Relevancy |
| RAG application | Faithfulness, Context Precision, Context Recall, Answer Relevancy |
| Summarization | ROUGE-L, BERTScore, Coherence |
| Open-ended generation | LLM-as-Judge (Relevance, Coherence, Fluency) |
| Production monitoring | Latency, Token Count, Cost |