🧪

llm-evaluation

Safe 🌐 Network access ⚙️ External commands

Build reliable LLM evaluation plans

You need consistent ways to measure LLM quality and regressions. This skill provides metrics, human review guidance, and testing frameworks for reliable AI assessment.

Supports: Claude, Codex, Claude Code (CC)
📊 69 Adequate
1. Download the skill ZIP
2. Upload in Claude: go to Settings → Capabilities → Skills → Upload skill
3. Toggle on and start using

Test it

Using "llm-evaluation". Propose an evaluation plan for a RAG assistant.

Expected outcome:

  • Automated metrics: MRR, NDCG, Precision at K
  • Human ratings: accuracy, relevance, helpfulness
  • LLM judge: pairwise comparison for final answers
  • Regression rule: fail if accuracy drops more than 5 percent
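The retrieval metrics in the plan above can be sketched in pure Python. This is a minimal illustration, not part of the skill itself; the function names and the example ranking are made up for demonstration:

```python
import math

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank for one query: 1/rank of the first relevant hit."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with graded relevance given as {doc_id: gain}."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one query where only "d2" is relevant
ranked = ["d1", "d2", "d3", "d4"]
print(mrr(ranked, {"d2"}))                # 0.5
print(precision_at_k(ranked, {"d2"}, 2))  # 0.5
print(ndcg_at_k(ranked, {"d2": 1}, 4))    # ~0.63
```

Averaging these per-query scores over the whole test set gives the aggregate numbers the regression rule would compare.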

Using "llm-evaluation". What metrics should I use to evaluate a summarization model?

Expected outcome:

  • ROUGE for n-gram overlap with reference summaries
  • BERTScore for semantic similarity using embeddings
  • Factuality score to verify claims against source text
  • Human readability assessment for coherence and fluency
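To make the n-gram-overlap idea behind ROUGE concrete, here is a minimal ROUGE-N sketch in pure Python. Real evaluations typically use a dedicated library (e.g. the `rouge_score` package); this simplified version is for illustration only:

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: precision, recall, and F1 over n-gram overlap."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # precision = recall = 5/6: five of six unigrams overlap
```

BERTScore works differently: it matches tokens by embedding similarity rather than exact n-grams, so it credits paraphrases that ROUGE misses.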

Using "llm-evaluation". How do I detect if my model is regressing?

Expected outcome:

  • Store baseline scores from a reference model or previous version
  • Compare new model scores against baseline on same test set
  • Flag metrics where relative change exceeds your threshold
  • Run statistical significance test to confirm real differences
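The steps above can be combined into one check. The sketch below is illustrative: the 5% relative threshold, the paired bootstrap, and the example scores are all assumptions, not a prescribed pipeline:

```python
import random
import statistics

def regression_check(baseline, candidate, rel_threshold=0.05, n_boot=2000, seed=0):
    """Flag a regression when the candidate's mean score drops more than
    rel_threshold relative to baseline AND a paired bootstrap over the
    per-example differences says the drop is unlikely to be noise.
    Both inputs are per-example scores on the SAME test set."""
    base_mean = statistics.mean(baseline)
    cand_mean = statistics.mean(candidate)
    rel_change = (cand_mean - base_mean) / base_mean
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    # Paired bootstrap: how often does a resampled mean difference reach >= 0?
    hits = sum(
        statistics.mean([rng.choice(diffs) for _ in diffs]) >= 0
        for _ in range(n_boot)
    )
    p_no_drop = hits / n_boot  # small value => the drop is consistent
    regressed = rel_change < -rel_threshold and p_no_drop < 0.05
    return {"rel_change": rel_change, "p_no_drop": p_no_drop, "regressed": regressed}

# Hypothetical per-example accuracy scores for baseline vs. new model
baseline = [0.90, 0.80, 0.85, 0.90, 0.95, 0.88, 0.92, 0.90]
candidate = [0.70, 0.65, 0.70, 0.75, 0.72, 0.68, 0.74, 0.70]
print(regression_check(baseline, candidate))  # flags a clear regression
```

In CI, a check like this would run after evaluation and fail the build when `regressed` is true.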

Security Audit

Safe
v4 • 1/17/2026

This skill contains only static documentation (SKILL.md) with no executable files. All static findings are false positives: markdown code block backticks were misidentified as Ruby/shell command execution, and JSON metadata fields were misclassified as cryptographic issues. The skill provides evaluation guidance only with no data access, network activity, or command execution capability.

Files scanned: 2
Lines analyzed: 649
Findings: 2
Total audits: 4
Audited by: claude. View Audit History →

Quality Score

Architecture: 38
Maintainability: 100
Content: 85
Community: 21
Security: 100
Spec Compliance: 91

What You Can Build

Regression gate in CI

Design an evaluation checklist and thresholds to block model changes that reduce quality.

Model comparison brief

Compare two model options using human ratings and automated scores for a decision memo.

Benchmark study plan

Create a benchmarking plan with datasets, metrics, and reporting structure.

Try These Prompts

Starter evaluation plan
Create a basic evaluation plan with 3 automated metrics and 2 human criteria for a customer support chatbot.
Metric selection guide
Recommend metrics for summarization, explain what each captures, and note one limitation per metric.
LLM judge prompt
Draft a pairwise LLM judge prompt to compare responses A and B for accuracy, helpfulness, and clarity.
A/B test analysis
Describe a statistical testing plan for A/B evaluation, including sample size guidance and effect size reporting.
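For the pairwise-judge prompt above, one practical detail is randomizing which response the judge sees first, since LLM judges often show position bias. A small sketch, with the template wording and function names invented for illustration:

```python
import random

JUDGE_TEMPLATE = """You are an impartial judge. Compare two responses to the
same question on accuracy, helpfulness, and clarity.

Question: {question}

Response 1: {first}

Response 2: {second}

Answer with exactly one word: "1", "2", or "tie"."""

def build_judge_prompt(question, response_a, response_b, rng=random):
    """Randomize presentation order to reduce position bias; return the
    prompt plus a mapping from the judge's answer back to A/B."""
    if rng.random() < 0.5:
        order = {"1": "A", "2": "B"}
        first, second = response_a, response_b
    else:
        order = {"1": "B", "2": "A"}
        first, second = response_b, response_a
    prompt = JUDGE_TEMPLATE.format(question=question, first=first, second=second)
    return prompt, order

prompt, order = build_judge_prompt("What is MRR?", "Answer A...", "Answer B...")
```

Running each comparison twice with the order swapped, and counting disagreements as ties, is a common way to further dampen position bias.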

Best Practices

  • Use multiple metrics and human review together
  • Test with representative and diverse data
  • Track baselines and statistical significance

Avoid

  • Relying on a single metric
  • Testing on training data
  • Ignoring variance in small samples

Frequently Asked Questions

Is this compatible with Claude and Codex?
Yes, the guidance is model-agnostic and applies to Claude, Codex, Claude Code, and other LLMs.
What are the limits of this skill?
It provides guidance and examples but no executable evaluation pipeline in this directory.
How do I integrate with my stack?
Map the metrics and workflows to your existing evaluation or CI tools.
Does it access or store my data?
No, it is static documentation and does not read or transmit data.
What if scores are unstable?
Increase sample size, review variance, and add human validation before decisions.
How is this different from a benchmark list?
It combines metrics, human review, and testing strategy rather than only listing benchmarks.

Developer Details

File structure

📄 SKILL.md