🧪

llm-evaluation

Safe 🌐 Network access ⚙️ External commands

Build reliable LLM evaluation plans

You need consistent ways to measure LLM quality and regressions. This skill provides metrics, human review guidance, and testing frameworks for reliable AI assessment.

Supports: Claude, Codex, Claude Code (CC)
📊 69 Adequate
1. Download the skill ZIP
2. Upload in Claude: go to Settings → Capabilities → Skills → Upload skill
3. Toggle on and start using

Test it

Using "llm-evaluation". Propose an evaluation plan for a RAG assistant.

Expected outcome:

  • Automated metrics: MRR, NDCG, Precision at K
  • Human ratings: accuracy, relevance, helpfulness
  • LLM judge: pairwise comparison for final answers
  • Regression rule: fail if accuracy drops more than 5 percent
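The retrieval metrics in the plan above can be sketched in pure Python. This is a minimal illustration, not part of the skill itself; the function names and the example ranking are made up for demonstration:

```python
import math

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank for one query: 1/rank of the first relevant hit."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with graded relevance given as {doc_id: gain}."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one query where only "d2" is relevant
ranked = ["d1", "d2", "d3", "d4"]
print(mrr(ranked, {"d2"}))                # 0.5
print(precision_at_k(ranked, {"d2"}, 2))  # 0.5
print(ndcg_at_k(ranked, {"d2": 1}, 4))    # ~0.63
```

Averaging these per-query scores over the whole test set gives the aggregate numbers the regression rule would compare.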

Using "llm-evaluation". What metrics should I use to evaluate a summarization model?

Expected outcome:

  • ROUGE for n-gram overlap with reference summaries
  • BERTScore for semantic similarity using embeddings
  • Factuality score to verify claims against source text
  • Human readability assessment for coherence and fluency
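To make the n-gram-overlap idea behind ROUGE concrete, here is a minimal ROUGE-N sketch in pure Python. Real evaluations typically use a dedicated library (e.g. the `rouge_score` package); this simplified version is for illustration only:

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: precision, recall, and F1 over n-gram overlap."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # precision = recall = 5/6: five of six unigrams overlap
```

BERTScore works differently: it matches tokens by embedding similarity rather than exact n-grams, so it credits paraphrases that ROUGE misses.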

Using "llm-evaluation". How do I detect if my model is regressing?

Expected outcome:

  • Store baseline scores from a reference model or previous version
  • Compare new model scores against baseline on same test set
  • Flag metrics where relative change exceeds your threshold
  • Run statistical significance test to confirm real differences
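The steps above can be combined into one check. The sketch below is illustrative: the 5% relative threshold, the paired bootstrap, and the example scores are all assumptions, not a prescribed pipeline:

```python
import random
import statistics

def regression_check(baseline, candidate, rel_threshold=0.05, n_boot=2000, seed=0):
    """Flag a regression when the candidate's mean score drops more than
    rel_threshold relative to baseline AND a paired bootstrap over the
    per-example differences says the drop is unlikely to be noise.
    Both inputs are per-example scores on the SAME test set."""
    base_mean = statistics.mean(baseline)
    cand_mean = statistics.mean(candidate)
    rel_change = (cand_mean - base_mean) / base_mean
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    # Paired bootstrap: how often does a resampled mean difference reach >= 0?
    hits = sum(
        statistics.mean([rng.choice(diffs) for _ in diffs]) >= 0
        for _ in range(n_boot)
    )
    p_no_drop = hits / n_boot  # small value => the drop is consistent
    regressed = rel_change < -rel_threshold and p_no_drop < 0.05
    return {"rel_change": rel_change, "p_no_drop": p_no_drop, "regressed": regressed}

# Hypothetical per-example accuracy scores for baseline vs. new model
baseline = [0.90, 0.80, 0.85, 0.90, 0.95, 0.88, 0.92, 0.90]
candidate = [0.70, 0.65, 0.70, 0.75, 0.72, 0.68, 0.74, 0.70]
print(regression_check(baseline, candidate))  # flags a clear regression
```

In CI, a check like this would run after evaluation and fail the build when `regressed` is true.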

Security Audit

Safe
v4 • 1/17/2026

This skill contains only static documentation (SKILL.md) with no executable files. All static findings are false positives: markdown code block backticks were misidentified as Ruby/shell command execution, and JSON metadata fields were misclassified as cryptographic issues. The skill provides evaluation guidance only with no data access, network activity, or command execution capability.

Files scanned: 2
Lines analyzed: 649
Findings: 2
Total audits: 4
Audited by: claude. View Audit History →

Quality Score

Architecture: 38
Maintainability: 100
Content: 85
Community: 21
Security: 100
Spec Compliance: 91

What You Can Build

Regression gate in CI

Design an evaluation checklist and thresholds to block model changes that reduce quality.

Model comparison brief

Compare two model options using human ratings and automated scores for a decision memo.

Benchmark study plan

Create a benchmarking plan with datasets, metrics, and reporting structure.

Try These Prompts

Starter evaluation plan
Create a basic evaluation plan with 3 automated metrics and 2 human criteria for a customer support chatbot.
Metric selection guide
Recommend metrics for summarization, explain what each captures, and note one limitation per metric.
LLM judge prompt
Draft a pairwise LLM judge prompt to compare responses A and B for accuracy, helpfulness, and clarity.
A/B test analysis
Describe a statistical testing plan for A/B evaluation, including sample size guidance and effect size reporting.
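For the pairwise-judge prompt above, one practical detail is randomizing which response the judge sees first, since LLM judges often show position bias. A small sketch, with the template wording and function names invented for illustration:

```python
import random

JUDGE_TEMPLATE = """You are an impartial judge. Compare two responses to the
same question on accuracy, helpfulness, and clarity.

Question: {question}

Response 1: {first}

Response 2: {second}

Answer with exactly one word: "1", "2", or "tie"."""

def build_judge_prompt(question, response_a, response_b, rng=random):
    """Randomize presentation order to reduce position bias; return the
    prompt plus a mapping from the judge's answer back to A/B."""
    if rng.random() < 0.5:
        order = {"1": "A", "2": "B"}
        first, second = response_a, response_b
    else:
        order = {"1": "B", "2": "A"}
        first, second = response_b, response_a
    prompt = JUDGE_TEMPLATE.format(question=question, first=first, second=second)
    return prompt, order

prompt, order = build_judge_prompt("What is MRR?", "Answer A...", "Answer B...")
```

Running each comparison twice with the order swapped, and counting disagreements as ties, is a common way to further dampen position bias.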

Best Practices

  • Use multiple metrics and human review together
  • Test with representative and diverse data
  • Track baselines and statistical significance

Avoid

  • Relying on a single metric
  • Testing on training data
  • Ignoring variance in small samples

Frequently Asked Questions

Is this compatible with Claude and Codex?
Yes, the guidance is model-agnostic and applies to Claude, Codex, Claude Code, and other LLMs.
What are the limits of this skill?
It provides guidance and examples but no executable evaluation pipeline in this directory.
How do I integrate with my stack?
Map the metrics and workflows to your existing evaluation or CI tools.
Does it access or store my data?
No, it is static documentation and does not read or transmit data.
What if scores are unstable?
Increase sample size, review variance, and add human validation before decisions.
How is this different from a benchmark list?
It combines metrics, human review, and testing strategy rather than only listing benchmarks.

Developer Details

File structure

📄 SKILL.md