llm-evaluation
Evaluate LLM Applications with Comprehensive Metrics
Also available from: wshobson
Measuring LLM performance is complex and error-prone. This skill provides systematic evaluation frameworks combining automated metrics, human judgment, and statistical testing to validate AI application quality.
Download the skill ZIP
Upload it to Claude
Go to Settings → Capabilities → Skills → Upload skill
Enable it and get started
Try it out
Using "llm-evaluation". Evaluate a summarization model using ROUGE metrics
Expected result:
ROUGE-1: 0.72, ROUGE-2: 0.58, ROUGE-L: 0.65 - Strong performance on unigram overlap with moderate bigram coherence
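To make the unigram-overlap idea behind ROUGE-1 concrete, here is a minimal pure-Python sketch. It computes clipped unigram precision, recall, and F1; for real evaluations you would likely use a dedicated package (such as Google's `rouge-score`) that also handles stemming, ROUGE-2, and ROUGE-L.

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Minimal ROUGE-1 sketch: clipped unigram overlap precision/recall/F1."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears
    # in the reference (Counter intersection takes element-wise minimums).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(round(scores["f1"], 2))  # 5 of 6 unigrams overlap -> F1 = 0.83
```

The clipping step matters: without it, a candidate that repeats a common word many times would inflate its overlap score.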
Using "llm-evaluation". Compare two responses using LLM-as-Judge
Expected result:
Winner: Response B (confidence: 8/10). Response B provides more accurate citations and better structured arguments, though both answers address the core question adequately.
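An LLM-as-Judge pipeline typically has two halves: building a judging prompt and parsing the judge model's free-text verdict into a structured result. The sketch below illustrates both; the template wording and the `Winner:`/`Confidence:` reply format are illustrative assumptions, not the skill's actual prompt.

```python
import re

# Hypothetical judge prompt; the exact wording and reply format are assumptions.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the question below.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Reply with exactly two lines: "Winner: A" or "Winner: B", then "Confidence: <1-10>"."""

def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )

def parse_verdict(judge_output: str) -> tuple[str, int]:
    """Parse the judge model's reply into (winner, confidence)."""
    winner = re.search(r"Winner:\s*([AB])", judge_output)
    conf = re.search(r"Confidence:\s*(\d+)", judge_output)
    if not winner or not conf:
        raise ValueError(f"Unparseable judge output: {judge_output!r}")
    return winner.group(1), int(conf.group(1))

print(parse_verdict("Winner: B\nConfidence: 8"))  # ('B', 8)
```

In practice you would also randomize the A/B ordering across calls to control for the position bias judge models are known to exhibit.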
Using "llm-evaluation". Analyze A/B test results for statistical significance
Expected result:
Variant B shows 12 percent improvement over A with p-value 0.03. Result is statistically significant at alpha=0.05 with medium effect size (Cohen's d=0.54).
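The effect-size part of that analysis can be sketched with the standard library alone. Below is a pooled-standard-deviation Cohen's d between two samples of evaluation scores; the sample values are made up for illustration, and for the p-value you would use a proper test such as `scipy.stats.ttest_ind` rather than hand-rolling one.

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d effect size between two score samples (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_sd = (
        ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    ) ** 0.5
    return (mean(b) - mean(a)) / pooled_sd

# Illustrative per-example scores from two prompt variants (made-up data).
variant_a = [0.61, 0.58, 0.64, 0.60, 0.59]
variant_b = [0.68, 0.66, 0.71, 0.65, 0.70]
print(round(cohens_d(variant_a, variant_b), 2))
```

Conventional rules of thumb read d ≈ 0.2 as small, 0.5 as medium, and 0.8 as large, which is how the "medium effect size (Cohen's d=0.54)" verdict above should be interpreted.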
Security audit
This skill is documentation-only and contains Python code examples for LLM evaluation. All static-analysis findings are false positives: Python code blocks were misidentified as Ruby/shell commands, and dictionary keys were incorrectly flagged as cryptographic operations. No executable code or security risks were detected.
Quality assessment
What you can build
ML Engineer Validating Model Changes
Run comprehensive evaluation suites before deploying prompt or model updates to catch performance regressions early.
Product Team Comparing AI Vendors
Benchmark multiple LLM providers on domain-specific tasks to make data-driven vendor selection decisions.
Research Team Publishing Results
Generate statistically rigorous evaluation results with proper metrics and significance testing for academic publications.
Try these prompts
I need to evaluate an LLM that generates customer support responses. What metrics should I use and how do I implement them?
Create an evaluation suite for my RAG application that measures accuracy, groundedness, and retrieval quality. Include both automated and human evaluation components.
I have evaluation scores from two prompt variants: Variant A [scores] and Variant B [scores]. Determine if the difference is statistically significant and calculate effect size.
Design a CI/CD integration that runs regression detection on every model update, alerts on performance drops above 5 percent, and generates comparison reports against baseline.
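The regression-detection step from the last prompt can be sketched in a few lines: compare a candidate run's metrics against a stored baseline and emit an alert for any metric whose relative drop exceeds the threshold. The metric names and scores below are illustrative assumptions.

```python
def detect_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.05,  # alert on drops above 5 percent, per the prompt
) -> list[str]:
    """Flag every metric whose relative drop vs. baseline exceeds `threshold`."""
    alerts = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            alerts.append(f"{metric}: missing from candidate run")
            continue
        drop = (base_score - new_score) / base_score
        if drop > threshold:
            alerts.append(
                f"{metric}: dropped {drop:.1%} ({base_score:.3f} -> {new_score:.3f})"
            )
    return alerts

# Made-up example: ROUGE-L regressed past the threshold, groundedness improved.
baseline = {"rouge_l": 0.65, "groundedness": 0.90}
candidate = {"rouge_l": 0.60, "groundedness": 0.91}
print(detect_regressions(baseline, candidate))
```

In a CI/CD pipeline this function would run after the evaluation suite, with a non-empty return value failing the build and feeding the comparison report.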
Best practices
- Use multiple complementary metrics rather than optimizing for a single score
- Always establish baseline performance before measuring improvements
- Combine automated metrics with human evaluation for comprehensive assessment
Avoid
- Drawing conclusions from evaluation on too few test examples
- Using evaluation metrics that do not align with business objectives
- Testing on data that overlaps with training data (data contamination)