
llm-evaluation

Safe

Evaluate LLM Applications with Comprehensive Metrics

Also available from: wshobson

Measuring LLM performance is complex and error-prone. This skill provides systematic evaluation frameworks combining automated metrics, human judgment, and statistical testing to validate AI application quality.

Supports: Claude Code (CC)
🥉 74 Bronze
1. Download the skill ZIP

2. Upload in Claude

Go to Settings → Capabilities → Skills → Upload skill

3. Enable and start using it

Try it

Use "llm-evaluation": Evaluate a summarization model using ROUGE metrics

Expected result:

ROUGE-1: 0.72, ROUGE-2: 0.58, ROUGE-L: 0.65 - Strong performance on unigram overlap with moderate bigram coherence
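For reference, ROUGE-1 F1 can be sketched in a few lines of plain Python (production evaluations typically use the `rouge-score` package; the sample sentences below are purely illustrative):

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())           # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 candidate unigrams also appear in the reference -> F1 ≈ 0.83
score = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

ROUGE-2 and ROUGE-L follow the same precision/recall pattern over bigrams and longest common subsequences respectively.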

Use "llm-evaluation": Compare two responses using LLM-as-Judge

Expected result:

Winner: Response B (confidence: 8/10). Response B provides more accurate citations and better structured arguments, though both answers address the core question adequately.
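A pairwise LLM-as-Judge comparison like this can be sketched as a prompt template plus a verdict parser; `judge_fn` below is a hypothetical stand-in for whatever client call sends the prompt to the judge model:

```python
import re

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses below.
Question: {question}
Response A: {a}
Response B: {b}
Reply exactly as: Winner: <A|B|Tie> (confidence: <1-10>/10), then one sentence of reasoning."""

def judge_pair(question, a, b, judge_fn):
    """judge_fn: any callable that sends a prompt to an LLM and returns its reply text."""
    verdict = judge_fn(JUDGE_TEMPLATE.format(question=question, a=a, b=b))
    m = re.search(r"Winner:\s*(A|B|Tie)\b.*?confidence:\s*(\d+)\s*/\s*10", verdict, re.I)
    if m is None:
        return None  # unparseable verdict; callers may retry or flag for review
    return {"winner": m.group(1).upper(), "confidence": int(m.group(2))}

# Stubbed judge for illustration; swap in a real model call.
stub = lambda prompt: "Winner: B (confidence: 8/10). B cites its sources."
judge_pair("What causes tides?", "answer one", "answer two", stub)
# -> {'winner': 'B', 'confidence': 8}
```

Constraining the judge to a fixed output format, as the template does, is what makes the verdict machine-parseable; randomizing A/B position across calls helps counter positional bias.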

Use "llm-evaluation": Analyze A/B test results for statistical significance

Expected result:

Variant B shows 12 percent improvement over A with p-value 0.03. Result is statistically significant at alpha=0.05 with medium effect size (Cohen's d=0.54).
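A significance check like this can be sketched with the standard library alone, using a normal-approximation two-sample test (reasonable at the 100+ sample sizes recommended below; real analyses often reach for `scipy.stats` instead):

```python
from statistics import NormalDist, mean, stdev

def ab_test(scores_a, scores_b):
    """Two-sided z-test on the difference in means, plus Cohen's d effect size."""
    ma, mb = mean(scores_a), mean(scores_b)
    sa, sb = stdev(scores_a), stdev(scores_b)
    na, nb = len(scores_a), len(scores_b)
    se = (sa**2 / na + sb**2 / nb) ** 0.5          # standard error of the difference
    z = (mb - ma) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    pooled = (((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)) ** 0.5
    return {
        "lift": (mb - ma) / ma,                    # relative improvement of B over A
        "p_value": p,
        "cohens_d": (mb - ma) / pooled,            # effect size in pooled-SD units
    }

# Illustrative per-example scores for two prompt variants.
a = [0.4, 0.5, 0.6] * 40
b = [0.5, 0.6, 0.7] * 40
result = ab_test(a, b)   # 20% lift, p well below 0.05, large effect size
```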

Security Audit

Safe
v1 • 2/25/2026

This skill is documentation-only containing Python code examples for LLM evaluation. All static analysis findings are false positives: Python code blocks were misidentified as Ruby/shell commands, and dictionary keys were incorrectly flagged as cryptographic operations. No executable code or security risks detected.

Files scanned: 1
Lines analyzed: 486
Findings: 0
Total audits: 1
No security issues found
Audited by: claude

Quality Score

Architecture: 38
Maintainability: 100
Content: 87
Community: 50
Security: 100
Spec compliance: 91

What You Can Build

ML Engineer Validating Model Changes

Run comprehensive evaluation suites before deploying prompt or model updates to catch performance regressions early.

Product Team Comparing AI Vendors

Benchmark multiple LLM providers on domain-specific tasks to make data-driven vendor selection decisions.

Research Team Publishing Results

Generate statistically rigorous evaluation results with proper metrics and significance testing for academic publications.

Try These Prompts

Basic Metric Selection
I need to evaluate an LLM that generates customer support responses. What metrics should I use and how do I implement them?
Build Evaluation Suite
Create an evaluation suite for my RAG application that measures accuracy, groundedness, and retrieval quality. Include both automated and human evaluation components.
A/B Test Analysis
I have evaluation scores from two prompt variants: Variant A [scores] and Variant B [scores]. Determine if the difference is statistically significant and calculate effect size.
Production Evaluation Pipeline
Design a CI/CD integration that runs regression detection on every model update, alerts on performance drops above 5 percent, and generates comparison reports against baseline.
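The regression-detection step described in the last prompt can be sketched as a simple threshold gate; the metric names and the 5 percent threshold below are illustrative:

```python
def regression_gate(baseline, current, threshold=0.05):
    """Return alert strings for metrics that dropped more than `threshold` vs baseline."""
    alerts = []
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None:
            continue                                # metric not measured this run
        drop = (base - cur) / base
        if drop > threshold:
            alerts.append(f"{metric}: {base:.3f} -> {cur:.3f} ({drop:.0%} drop)")
    return alerts

# Illustrative metrics: ROUGE-L fell ~11%, groundedness held steady.
baseline = {"rouge_l": 0.65, "groundedness": 0.90}
current = {"rouge_l": 0.58, "groundedness": 0.91}
alerts = regression_gate(baseline, current)         # one alert, for rouge_l
```

A CI/CD job would fail the build (or page a human) whenever the returned list is non-empty.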

Best Practices

  • Use multiple complementary metrics rather than optimizing for a single score
  • Always establish baseline performance before measuring improvements
  • Combine automated metrics with human evaluation for comprehensive assessment

Avoid

  • Drawing conclusions from evaluation on too few test examples
  • Using evaluation metrics that do not align with business objectives
  • Testing on data that overlaps with training data (data contamination)

Frequently Asked Questions

What is the minimum sample size for reliable LLM evaluation?
For statistical significance testing, aim for at least 100 evaluation examples. For high-stakes decisions, 500-1000 examples provide more reliable results with narrower confidence intervals.
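The effect of sample size on precision can be sketched with a normal-approximation confidence interval for a pass rate (a simplification; exact methods such as Wilson intervals are tighter at small n):

```python
from statistics import NormalDist

def ci_half_width(p_hat, n, confidence=0.95):
    """Half-width of a normal-approximation CI for an observed pass rate p_hat on n examples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * (p_hat * (1 - p_hat) / n) ** 0.5

# At a 70% pass rate: roughly ±9 points with 100 examples, ±3 points with 1000.
small, large = ci_half_width(0.7, 100), ci_half_width(0.7, 1000)
```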
How do I choose between automated metrics and human evaluation?
Use automated metrics for fast iteration and regression detection. Add human evaluation for final validation, especially when assessing subjective qualities like helpfulness, safety, or nuanced correctness.
Can LLM-as-Judge replace human evaluators entirely?
LLM-as-Judge works well for routine quality checks and scales efficiently, but human evaluation remains essential for complex judgments, safety assessment, and validating the judge model itself.
How often should I re-run evaluations on my LLM application?
Run evaluations on every code or prompt change as part of CI/CD. For production monitoring, run daily or weekly evaluations on fresh samples to detect drift or performance degradation.
What should I do when metrics disagree with each other?
Metric disagreement often reveals trade-offs. Investigate which metric aligns best with your actual goals through error analysis, and consider using a weighted composite score reflecting business priorities.
How do I evaluate multi-turn conversations?
Use conversation-level metrics like task completion rate and user satisfaction alongside turn-level metrics. Consider coherence across turns and whether the model maintains context appropriately throughout the dialogue.

Developer Details

Author

sickn33

License

MIT

Ref

main

File Structure

📄 SKILL.md