📊

evaluation

Safe

Build Agent Evaluation Frameworks

Also available from: Asmayaseen, ChakshuGautam, muratcankoylan

Agents require different evaluation approaches than traditional software. This skill provides frameworks to measure agent quality, track improvements, and validate context engineering choices.

Supports: Claude Code (CC)
🥉 73 Bronze
1. Download the skill ZIP

2. Upload to Claude: go to Settings → Capabilities → Skills → Upload skill

3. Enable and get started

Try it

Using "evaluation".
Task: Answer user question about recent AI research
Agent Output: 'Recent advances include... [accurate summary of 3 papers with citations]'
Rubric: factual_accuracy=0.9, completeness=0.8, citation_accuracy=0.95

Expected result:

Evaluation Result:
- Factual Accuracy: 0.9 (Claims match ground truth)
- Completeness: 0.8 (Covers key aspects but missing one area)
- Citation Accuracy: 0.95 (All citations valid)
- Overall Score: 0.88
- Status: PASS (threshold: 0.7)
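The scoring above can be sketched in Python. This is a minimal sketch: the equal-weight average and the function name `evaluate_rubric` are assumptions, while the dimension names and the 0.7 threshold come from the example.

```python
def evaluate_rubric(scores: dict, threshold: float = 0.7) -> dict:
    """Aggregate per-dimension rubric scores into an overall pass/fail verdict."""
    # Equal weights assumed; real rubrics may weight dimensions differently.
    overall = round(sum(scores.values()) / len(scores), 2)
    return {"overall": overall, "status": "PASS" if overall >= threshold else "FAIL"}

result = evaluate_rubric(
    {"factual_accuracy": 0.9, "completeness": 0.8, "citation_accuracy": 0.95}
)
# overall 0.88, status PASS -- matching the result above
```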

Using "evaluation".
Test set with 4 complexity levels:
- Simple: factual lookup
- Medium: comparative analysis
- Complex: multi-step reasoning
- Very Complex: research synthesis

Expected result:

Complexity Stratification Results:
- Simple: 92% pass rate (23/25)
- Medium: 85% pass rate (17/20)
- Complex: 68% pass rate (17/25)
- Very Complex: 45% pass rate (9/20)
Insight: Performance degrades significantly above 'complex' level
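The per-level pass rates above can be computed with a small helper. `stratified_pass_rates` and the record shape are hypothetical, assuming each test result carries a complexity level and a boolean outcome.

```python
from collections import defaultdict

def stratified_pass_rates(results):
    """Group results by complexity level and compute the pass rate per level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [passed, total]
    for r in results:
        counts[r["level"]][1] += 1
        counts[r["level"]][0] += r["passed"]  # True counts as 1, False as 0
    return {level: passed / total for level, (passed, total) in counts.items()}

rates = stratified_pass_rates(
    [{"level": "simple", "passed": True}] * 23
    + [{"level": "simple", "passed": False}] * 2
)
# rates["simple"] == 0.92, matching the 23/25 pass rate above
```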

Using "evaluation".
Compare context strategies: full context vs. summarized vs. RAG, over the same test set of 50 queries.

Expected result:

Context Strategy Comparison:
| Strategy   | Avg Score | Tokens Used | Efficiency |
|------------|-----------|-------------|------------|
| Full       | 0.85      | 45,000      | 0.019      |
| Summarized | 0.72      | 12,000      | 0.060      |
| RAG        | 0.78      | 8,000       | 0.098      |
Recommendation: RAG provides best efficiency for factual tasks
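The efficiency column appears to be average score per 1,000 tokens (e.g. 0.78 / 8 ≈ 0.098 for RAG). A sketch of that comparison, with the strategy data taken from the table above; the `efficiency` helper is an assumption.

```python
def efficiency(avg_score, tokens_used):
    """Average score per 1,000 tokens: quality delivered per token spent."""
    return avg_score / (tokens_used / 1000)

# (avg_score, tokens_used) per strategy, from the comparison table
strategies = {"Full": (0.85, 45_000), "Summarized": (0.72, 12_000), "RAG": (0.78, 8_000)}
ranked = sorted(strategies, key=lambda s: efficiency(*strategies[s]), reverse=True)
# ranked == ["RAG", "Summarized", "Full"]
```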

Security audit

Safe
v1 • 2/24/2026

All 29 static findings are false positives. The skill is documentation-only with Python code examples. External commands detection was triggered by markdown code block formatting. Network detection was triggered by a source URL in YAML frontmatter. Weak cryptographic algorithm detection was triggered by the word 'token' (LLM tokens). System/network reconnaissance detection was triggered by the word 'exploration' in evaluation methodology context.

- Files scanned: 1
- Lines analyzed: 239
- Findings: 0
- Total audits: 1

No security issues found

Detected patterns

- Markdown code block formatting (false positive)
- Source URL in YAML frontmatter (false positive)
- Token usage in LLM context (false positive)
- Exploration in evaluation context (false positive)
Audited by: claude

Quality rating

- Architecture: 38
- Maintainability: 95
- Content: 87
- Community: 50
- Security: 100
- Spec compliance: 91

What you can build

Build Quality Gates for Agent Pipelines

Create evaluation frameworks that run automatically before agent deployment to catch regressions and ensure minimum quality thresholds.

Compare Agent Configurations

Systematically compare different agent architectures, model choices, and context strategies using structured evaluation to identify optimal configurations.

Validate Context Engineering Choices

Test different context strategies, degradation patterns, and optimization techniques to ensure context engineering decisions improve agent quality.

Try these prompts

Basic Agent Response Evaluation
Evaluate the following agent response against the rubric dimensions. Rate each dimension from 0 to 1 and provide specific feedback.

Task: {task_description}
Agent Output: {agent_output}
Expected: {ground_truth_if_available}

Evaluate: factual_accuracy, completeness, citation_accuracy, tool_efficiency

Compare Two Agent Responses
Compare two agent responses for the same task. Identify which performs better on each dimension and explain why.

Task: {task_description}
Response A: {response_a}
Response B: {response_b}

Provide: dimension-by-dimension comparison, overall winner, specific reasoning

Evaluate Extended Interaction
Evaluate a multi-turn agent interaction for overall quality. Consider task completion, consistency, efficiency, and user experience.

Task: {task_description}
Conversation: {full_transcript}

Evaluate: outcome_success, process_quality, efficiency, coherence, user_satisfaction

Generate Evaluation Report
Analyze the evaluation results below and generate a summary report with trends, patterns, and recommendations.

Results: {evaluation_results_json}

Include: pass_rate, average_scores_per_dimension, notable failure patterns, improvement recommendations
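These templates can be driven programmatically. A minimal LLM-as-judge harness, assuming a caller-supplied `call_llm` function and a judge model that returns JSON; both the function name and the JSON contract are assumptions, not part of the skill.

```python
import json

JUDGE_TEMPLATE = """Evaluate the following agent response against the rubric dimensions.
Rate each dimension from 0 to 1 and provide specific feedback.

Task: {task}
Agent Output: {output}

Evaluate: factual_accuracy, completeness, citation_accuracy
Return JSON: {{"scores": {{"<dimension>": <0-1>}}, "feedback": "<text>"}}"""

def judge(task, output, call_llm):
    """Fill the judge template and parse the model's JSON verdict."""
    return json.loads(call_llm(JUDGE_TEMPLATE.format(task=task, output=output)))

# Stubbed call_llm for illustration; swap in a real model call.
verdict = judge(
    task="Answer user question about recent AI research",
    output="Recent advances include...",
    call_llm=lambda p: '{"scores": {"factual_accuracy": 0.9}, "feedback": "solid"}',
)
```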

Best practices

  • Start with small test sets during development - early agent changes show large effects
  • Use multi-dimensional rubrics rather than single metrics to capture agent quality
  • Supplement automated LLM evaluation with human review for edge cases
  • Track evaluation metrics over time to identify trends and regressions
  • Set pass thresholds based on your specific use case requirements, not arbitrary numbers

Avoid

  • Evaluating specific execution paths instead of outcomes - agents may find valid alternative routes
  • Using only simple test cases - complex interactions reveal different failure modes
  • Ignoring token efficiency - unlimited context masks real-world performance issues
  • Setting thresholds too high and blocking all agent changes, or too low and releasing poor quality

Frequently asked questions

What is LLM-as-judge and how does it work?
LLM-as-judge uses a language model to evaluate agent outputs rather than human reviewers. You provide the task, agent output, and evaluation criteria, and the judge model returns structured scores and feedback. It scales to large test sets while maintaining consistent evaluation standards.

How do I determine appropriate pass/fail thresholds?
Set thresholds based on your use case requirements. For high-stakes applications, use 0.8 or higher. For exploratory or prototype agents, 0.5-0.6 may be acceptable. Start conservative and adjust based on production quality feedback.

How many test cases do I need?
Start with 20-50 cases during development. Early in agent development, small samples reveal large effects. Expand to hundreds of cases before production deployment. Sample from real usage patterns and include known edge cases.

Can this skill execute agents for evaluation?
No, this skill provides evaluation methodology, rubric designs, and code examples. You integrate it with your existing agent execution framework to run evaluations automatically.

How do I handle non-deterministic agent behavior?
Run each test case multiple times (3-5 iterations) to capture variance. Report both average scores and standard deviation. Accept that some variation is normal for agent systems and focus on overall trends.

What dimensions should my evaluation rubric include?
At minimum: factual accuracy, completeness, and tool efficiency. Add citation accuracy for research tasks, coherence for multi-turn interactions, and efficiency metrics when token usage matters. Weight dimensions based on your use case priorities.
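The repeated-runs advice for non-deterministic agents can be sketched as follows; `run_once` stands in for one full agent execution plus evaluation, stubbed here with fixed scores for a deterministic illustration.

```python
import statistics

def repeated_eval(run_once, n=5):
    """Run a non-deterministic evaluation n times; report mean and spread."""
    scores = [run_once() for _ in range(n)]
    return {"mean": round(statistics.mean(scores), 3),
            "stdev": round(statistics.stdev(scores), 3)}

fake_scores = iter([0.80, 0.85, 0.90, 0.80, 0.85])  # deterministic stand-in
summary = repeated_eval(lambda: next(fake_scores))
# mean 0.84, stdev 0.042
```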

Developer details

File structure

📄 SKILL.md