evaluation
Build Agent Evaluation Frameworks
Also available from: Asmayaseen, ChakshuGautam, muratcankoylan
Agents require different evaluation approaches than traditional software. This skill provides frameworks to measure agent quality, track improvements, and validate context engineering choices.
Download the skill ZIP
Import into Claude
Go to Settings → Capabilities → Skills → Import a skill
Enable and start using it
Test
Using "evaluation".
Task: Answer a user question about recent AI research
Agent Output: 'Recent advances include... [accurate summary of 3 papers with citations]'
Rubric: factual_accuracy=0.9, completeness=0.8, citation_accuracy=0.95
Expected result:
Evaluation Result:
- Factual Accuracy: 0.9 (Claims match ground truth)
- Completeness: 0.8 (Covers key aspects but missing one area)
- Citation Accuracy: 0.95 (All citations valid)
- Overall Score: 0.88
- Status: PASS (threshold: 0.7)
Using "evaluation".
Test set with 4 complexity levels:
- Simple: factual lookup
- Medium: comparative analysis
- Complex: multi-step reasoning
- Very Complex: research synthesis
Expected result:
Complexity Stratification Results:
- Simple: 92% pass rate (23/25)
- Medium: 85% pass rate (17/20)
- Complex: 68% pass rate (17/25)
- Very Complex: 45% pass rate (9/20)
Insight: Performance degrades significantly above 'complex' level
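Computing stratified pass rates like the ones above is straightforward once each test result is tagged with its complexity level. A minimal sketch, assuming results arrive as (complexity, passed) pairs; the function name is illustrative:

```python
# Sketch: pass rate per complexity stratum from tagged test results.
from collections import defaultdict

def stratified_pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each complexity level to its fraction of passing tests."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for complexity, ok in results:
        total[complexity] += 1
        passed[complexity] += ok
    return {c: round(passed[c] / total[c], 2) for c in total}

results = ([("simple", True)] * 23 + [("simple", False)] * 2 +
           [("complex", True)] * 17 + [("complex", False)] * 8)
rates = stratified_pass_rates(results)
# simple: 23/25 = 0.92, complex: 17/25 = 0.68
```

Comparing these rates across runs makes it easy to spot the complexity level where performance falls off a cliff.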
Using "evaluation".
Compare context strategies: Full context vs. Summarized vs. RAG
Same test set of 50 queries
Expected result:
Context Strategy Comparison:
| Strategy | Avg Score | Tokens Used | Efficiency |
| --- | --- | --- | --- |
| Full | 0.85 | 45,000 | 0.019 |
| Summarized | 0.72 | 12,000 | 0.060 |
| RAG | 0.78 | 8,000 | 0.098 |
Recommendation: RAG provides best efficiency for factual tasks
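The efficiency column in the table works out to average score per 1,000 tokens consumed. A sketch of that metric, with the function name as an assumption:

```python
# Sketch of the efficiency metric implied by the table: quality per 1,000
# tokens of context. Higher is better.
def efficiency(avg_score: float, tokens_used: int) -> float:
    """Average evaluation score delivered per 1,000 tokens."""
    return round(avg_score / (tokens_used / 1000), 3)

assert efficiency(0.85, 45_000) == 0.019  # Full context
assert efficiency(0.72, 12_000) == 0.060  # Summarized
assert efficiency(0.78, 8_000) == 0.098   # RAG
```

Note that efficiency alone can mislead: Summarized beats Full on efficiency while scoring 0.13 lower, so pair this metric with an absolute quality floor.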
Security audit
Safe. All 29 static findings are false positives. The skill is documentation-only with Python code examples. External-commands detection was triggered by markdown code-block formatting. Network detection was triggered by a source URL in the YAML frontmatter. Weak-cryptographic-algorithm detection was triggered by the word 'token' (LLM tokens). System/network reconnaissance detection was triggered by the word 'exploration' in the evaluation methodology.
What you can build
Build Quality Gates for Agent Pipelines
Create evaluation frameworks that run automatically before agent deployment to catch regressions and ensure minimum quality thresholds.
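A quality gate of this kind can be reduced to one check: enough evaluation runs must clear the score threshold before deployment proceeds. A minimal sketch under assumed names and thresholds; it is not the skill's own implementation:

```python
# Sketch of a pre-deployment quality gate: block the release unless a
# minimum fraction of evaluation runs meets the per-run score threshold.
def quality_gate(scores: list[float],
                 min_pass_rate: float = 0.8,
                 pass_threshold: float = 0.7) -> bool:
    """Return True if the agent version is good enough to deploy."""
    pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
    return pass_rate >= min_pass_rate

assert quality_gate([0.9, 0.85, 0.72, 0.65, 0.88])      # 4/5 pass -> deploy
assert not quality_gate([0.9, 0.5, 0.6, 0.88])          # 2/4 pass -> block
```

Wired into CI, a gate like this turns evaluation from an occasional manual check into an automatic regression barrier.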
Compare Agent Configurations
Systematically compare different agent architectures, model choices, and context strategies using structured evaluation to identify optimal configurations.
Validate Context Engineering Choices
Test different context strategies, degradation patterns, and optimization techniques to ensure context engineering decisions improve agent quality.
Try these prompts
Evaluate the following agent response against the rubric dimensions. Rate each dimension from 0 to 1 and provide specific feedback.
Task: {task_description}
Agent Output: {agent_output}
Expected: {ground_truth_if_available}
Evaluate: factual_accuracy, completeness, citation_accuracy, tool_efficiency

Compare two agent responses for the same task. Identify which performs better on each dimension and explain why.
Task: {task_description}
Response A: {response_a}
Response B: {response_b}
Provide: dimension-by-dimension comparison, overall winner, specific reasoning

Evaluate a multi-turn agent interaction for overall quality. Consider task completion, consistency, efficiency, and user experience.
Task: {task_description}
Conversation: {full_transcript}
Evaluate: outcome_success, process_quality, efficiency, coherence, user_satisfaction

Analyze the evaluation results below and generate a summary report with trends, patterns, and recommendations.
Results: {evaluation_results_json}
Include: pass_rate, average_scores_per_dimension, notable failure patterns, improvement recommendations

Best practices
- Start with small test sets during development - early agent changes show large effects
- Use multi-dimensional rubrics rather than single metrics to capture agent quality
- Supplement automated LLM evaluation with human review for edge cases
- Track evaluation metrics over time to identify trends and regressions
- Set pass thresholds based on your specific use case requirements, not arbitrary numbers
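Tracking metrics over time, as recommended above, can start as simply as flagging runs whose score drops well below the recent baseline. A sketch with assumed names and an illustrative tolerance:

```python
# Sketch of regression tracking: flag a run as a regression when its score
# falls more than `tolerance` below the mean of recent run scores.
def is_regression(history: list[float], latest: float,
                  tolerance: float = 0.05) -> bool:
    """Compare the latest average score against the recent baseline."""
    baseline = sum(history) / len(history)
    return latest < baseline - tolerance

assert not is_regression([0.82, 0.84, 0.83], 0.80)  # within tolerance
assert is_regression([0.82, 0.84, 0.83], 0.75)      # genuine drop
```

A tolerance band like this keeps normal run-to-run noise from triggering false alarms while still catching real quality drops.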
Avoid
- Evaluating specific execution paths instead of outcomes - agents may find valid alternative routes
- Using only simple test cases - complex interactions reveal different failure modes
- Ignoring token efficiency - unlimited context masks real-world performance issues
- Setting thresholds too high and blocking all agent changes, or too low and releasing poor quality