evaluation
Build Agent Evaluation Frameworks
Also available from: Asmayaseen, ChakshuGautam, muratcankoylan
Agents require different evaluation approaches than traditional software. This skill provides frameworks to measure agent quality, track improvements, and validate context engineering choices.
Download the skill ZIP
Import into Claude
Go to Settings → Capabilities → Skills → Import a skill
Enable and start using it
Test
Using "evaluation".
Task: Answer a user question about recent AI research
Agent Output: 'Recent advances include... [accurate summary of 3 papers with citations]'
Rubric: factual_accuracy=0.9, completeness=0.8, citation_accuracy=0.95
Expected result:
Evaluation Result:
- Factual Accuracy: 0.9 (Claims match ground truth)
- Completeness: 0.8 (Covers key aspects but missing one area)
- Citation Accuracy: 0.95 (All citations valid)
- Overall Score: 0.88
- Status: PASS (threshold: 0.7)
Using "evaluation".
Test set with 4 complexity levels:
- Simple: factual lookup
- Medium: comparative analysis
- Complex: multi-step reasoning
- Very Complex: research synthesis
Expected result:
Complexity Stratification Results:
- Simple: 92% pass rate (23/25)
- Medium: 85% pass rate (17/20)
- Complex: 68% pass rate (17/25)
- Very Complex: 45% pass rate (9/20)
Insight: Performance degrades significantly above 'complex' level
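Computing stratified pass rates like the ones above is straightforward once each test result is tagged with its complexity level. A minimal sketch, assuming results arrive as (complexity, passed) pairs; the function name is illustrative:

```python
# Sketch: pass rate per complexity stratum from tagged test results.
from collections import defaultdict

def stratified_pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each complexity level to its fraction of passing tests."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for complexity, ok in results:
        total[complexity] += 1
        passed[complexity] += ok
    return {c: round(passed[c] / total[c], 2) for c in total}

results = ([("simple", True)] * 23 + [("simple", False)] * 2 +
           [("complex", True)] * 17 + [("complex", False)] * 8)
rates = stratified_pass_rates(results)
# simple: 23/25 = 0.92, complex: 17/25 = 0.68
```

Comparing these rates across runs makes it easy to spot the complexity level where performance falls off a cliff.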
Using "evaluation".
Compare context strategies: Full context vs. Summarized vs. RAG
Same test set of 50 queries
Expected result:
Context Strategy Comparison:
| Strategy | Avg Score | Tokens Used | Efficiency |
| --- | --- | --- | --- |
| Full | 0.85 | 45,000 | 0.019 |
| Summarized | 0.72 | 12,000 | 0.060 |
| RAG | 0.78 | 8,000 | 0.098 |
Recommendation: RAG provides best efficiency for factual tasks
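The efficiency column in the table works out to average score per 1,000 tokens consumed. A sketch of that metric, with the function name as an assumption:

```python
# Sketch of the efficiency metric implied by the table: quality per 1,000
# tokens of context. Higher is better.
def efficiency(avg_score: float, tokens_used: int) -> float:
    """Average evaluation score delivered per 1,000 tokens."""
    return round(avg_score / (tokens_used / 1000), 3)

assert efficiency(0.85, 45_000) == 0.019  # Full context
assert efficiency(0.72, 12_000) == 0.060  # Summarized
assert efficiency(0.78, 8_000) == 0.098   # RAG
```

Note that efficiency alone can mislead: Summarized beats Full on efficiency while scoring 0.13 lower, so pair this metric with an absolute quality floor.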
Security audit
Safe. All 29 static findings are false positives. The skill is documentation-only with Python code examples. External-commands detection was triggered by markdown code-block formatting. Network detection was triggered by a source URL in the YAML frontmatter. Weak-cryptographic-algorithm detection was triggered by the word 'token' (LLM tokens). System/network reconnaissance detection was triggered by the word 'exploration' in the evaluation methodology.
What you can build
Build Quality Gates for Agent Pipelines
Create evaluation frameworks that run automatically before agent deployment to catch regressions and ensure minimum quality thresholds.
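A quality gate of this kind can be reduced to one check: enough evaluation runs must clear the score threshold before deployment proceeds. A minimal sketch under assumed names and thresholds; it is not the skill's own implementation:

```python
# Sketch of a pre-deployment quality gate: block the release unless a
# minimum fraction of evaluation runs meets the per-run score threshold.
def quality_gate(scores: list[float],
                 min_pass_rate: float = 0.8,
                 pass_threshold: float = 0.7) -> bool:
    """Return True if the agent version is good enough to deploy."""
    pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
    return pass_rate >= min_pass_rate

assert quality_gate([0.9, 0.85, 0.72, 0.65, 0.88])      # 4/5 pass -> deploy
assert not quality_gate([0.9, 0.5, 0.6, 0.88])          # 2/4 pass -> block
```

Wired into CI, a gate like this turns evaluation from an occasional manual check into an automatic regression barrier.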
Compare Agent Configurations
Systematically compare different agent architectures, model choices, and context strategies using structured evaluation to identify optimal configurations.
Validate Context Engineering Choices
Test different context strategies, degradation patterns, and optimization techniques to ensure context engineering decisions improve agent quality.
Try these prompts
Evaluate the following agent response against the rubric dimensions. Rate each dimension from 0 to 1 and provide specific feedback.
Task: {task_description}
Agent Output: {agent_output}
Expected: {ground_truth_if_available}
Evaluate: factual_accuracy, completeness, citation_accuracy, tool_efficiency

Compare two agent responses for the same task. Identify which performs better on each dimension and explain why.
Task: {task_description}
Response A: {response_a}
Response B: {response_b}
Provide: dimension-by-dimension comparison, overall winner, specific reasoning

Evaluate a multi-turn agent interaction for overall quality. Consider task completion, consistency, efficiency, and user experience.
Task: {task_description}
Conversation: {full_transcript}
Evaluate: outcome_success, process_quality, efficiency, coherence, user_satisfaction

Analyze the evaluation results below and generate a summary report with trends, patterns, and recommendations.
Results: {evaluation_results_json}
Include: pass_rate, average_scores_per_dimension, notable failure patterns, improvement recommendations

Best practices
- Start with small test sets during development - early agent changes show large effects
- Use multi-dimensional rubrics rather than single metrics to capture agent quality
- Supplement automated LLM evaluation with human review for edge cases
- Track evaluation metrics over time to identify trends and regressions
- Set pass thresholds based on your specific use case requirements, not arbitrary numbers
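Tracking metrics over time, as recommended above, can start as simply as flagging runs whose score drops well below the recent baseline. A sketch with assumed names and an illustrative tolerance:

```python
# Sketch of regression tracking: flag a run as a regression when its score
# falls more than `tolerance` below the mean of recent run scores.
def is_regression(history: list[float], latest: float,
                  tolerance: float = 0.05) -> bool:
    """Compare the latest average score against the recent baseline."""
    baseline = sum(history) / len(history)
    return latest < baseline - tolerance

assert not is_regression([0.82, 0.84, 0.83], 0.80)  # within tolerance
assert is_regression([0.82, 0.84, 0.83], 0.75)      # genuine drop
```

A tolerance band like this keeps normal run-to-run noise from triggering false alarms while still catching real quality drops.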
Avoid
- Evaluating specific execution paths instead of outcomes - agents may find valid alternative routes
- Using only simple test cases - complex interactions reveal different failure modes
- Ignoring token efficiency - unlimited context masks real-world performance issues
- Setting thresholds too high and blocking all agent changes, or too low and releasing poor quality