📊

evaluation

Safe

Build Agent Evaluation Frameworks

Also available from: Asmayaseen, ChakshuGautam, muratcankoylan

Agents require different evaluation approaches than traditional software. This skill provides frameworks to measure agent quality, track improvements, and validate context engineering choices.

Supports: Claude Code (CC)
🥉 73 Bronze
1. Download the skill ZIP

2. Upload to Claude: go to Settings → Capabilities → Skills → Upload skill

3. Enable and get started

Try it

Using "evaluation".
Task: Answer user question about recent AI research
Agent Output: 'Recent advances include... [accurate summary of 3 papers with citations]'
Rubric: factual_accuracy=0.9, completeness=0.8, citation_accuracy=0.95

Expected result:

Evaluation Result:
- Factual Accuracy: 0.9 (Claims match ground truth)
- Completeness: 0.8 (Covers key aspects but missing one area)
- Citation Accuracy: 0.95 (All citations valid)
- Overall Score: 0.88
- Status: PASS (threshold: 0.7)
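The scoring above can be sketched in Python. This is a minimal sketch: the equal-weight average and the function name `evaluate_rubric` are assumptions, while the dimension names and the 0.7 threshold come from the example.

```python
def evaluate_rubric(scores: dict, threshold: float = 0.7) -> dict:
    """Aggregate per-dimension rubric scores into an overall pass/fail verdict."""
    # Equal weights assumed; real rubrics may weight dimensions differently.
    overall = round(sum(scores.values()) / len(scores), 2)
    return {"overall": overall, "status": "PASS" if overall >= threshold else "FAIL"}

result = evaluate_rubric(
    {"factual_accuracy": 0.9, "completeness": 0.8, "citation_accuracy": 0.95}
)
# overall 0.88, status PASS -- matching the result above
```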

Using "evaluation".
Test set with 4 complexity levels:
- Simple: factual lookup
- Medium: comparative analysis
- Complex: multi-step reasoning
- Very Complex: research synthesis

Expected result:

Complexity Stratification Results:
- Simple: 92% pass rate (23/25)
- Medium: 85% pass rate (17/20)
- Complex: 68% pass rate (17/25)
- Very Complex: 45% pass rate (9/20)
Insight: Performance degrades significantly above 'complex' level
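The per-level pass rates above can be computed with a small helper. `stratified_pass_rates` and the record shape are hypothetical, assuming each test result carries a complexity level and a boolean outcome.

```python
from collections import defaultdict

def stratified_pass_rates(results):
    """Group results by complexity level and compute the pass rate per level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [passed, total]
    for r in results:
        counts[r["level"]][1] += 1
        counts[r["level"]][0] += r["passed"]  # True counts as 1, False as 0
    return {level: passed / total for level, (passed, total) in counts.items()}

rates = stratified_pass_rates(
    [{"level": "simple", "passed": True}] * 23
    + [{"level": "simple", "passed": False}] * 2
)
# rates["simple"] == 0.92, matching the 23/25 pass rate above
```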

Using "evaluation".
Compare context strategies: full context vs. summarized vs. RAG, over the same test set of 50 queries.

Expected result:

Context Strategy Comparison:
| Strategy   | Avg Score | Tokens Used | Efficiency |
|------------|-----------|-------------|------------|
| Full       | 0.85      | 45,000      | 0.019      |
| Summarized | 0.72      | 12,000      | 0.060      |
| RAG        | 0.78      | 8,000       | 0.098      |
Recommendation: RAG provides best efficiency for factual tasks
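The efficiency column appears to be average score per 1,000 tokens (e.g. 0.78 / 8 ≈ 0.098 for RAG). A sketch of that comparison, with the strategy data taken from the table above; the `efficiency` helper is an assumption.

```python
def efficiency(avg_score, tokens_used):
    """Average score per 1,000 tokens: quality delivered per token spent."""
    return avg_score / (tokens_used / 1000)

# (avg_score, tokens_used) per strategy, from the comparison table
strategies = {"Full": (0.85, 45_000), "Summarized": (0.72, 12_000), "RAG": (0.78, 8_000)}
ranked = sorted(strategies, key=lambda s: efficiency(*strategies[s]), reverse=True)
# ranked == ["RAG", "Summarized", "Full"]
```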

Security audit

Safe
v1 • 2/24/2026

All 29 static findings are false positives. The skill is documentation-only with Python code examples. External commands detection was triggered by markdown code block formatting. Network detection was triggered by a source URL in YAML frontmatter. Weak cryptographic algorithm detection was triggered by the word 'token' (LLM tokens). System/network reconnaissance detection was triggered by the word 'exploration' in evaluation methodology context.

- Files scanned: 1
- Lines analyzed: 239
- Findings: 0
- Total audits: 1

No security issues found

Detected patterns

- Markdown code block formatting (false positive)
- Source URL in YAML frontmatter (false positive)
- Token usage in LLM context (false positive)
- Exploration in evaluation context (false positive)
Audited by: claude

Quality rating

- Architecture: 38
- Maintainability: 95
- Content: 87
- Community: 50
- Security: 100
- Spec compliance: 91

What you can build

Build Quality Gates for Agent Pipelines

Create evaluation frameworks that run automatically before agent deployment to catch regressions and ensure minimum quality thresholds.

Compare Agent Configurations

Systematically compare different agent architectures, model choices, and context strategies using structured evaluation to identify optimal configurations.

Validate Context Engineering Choices

Test different context strategies, degradation patterns, and optimization techniques to ensure context engineering decisions improve agent quality.

Try these prompts

Basic Agent Response Evaluation
Evaluate the following agent response against the rubric dimensions. Rate each dimension from 0 to 1 and provide specific feedback.

Task: {task_description}
Agent Output: {agent_output}
Expected: {ground_truth_if_available}

Evaluate: factual_accuracy, completeness, citation_accuracy, tool_efficiency

Compare Two Agent Responses
Compare two agent responses for the same task. Identify which performs better on each dimension and explain why.

Task: {task_description}
Response A: {response_a}
Response B: {response_b}

Provide: dimension-by-dimension comparison, overall winner, specific reasoning

Evaluate Extended Interaction
Evaluate a multi-turn agent interaction for overall quality. Consider task completion, consistency, efficiency, and user experience.

Task: {task_description}
Conversation: {full_transcript}

Evaluate: outcome_success, process_quality, efficiency, coherence, user_satisfaction

Generate Evaluation Report
Analyze the evaluation results below and generate a summary report with trends, patterns, and recommendations.

Results: {evaluation_results_json}

Include: pass_rate, average_scores_per_dimension, notable failure patterns, improvement recommendations
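These templates can be driven programmatically. A minimal LLM-as-judge harness, assuming a caller-supplied `call_llm` function and a judge model that returns JSON; both the function name and the JSON contract are assumptions, not part of the skill.

```python
import json

JUDGE_TEMPLATE = """Evaluate the following agent response against the rubric dimensions.
Rate each dimension from 0 to 1 and provide specific feedback.

Task: {task}
Agent Output: {output}

Evaluate: factual_accuracy, completeness, citation_accuracy
Return JSON: {{"scores": {{"<dimension>": <0-1>}}, "feedback": "<text>"}}"""

def judge(task, output, call_llm):
    """Fill the judge template and parse the model's JSON verdict."""
    return json.loads(call_llm(JUDGE_TEMPLATE.format(task=task, output=output)))

# Stubbed call_llm for illustration; swap in a real model call.
verdict = judge(
    task="Answer user question about recent AI research",
    output="Recent advances include...",
    call_llm=lambda p: '{"scores": {"factual_accuracy": 0.9}, "feedback": "solid"}',
)
```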

Best practices

  • Start with small test sets during development - early agent changes show large effects
  • Use multi-dimensional rubrics rather than single metrics to capture agent quality
  • Supplement automated LLM evaluation with human review for edge cases
  • Track evaluation metrics over time to identify trends and regressions
  • Set pass thresholds based on your specific use case requirements, not arbitrary numbers

Avoid

  • Evaluating specific execution paths instead of outcomes - agents may find valid alternative routes
  • Using only simple test cases - complex interactions reveal different failure modes
  • Ignoring token efficiency - unlimited context masks real-world performance issues
  • Setting thresholds too high and blocking all agent changes, or too low and releasing poor quality

Frequently asked questions

What is LLM-as-judge and how does it work?
LLM-as-judge uses a language model to evaluate agent outputs rather than human reviewers. You provide the task, agent output, and evaluation criteria, and the judge model returns structured scores and feedback. It scales to large test sets while maintaining consistent evaluation standards.

How do I determine appropriate pass/fail thresholds?
Set thresholds based on your use case requirements. For high-stakes applications, use 0.8 or higher. For exploratory or prototype agents, 0.5-0.6 may be acceptable. Start conservative and adjust based on production quality feedback.

How many test cases do I need?
Start with 20-50 cases during development. Early in agent development, small samples reveal large effects. Expand to hundreds of cases before production deployment. Sample from real usage patterns and include known edge cases.

Can this skill execute agents for evaluation?
No, this skill provides evaluation methodology, rubric designs, and code examples. You integrate it with your existing agent execution framework to run evaluations automatically.

How do I handle non-deterministic agent behavior?
Run each test case multiple times (3-5 iterations) to capture variance. Report both average scores and standard deviation. Accept that some variation is normal for agent systems and focus on overall trends.

What dimensions should my evaluation rubric include?
At minimum: factual accuracy, completeness, and tool efficiency. Add citation accuracy for research tasks, coherence for multi-turn interactions, and efficiency metrics when token usage matters. Weight dimensions based on your use case priorities.
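The repeated-runs advice for non-deterministic agents can be sketched as follows; `run_once` stands in for one full agent execution plus evaluation, stubbed here with fixed scores for a deterministic illustration.

```python
import statistics

def repeated_eval(run_once, n=5):
    """Run a non-deterministic evaluation n times; report mean and spread."""
    scores = [run_once() for _ in range(n)]
    return {"mean": round(statistics.mean(scores), 3),
            "stdev": round(statistics.stdev(scores), 3)}

fake_scores = iter([0.80, 0.85, 0.90, 0.80, 0.85])  # deterministic stand-in
summary = repeated_eval(lambda: next(fake_scores))
# mean 0.84, stdev 0.042
```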

Developer details

File structure

📄 SKILL.md