evaluation
Build evaluation frameworks for agent systems
Also available from: ChakshuGautam, muratcankoylan, sickn33
Agent systems lack reliable quality measurement. This skill provides structured evaluation frameworks with multi-dimensional rubrics, test set design, and production monitoring to measure agent performance systematically.
Download the skill ZIP
Upload it in Claude
Go to Settings → Capabilities → Skills → Upload skill
Turn on the toggle and start using it
Try it out
Using "evaluation". Evaluate these 3 agent responses for factual accuracy, completeness, and citation quality.
Expected result:
- Response A: Overall 0.82 (Good) - Factual: 0.9, Completeness: 0.8, Citations: 0.7 - PASS
- Response B: Overall 0.58 (Acceptable) - Factual: 0.7, Completeness: 0.5, Citations: 0.6 - NEEDS IMPROVEMENT
- Response C: Overall 0.91 (Excellent) - Factual: 1.0, Completeness: 0.85, Citations: 0.9 - PASS
- Recommendation: Focus on improving completeness for responses similar to task type B
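The listing does not say how the overall score is derived from the per-dimension scores. One plausible reading is a weighted mean with a pass threshold, sketched below. The weights, the level boundaries, and the 0.7 pass threshold are all assumptions made for illustration, so the numbers this sketch produces will not necessarily match the sample output above.

```python
# Hypothetical weights, level boundaries, and pass threshold.
# None of these values come from the skill itself.
WEIGHTS = {"factual": 0.4, "completeness": 0.4, "citations": 0.2}
LEVELS = [(0.9, "Excellent"), (0.7, "Good"), (0.5, "Acceptable"), (0.0, "Poor")]
PASS_THRESHOLD = 0.7


def overall_score(scores: dict) -> float:
    """Weighted mean of per-dimension scores, each expected in [0.0, 1.0]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight


def label(score: float) -> str:
    """Map a score to the first level whose lower bound it reaches."""
    for floor, name in LEVELS:
        if score >= floor:
            return name
    return LEVELS[-1][1]


def evaluate(scores: dict) -> dict:
    """Return the overall score, its level, and a pass/fail flag."""
    score = overall_score(scores)
    return {
        "overall": round(score, 2),
        "level": label(score),
        "passed": score >= PASS_THRESHOLD,
    }
```

Keeping the weights in one dictionary makes it easy to compare rubrics: changing a weight changes the overall score without touching the per-dimension judgments.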
Using "evaluation". Create a test set for a research agent.
Expected result:
- Test Set: 5 tests created
- simple_lookup: Single factual query (complexity: simple)
- context_retrieval: Preference-based recommendation (complexity: medium)
- multi_step_reasoning: Data analysis task (complexity: complex)
- Expected tool calls: 1-3 for simple, 3-5 for medium, 5+ for complex
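A test set like the one above could be represented as plain data. The sketch below is an assumption about shape, not the skill's actual schema: `EvalCase` and `tool_calls_in_range` are hypothetical names, and only the three cases named in the sample output are included. The tool-call ranges are the ones listed above, with `None` standing for "no upper bound".

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Expected tool-call ranges per complexity level, from the sample output.
# None means the range has no upper bound ("5+").
EXPECTED_TOOL_CALLS = {"simple": (1, 3), "medium": (3, 5), "complex": (5, None)}


@dataclass
class EvalCase:
    """One test case (hypothetical structure)."""
    name: str
    description: str
    complexity: str
    tags: List[str] = field(default_factory=list)
    ground_truth: Optional[str] = None  # left empty here; fill in per case

    def __post_init__(self):
        if self.complexity not in EXPECTED_TOOL_CALLS:
            raise ValueError(f"unknown complexity: {self.complexity!r}")


TEST_SET = [
    EvalCase("simple_lookup", "Single factual query", "simple"),
    EvalCase("context_retrieval", "Preference-based recommendation", "medium"),
    EvalCase("multi_step_reasoning", "Data analysis task", "complex"),
]


def tool_calls_in_range(case: EvalCase, n_calls: int) -> bool:
    """Check an observed tool-call count against the expected range."""
    low, high = EXPECTED_TOOL_CALLS[case.complexity]
    return n_calls >= low and (high is None or n_calls <= high)
```

The tool-call check is best treated as a sanity signal rather than a pass/fail gate, since an agent can reach a correct outcome with more or fewer calls than expected.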
Using "evaluation". Set up production monitoring for quality alerts.
Expected result:
- Production Monitor configured
- Sample rate: 1% of interactions
- Warning threshold: 85% pass rate
- Critical threshold: 70% pass rate
- Alert types: quality_drop, low_score, regression
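The sampling rate and thresholds above can be turned into a small decision function. This is a minimal sketch under two assumptions: that an alert fires when the pass rate falls strictly below a threshold, and that sampling is a simple random draw per interaction. The function names are hypothetical, and the three alert types listed above are not modeled because the listing does not define them.

```python
import random

SAMPLE_RATE = 0.01         # 1% of interactions
WARNING_PASS_RATE = 0.85   # warning below an 85% pass rate
CRITICAL_PASS_RATE = 0.70  # critical below a 70% pass rate


def should_sample(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Decide whether to evaluate this interaction."""
    return rng.random() < rate


def alert_level(passed: int, total: int) -> str:
    """Classify the pass rate of the sampled interactions."""
    if total == 0:
        return "no_data"
    pass_rate = passed / total
    if pass_rate < CRITICAL_PASS_RATE:
        return "critical"
    if pass_rate < WARNING_PASS_RATE:
        return "warning"
    return "ok"
```

Passing a seeded `random.Random` instance keeps the sampling reproducible in tests, while production code can pass an unseeded one.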
Security audit
Safe. This is a legitimate evaluation framework skill containing only documentation and Python evaluation logic. All 79 static findings are false positives caused by the scanner misinterpreting Markdown code blocks (``` delimiters) as shell backticks, dictionary structures as key files, and floating-point score values (0.0-1.0) as cryptographic algorithms. The actual runtime code contains no network calls, credential access, command execution, or data exfiltration patterns.
Risk factors
External commands (20)
Network access (1)
File system access (1)
Quality score
What you can build
Test agent performance
Systematically measure agent outputs against defined quality dimensions and pass thresholds
Validate context strategies
Compare how different context engineering approaches affect agent quality and token usage
Track quality trends
Monitor production agent quality over time with automated sampling and alert systems
Try these prompts
Create a test set with 5 test cases of varying complexity (simple to very complex) for evaluating an agent that researches technical topics. Include complexity levels, tags, and ground truth expectations.
Design a multi-dimensional evaluation rubric for [use case: customer support agent]. Define 5 dimensions with weights, level descriptions from 1.0 to 0.0, and explain scoring rationale.
Evaluate the following agent outputs against this rubric. For each output, provide dimension scores, overall score, and pass/fail determination with reasoning.
Build an evaluation pipeline that runs on code changes. Include test set loading, parallel execution, result aggregation, and failure reporting to Slack.
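The pipeline described in the last prompt (test set loading, parallel execution, result aggregation, failure reporting) might look roughly like the sketch below. Everything here is hypothetical: the JSON test-set format, the `run_case` callable that scores one case, and the 0.7 threshold. Slack delivery is deliberately left out; `report` is a placeholder callback that a real pipeline would replace with its own notifier.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def load_test_set(text: str) -> List[dict]:
    """Parse a JSON array of test cases (assumed format)."""
    return json.loads(text)


def run_pipeline(
    cases: List[dict],
    run_case: Callable[[dict], float],
    threshold: float = 0.7,
    report: Callable[[str], None] = print,
    max_workers: int = 4,
) -> dict:
    """Score all cases in parallel, aggregate, and report failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input cases.
        scores = list(pool.map(run_case, cases))
    failures = [c["name"] for c, s in zip(cases, scores) if s < threshold]
    summary = {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }
    if failures:
        report(f"{len(failures)}/{len(cases)} eval cases failed: {', '.join(failures)}")
    return summary
```

Threads suit this workload when each case mostly waits on a model or tool call; CPU-bound scoring would be a reason to consider processes instead.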
Best practices
- Combine LLM automated evaluation with human review for edge cases and subtle issues
- Evaluate outcomes, not specific execution paths, to account for multiple valid agent approaches
- Track metrics over time to detect regressions and measure the impact of optimizations
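Tracking metrics over time, as the last practice suggests, can start with a simple comparison against a recent baseline. The sketch below is one possible approach, not the skill's method: `is_regression`, the five-run window, and the 0.05 tolerance are all assumed for illustration.

```python
from statistics import mean
from typing import List


def is_regression(
    history: List[float],
    current: float,
    window: int = 5,
    tolerance: float = 0.05,
) -> bool:
    """Flag a run whose score falls below the recent average by more than `tolerance`."""
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history[-window:])
    return current < baseline - tolerance
```

The tolerance absorbs normal run-to-run noise; setting it too low produces alerts on every minor fluctuation.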
Avoid
- Evaluating specific steps rather than outcomes, which penalizes valid alternative approaches
- Using single metrics instead of multi-dimensional rubrics that capture different quality aspects
- Testing only with unlimited context, missing performance cliffs that occur with realistic limits
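To catch the performance cliffs mentioned in the last point, the same test set can be scored at several context budgets and adjacent budgets compared. The sketch below assumes the scores have already been collected into a mapping from token budget to overall score; `find_cliffs` and the 0.15 drop threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def find_cliffs(
    scores_by_budget: Dict[int, float],
    max_drop: float = 0.15,
) -> List[Tuple[int, int, float]]:
    """Return (larger_budget, smaller_budget, drop) wherever the score
    falls by more than `max_drop` between adjacent context budgets."""
    budgets = sorted(scores_by_budget, reverse=True)
    cliffs = []
    for larger, smaller in zip(budgets, budgets[1:]):
        drop = scores_by_budget[larger] - scores_by_budget[smaller]
        if drop > max_drop:
            cliffs.append((larger, smaller, round(drop, 2)))
    return cliffs
```

A cliff between two budgets is a cue to test intermediate values, since the real breaking point lies somewhere in that interval.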