Skill: evaluation

Name: evaluation
Author: Asmayaseen

Safe ⚙️ External commands 🌐 Network access 📁 File system access

Build evaluation frameworks for agent systems

Also available from: ChakshuGautam, muratcankoylan, sickn33

Agent systems lack reliable quality measurement. This skill provides structured evaluation frameworks with multi-dimensional rubrics, test set design, and production monitoring to measure agent performance systematically.

Supports: Claude, Codex, Code (CC)

🥉 76 Bronze

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn on the toggle and start using it

Try it out

Using "evaluation". Evaluate these 3 agent responses for factual accuracy, completeness, and citation quality.

Expected result:

Response A: Overall 0.82 (Good) - Factual: 0.9, Completeness: 0.8, Citations: 0.7 - PASS
Response B: Overall 0.58 (Acceptable) - Factual: 0.7, Completeness: 0.5, Citations: 0.6 - NEEDS IMPROVEMENT
Response C: Overall 0.91 (Excellent) - Factual: 1.0, Completeness: 0.85, Citations: 0.9 - PASS
Recommendation: Focus on improving completeness for responses similar to task type B
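
One way to arrive at overall scores like these is a weighted average of the per-dimension scores. The sketch below illustrates that idea under stated assumptions and is not the skill's actual implementation: the weights, the 0.7 pass threshold, and the function names are hypothetical, so its output will not necessarily match the overall values listed above.

```python
# Hypothetical dimension weights and pass threshold, chosen only for illustration.
WEIGHTS = {"factual": 0.4, "completeness": 0.4, "citations": 0.2}
PASS_THRESHOLD = 0.7


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each in the range 0.0-1.0."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def verdict(score, threshold=PASS_THRESHOLD):
    """Map an overall score to a pass/fail label."""
    return "PASS" if score >= threshold else "NEEDS IMPROVEMENT"


if __name__ == "__main__":
    # Dimension scores for Response A, copied from the expected result above.
    response_a = {"factual": 0.9, "completeness": 0.8, "citations": 0.7}
    score = overall_score(response_a)
    print(f"Response A: {score:.2f} - {verdict(score)}")
```

Keeping the weights in one dictionary makes it easy to see how a change in emphasis, for example weighting citations more heavily, shifts the verdicts.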

Using "evaluation". Create a test set for a research agent.

Expected result:

Test Set: 5 tests created
simple_lookup: Single factual query (complexity: simple)
context_retrieval: Preference-based recommendation (complexity: medium)
multi_step_reasoning: Data analysis task (complexity: complex)
Expected tool calls: 1-3 for simple, 3-5 for medium, 5+ for complex
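
A test set like this could be represented as plain data plus a check on the number of tool calls. The following is a minimal sketch, assuming a hypothetical `EvalCase` class and made-up example prompts and tags. Only the test names, complexity levels, and tool-call ranges come from the expected result above.

```python
from dataclasses import dataclass, field

# Tool-call ranges per complexity level, taken from the expected result above.
# None means no upper bound (the "5+" case).
TOOL_CALL_RANGES = {"simple": (1, 3), "medium": (3, 5), "complex": (5, None)}


@dataclass
class EvalCase:
    """One test case (hypothetical structure, for illustration only)."""
    name: str
    prompt: str
    complexity: str
    tags: list = field(default_factory=list)

    def tool_calls_ok(self, observed: int) -> bool:
        """Check an observed tool-call count against the expected range."""
        low, high = TOOL_CALL_RANGES[self.complexity]
        return observed >= low and (high is None or observed <= high)


# Prompts and tags below are invented examples, not part of the skill.
TEST_SET = [
    EvalCase("simple_lookup", "Look up a single fact.", "simple", ["factual"]),
    EvalCase("context_retrieval", "Recommend an option given stated preferences.",
             "medium", ["retrieval"]),
    EvalCase("multi_step_reasoning", "Analyze a dataset and summarize findings.",
             "complex", ["analysis"]),
]
```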

Using "evaluation". Set up production monitoring for quality alerts.

Expected result:

Production Monitor configured
Sample rate: 1% of interactions
Warning threshold: 85% pass rate
Critical threshold: 70% pass rate
Alert types: quality_drop, low_score, regression
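
The configuration above can be read as a sampler plus a rolling pass rate compared against two thresholds. The sketch below shows one possible shape for that logic. The class name, the per-response pass cutoff of 0.7, the window size, and the assumption that an alert fires when the pass rate falls below a threshold are all hypothetical. The sample rate and the two thresholds are taken from the configuration above.

```python
import random
from collections import deque


class ProductionMonitor:
    """Samples interactions and tracks a rolling pass rate (illustrative only)."""

    def __init__(self, sample_rate=0.01, warning=0.85, critical=0.70,
                 pass_score=0.7, window=500, seed=None):
        self.sample_rate = sample_rate        # fraction of interactions evaluated
        self.warning = warning                # pass rate below this: warning
        self.critical = critical              # pass rate below this: critical
        self.pass_score = pass_score          # assumed per-response pass cutoff
        self.results = deque(maxlen=window)   # rolling window of pass/fail flags
        self._rng = random.Random(seed)

    def should_sample(self):
        """Decide whether to evaluate the current interaction."""
        return self._rng.random() < self.sample_rate

    def record(self, score):
        """Store whether a sampled interaction passed."""
        self.results.append(score >= self.pass_score)

    def pass_rate(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def status(self):
        """Return 'no_data', 'ok', 'warning', or 'critical'."""
        rate = self.pass_rate()
        if rate is None:
            return "no_data"
        if rate < self.critical:
            return "critical"
        if rate < self.warning:
            return "warning"
        return "ok"
```

A bounded window keeps the pass rate responsive to recent behavior instead of averaging over the whole deployment history.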

Security audit

Safe

v5 • 1/16/2026

This is a legitimate evaluation framework skill containing only documentation and Python evaluation logic. All 79 static findings are FALSE POSITIVES caused by the scanner misinterpreting Markdown code blocks (``` delimiters) as shell backticks, dictionary structures as key files, and floating-point score values (0.0-1.0) as cryptographic algorithms. No network calls, no credential access, no command execution, and no data exfiltration patterns exist in the actual runtime code.

Files scanned

1,280

Lines analyzed

Findings

Total audits

Auditor: claude

Quality score

Architecture

100

Maintainability

Content

Community

100

Security

Spec compliance

What you can build

Test agent performance

Systematically measure agent outputs against defined quality dimensions and pass thresholds

Validate context strategies

Compare how different context engineering approaches affect agent quality and token usage

Track quality trends

Monitor production agent quality over time with automated sampling and alert systems

Try these prompts

Create test set

Create a test set with 5 test cases of varying complexity (simple to very complex) for evaluating an agent that researches technical topics. Include complexity levels, tags, and ground truth expectations.

Design rubric

Design a multi-dimensional evaluation rubric for [use case: customer support agent]. Define 5 dimensions with weights, level descriptions from 1.0 to 0.0, and explain scoring rationale.

Run evaluation

Evaluate the following agent outputs against this rubric. For each output, provide dimension scores, overall score, and pass/fail determination with reasoning.

Build pipeline

Build an evaluation pipeline that runs on code changes. Include test set loading, parallel execution, result aggregation, and failure reporting to Slack.
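
As a rough idea of what the "Build pipeline" prompt is asking for, the sketch below runs test cases in parallel, aggregates the results, and reports failures through a callback. Everything here is an assumption made for illustration: `run_pipeline`, `run_agent`, `score_output`, and `notify` are hypothetical names, and `notify` stands in for whatever Slack integration a real pipeline would use.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(test_cases, run_agent, score_output, notify,
                 pass_threshold=0.7, max_workers=4):
    """Evaluate all cases in parallel and report failures via notify()."""

    def evaluate(case):
        output = run_agent(case["prompt"])   # call the agent under test
        score = score_output(case, output)   # score its output, 0.0-1.0
        return {"name": case["name"], "score": score,
                "passed": score >= pass_threshold}

    # pool.map preserves input order, so results line up with test_cases.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, test_cases))

    failures = [r["name"] for r in results if not r["passed"]]
    summary = {"total": len(results),
               "passed": len(results) - len(failures),
               "failures": failures}

    if failures:
        notify(f"{len(failures)}/{len(results)} evaluation cases failed: "
               + ", ".join(failures))
    return summary
```

Passing the agent, scorer, and notifier in as functions keeps the pipeline testable without network access.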

Best practices

Combine LLM automated evaluation with human review for edge cases and subtle issues
Evaluate outcomes, not specific execution paths, to account for multiple valid agent approaches
Track metrics over time to detect regressions and measure the impact of optimizations
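
Tracking metrics over time can be as simple as comparing recent runs against an earlier baseline. The following is a minimal sketch of that comparison. The function name, the window sizes, and the 0.05 tolerance are hypothetical choices, not values from the skill.

```python
from statistics import mean


def detect_regression(history, baseline_runs=5, recent_runs=3, tolerance=0.05):
    """Return True if recent scores dropped below the baseline by more than tolerance.

    history is a chronological list of overall scores (0.0-1.0), oldest first.
    """
    if len(history) < baseline_runs + recent_runs:
        return False  # not enough data to compare
    baseline = mean(history[:baseline_runs])
    recent = mean(history[-recent_runs:])
    return (baseline - recent) > tolerance
```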

Avoid

Evaluating specific steps rather than outcomes, which penalizes valid alternative approaches
Using single metrics instead of multi-dimensional rubrics that capture different quality aspects
Testing only with unlimited context, missing performance cliffs that occur with realistic limits
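
To look for the performance cliffs mentioned in the last point, one approach is to score the same test set at several context limits and find where quality drops sharply. The sketch below assumes the scores have already been collected. The function name, the 0.15 drop threshold, and the token limits in the example are hypothetical.

```python
def find_context_cliff(scores_by_limit, max_drop=0.15):
    """Return the largest context limit at which the score falls by more than
    max_drop relative to the next larger limit, or None if there is no such drop.

    scores_by_limit maps a context limit (in tokens) to an overall score.
    """
    limits = sorted(scores_by_limit, reverse=True)  # largest limit first
    for larger, smaller in zip(limits, limits[1:]):
        if scores_by_limit[larger] - scores_by_limit[smaller] > max_drop:
            return smaller
    return None


if __name__ == "__main__":
    # Invented scores, used only to show the call shape.
    example = {128_000: 0.90, 32_000: 0.88, 8_000: 0.60}
    print(find_context_cliff(example))
```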