Skill: evaluation

Name: evaluation
Author: Asmayaseen

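Safe · ⚙️ External commands · 🌐 Network access · 📁 File system access

Build evaluation frameworks for agent systems

Also available from: ChakshuGautam, muratcankoylan, sickn33

Agent systems lack reliable ways to measure quality. This skill provides a structured evaluation framework, including multi-dimensional scoring rubrics, test set design, and production monitoring, for measuring agent performance systematically.

Supports: Claude, Codex, Code (CC)

🥉 76 Bronze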

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Open it and start using it

Test it

Using "evaluation". Evaluate these 3 agent responses for factual accuracy, completeness, and citation quality.

Expected result:

Response A: Overall 0.82 (Good) - Factual: 0.9, Completeness: 0.8, Citations: 0.7 - PASS
Response B: Overall 0.58 (Acceptable) - Factual: 0.7, Completeness: 0.5, Citations: 0.6 - NEEDS IMPROVEMENT
Response C: Overall 0.91 (Excellent) - Factual: 1.0, Completeness: 0.85, Citations: 0.9 - PASS
Recommendation: Focus on improving completeness for responses similar to task type B
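
One way per-dimension scores might roll up into an overall score, a label, and a pass/fail verdict is a weighted average. The sketch below is illustrative only: the weights, the 0.7 pass cutoff, and the label bands are assumptions, since the listing does not state the values the skill actually uses, and the overall figures above will not necessarily be reproduced.

```python
# Assumed weights; the listing does not say how the skill weighs each dimension.
WEIGHTS = {"factual": 0.4, "completeness": 0.4, "citations": 0.2}
PASS_THRESHOLD = 0.7  # assumed pass cutoff
# Assumed label bands, checked from highest floor to lowest.
LABEL_BANDS = [(0.9, "Excellent"), (0.7, "Good"), (0.5, "Acceptable")]


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each in [0.0, 1.0]."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def label(score):
    """Map an overall score to a quality label using the assumed bands."""
    for floor, name in LABEL_BANDS:
        if score >= floor:
            return name
    return "Poor"


def verdict(score):
    """Pass/fail determination against the assumed threshold."""
    return "PASS" if score >= PASS_THRESHOLD else "NEEDS IMPROVEMENT"


# Dimension scores for Response A, copied from the expected result above.
response_a = {"factual": 0.9, "completeness": 0.8, "citations": 0.7}
score_a = overall_score(response_a)
print(f"Response A: {score_a:.2f} ({label(score_a)}) - {verdict(score_a)}")
```

Dividing by the total weight keeps the result in [0.0, 1.0] even if the weights do not sum to exactly 1.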

Using "evaluation". Create a test set for a research agent.

Expected result:

Test Set: 5 tests created
simple_lookup: Single factual query (complexity: simple)
context_retrieval: Preference-based recommendation (complexity: medium)
multi_step_reasoning: Data analysis task (complexity: complex)
Expected tool calls: 1-3 for simple, 3-5 for medium, 5+ for complex
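
A test set like this could be represented as plain data. The sketch below is a minimal illustration, not the skill's own schema: the `EvalCase` class, its fields, and the prompts are hypothetical, while the case names, complexity levels, and tool-call ranges are taken from the expected result above.

```python
from dataclasses import dataclass, field

# Tool-call ranges from the expected result above; None means no upper bound ("5+").
TOOL_CALL_RANGES = {"simple": (1, 3), "medium": (3, 5), "complex": (5, None)}


@dataclass
class EvalCase:
    """Hypothetical container for one test case."""
    name: str
    prompt: str  # illustrative prompt text, not from the listing
    complexity: str
    tags: list = field(default_factory=list)

    def tool_calls_in_range(self, observed):
        """Check an observed tool-call count against the range for this complexity."""
        low, high = TOOL_CALL_RANGES[self.complexity]
        return observed >= low and (high is None or observed <= high)


TEST_SET = [
    EvalCase("simple_lookup", "What year was Python 3.0 released?", "simple", ["factual"]),
    EvalCase("context_retrieval", "Recommend a library given my stated preferences.", "medium", ["preference"]),
    EvalCase("multi_step_reasoning", "Analyze this dataset and summarize the trends.", "complex", ["analysis"]),
]

for case in TEST_SET:
    print(case.name, case.complexity, TOOL_CALL_RANGES[case.complexity])
```

Keeping the expected tool-call range keyed by complexity, rather than stored per case, means a new case only needs a complexity level to inherit the right bounds.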

Using "evaluation". Set up production monitoring for quality alerts.

Expected result:

Production Monitor configured
Sample rate: 1% of interactions
Warning threshold: 85% pass rate
Critical threshold: 70% pass rate
Alert types: quality_drop, low_score, regression
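
The sampling and threshold logic implied by this configuration can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, an alert is assumed to fire when the pass rate falls strictly below a threshold, and the three alert types (`quality_drop`, `low_score`, `regression`) are not modeled.

```python
import random

SAMPLE_RATE = 0.01         # 1% of interactions, from the configuration above
WARNING_PASS_RATE = 0.85   # warning threshold, from the configuration above
CRITICAL_PASS_RATE = 0.70  # critical threshold, from the configuration above


def should_sample(rng, rate=SAMPLE_RATE):
    """Decide whether to evaluate this interaction; rng.random() is in [0.0, 1.0)."""
    return rng.random() < rate


def alert_level(passed, total):
    """Classify a window of sampled evaluations by pass rate (assumed semantics)."""
    if total == 0:
        return "no_data"
    pass_rate = passed / total
    if pass_rate < CRITICAL_PASS_RATE:
        return "critical"
    if pass_rate < WARNING_PASS_RATE:
        return "warning"
    return "ok"


rng = random.Random(0)  # seeded only so the sketch is repeatable
sampled = sum(should_sample(rng) for _ in range(10_000))
print(f"sampled {sampled} of 10000 interactions")
print(alert_level(passed=80, total=100))
```

Checking the critical threshold first ensures that a pass rate below 70% is reported as critical rather than as a warning.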

Security audit

Safe

v5 • 1/16/2026

This is a legitimate evaluation framework skill containing only documentation and Python evaluation logic. All 79 static findings are FALSE POSITIVES caused by the scanner misinterpreting Markdown code blocks (``` delimiters) as shell backticks, dictionary structures as key files, and floating-point score values (0.0-1.0) as cryptographic algorithms. No network calls, no credential access, no command execution, and no data exfiltration patterns exist in the actual runtime code.

Files scanned

1,280

Lines analyzed

Findings

Total audits

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

100

Security

Spec compliance

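What you can build

Test agent performance

Systematically measure agent output against defined quality dimensions and pass thresholds

Validate context strategies

Compare how different context engineering approaches affect agent quality and token usage

Track quality trends

Monitor how agent quality in production changes over time through automated sampling and alerting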

Try these prompts

Create a test set

Create a test set with 5 test cases of varying complexity (simple to very complex) for evaluating an agent that researches technical topics. Include complexity levels, tags, and ground truth expectations.

Design a rubric

Design a multi-dimensional evaluation rubric for [use case: customer support agent]. Define 5 dimensions with weights, level descriptions from 1.0 to 0.0, and explain scoring rationale.

Run an evaluation

Evaluate the following agent outputs against this rubric. For each output, provide dimension scores, overall score, and pass/fail determination with reasoning.

Build a pipeline

Build an evaluation pipeline that runs on code changes. Include test set loading, parallel execution, result aggregation, and failure reporting to Slack.
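
For orientation, the stages named in this prompt (test set loading, parallel execution, result aggregation, failure reporting) might fit together roughly as below. This is a sketch under stated assumptions, not what the skill generates: `run_case` is a hypothetical placeholder that reads a precomputed score instead of calling an agent, the 0.7 pass cutoff is assumed, and the Slack step only builds a JSON body with a `text` field (the shape Slack incoming webhooks accept) without sending anything.

```python
import json
from concurrent.futures import ThreadPoolExecutor

PASS_THRESHOLD = 0.7  # assumed pass cutoff


def load_test_set(raw_json):
    """Load test cases from a JSON array; a real pipeline might read a file instead."""
    return json.loads(raw_json)


def run_case(case):
    """Hypothetical evaluator: uses a precomputed score rather than running an agent."""
    score = case["score"]
    return {"name": case["name"], "score": score, "passed": score >= PASS_THRESHOLD}


def run_pipeline(cases, evaluate=run_case, max_workers=4):
    """Evaluate cases in parallel threads and aggregate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, cases))  # map preserves input order
    failures = [r for r in results if not r["passed"]]
    return {"total": len(results), "passed": len(results) - len(failures), "failures": failures}


def slack_payload(summary):
    """Build a JSON message body summarizing failures; sending it is left out."""
    lines = [f"Evaluation: {summary['passed']}/{summary['total']} passed"]
    lines += [f"- {f['name']}: {f['score']:.2f}" for f in summary["failures"]]
    return json.dumps({"text": "\n".join(lines)})


cases = load_test_set('[{"name": "simple_lookup", "score": 0.9}, {"name": "multi_step_reasoning", "score": 0.5}]')
summary = run_pipeline(cases)
print(slack_payload(summary))
```

Threads suit this stage when each evaluation mostly waits on network calls to an agent or model; CPU-bound scoring would call for a different executor.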