evaluation
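Build an evaluation framework for agent systems
Also available from: sickn33, ChakshuGautam, muratcankoylan
Agent systems lack reliable ways to measure quality. This skill provides a structured evaluation framework, including multi-dimensional rubrics, test set design, and production monitoring, to measure agent performance systematically.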
Download the skill ZIP
Upload in Claude
Go to Settings → Capabilities → Skills → Upload skill
Enable it and start using it
Try it
Using "evaluation". Evaluate these 3 agent responses for factual accuracy, completeness, and citation quality.
Expected result:
- Response A: Overall 0.82 (Good) - Factual: 0.9, Completeness: 0.8, Citations: 0.7 - PASS
- Response B: Overall 0.58 (Acceptable) - Factual: 0.7, Completeness: 0.5, Citations: 0.6 - NEEDS IMPROVEMENT
- Response C: Overall 0.91 (Excellent) - Factual: 1.0, Completeness: 0.85, Citations: 0.9 - PASS
- Recommendation: Focus on improving completeness for responses similar to task type B
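A minimal sketch of how per-dimension scores like those above could be combined into an overall score and a pass/fail verdict. The weights, the 0.7 pass threshold, and the name `score_response` are assumptions for illustration, not the skill's actual values, so the output will not necessarily match the sample numbers above.

```python
from typing import Dict

# Hypothetical weights; the skill's real rubric may weight dimensions differently.
WEIGHTS: Dict[str, float] = {"factual": 0.4, "completeness": 0.3, "citations": 0.3}
PASS_THRESHOLD = 0.7  # assumed pass mark, not taken from the skill


def score_response(scores: Dict[str, float]) -> Dict[str, object]:
    """Combine per-dimension scores (0.0-1.0) into a weighted overall score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Normalize by the total weight so the weights need not sum to exactly 1.
    total_weight = sum(WEIGHTS.values())
    overall = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight
    return {"overall": round(overall, 2), "passed": overall >= PASS_THRESHOLD}


result_a = score_response({"factual": 0.9, "completeness": 0.8, "citations": 0.7})
result_b = score_response({"factual": 0.7, "completeness": 0.5, "citations": 0.6})
print(result_a, result_b)
```

Keeping the weights in one table makes it easy to see how much each dimension contributes and to adjust them per use case.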
Using "evaluation". Create a test set for a research agent.
Expected result:
- Test Set: 5 tests created
- simple_lookup: Single factual query (complexity: simple)
- context_retrieval: Preference-based recommendation (complexity: medium)
- multi_step_reasoning: Data analysis task (complexity: complex)
- Expected tool calls: 1-3 for simple, 3-5 for medium, 5+ for complex
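One way such a test set might be represented in code is sketched below. The class name `AgentTestCase`, its fields, and the helper `tool_calls_in_range` are hypothetical; only the test names, descriptions, complexity levels, and tool-call ranges come from the sample output above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Expected tool-call ranges per complexity level, from the sample output.
# An upper bound of None means "no upper limit" (the "5+" case).
TOOL_CALL_RANGES: Dict[str, Tuple[int, Optional[int]]] = {
    "simple": (1, 3),
    "medium": (3, 5),
    "complex": (5, None),
}


@dataclass
class AgentTestCase:
    """Hypothetical container for one evaluation test."""
    name: str
    description: str
    complexity: str
    tags: List[str] = field(default_factory=list)

    def tool_calls_in_range(self, observed: int) -> bool:
        """Check an observed tool-call count against the expected range."""
        low, high = TOOL_CALL_RANGES[self.complexity]
        return observed >= low and (high is None or observed <= high)


TEST_SET = [
    AgentTestCase("simple_lookup", "Single factual query", "simple"),
    AgentTestCase("context_retrieval", "Preference-based recommendation", "medium"),
    AgentTestCase("multi_step_reasoning", "Data analysis task", "complex"),
]

for case in TEST_SET:
    print(case.name, case.complexity, TOOL_CALL_RANGES[case.complexity])
```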
Using "evaluation". Set up production monitoring for quality alerts.
Expected result:
- Production Monitor configured
- Sample rate: 1% of interactions
- Warning threshold: 85% pass rate
- Critical threshold: 70% pass rate
- Alert types: quality_drop, low_score, regression
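The configuration above could translate into sampling and threshold checks roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes an alert fires when the pass rate of sampled interactions falls below a threshold. It does not model the individual alert types (quality_drop, low_score, regression), whose exact semantics are not given here.

```python
import random
from typing import List, Optional

SAMPLE_RATE = 0.01         # 1% of interactions
WARNING_PASS_RATE = 0.85   # warning threshold from the sample output
CRITICAL_PASS_RATE = 0.70  # critical threshold from the sample output


def should_sample(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Decide whether one interaction is sent for evaluation."""
    return rng.random() < rate


def alert_level(passed: List[bool]) -> Optional[str]:
    """Map the pass rate of sampled evaluations to an alert level.

    Assumes an alert fires when the pass rate is strictly below a threshold.
    """
    if not passed:
        return None  # nothing sampled yet, so nothing to alert on
    pass_rate = sum(passed) / len(passed)
    if pass_rate < CRITICAL_PASS_RATE:
        return "critical"
    if pass_rate < WARNING_PASS_RATE:
        return "warning"
    return None


rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled = sum(should_sample(rng) for _ in range(10_000))
print(f"sampled {sampled} of 10000 interactions")
print(alert_level([True] * 8 + [False] * 2))
```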
Security audit
Safe
This is a legitimate evaluation framework skill containing only documentation and Python evaluation logic. All 79 static findings are false positives caused by the scanner misinterpreting Markdown code blocks (``` delimiters) as shell backticks, dictionary structures as key files, and floating-point score values (0.0-1.0) as cryptographic algorithms. No network calls, no credential access, no command execution, and no data exfiltration patterns exist in the actual runtime code.
Risk factors
⚙️ External commands (20)
🌐 Network access (1)
📁 File system access (1)
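What you can build
Test agent performance
Systematically measure agent outputs against defined quality dimensions and pass thresholds
Validate context strategies
Compare how different context engineering approaches affect agent quality and token usage
Track quality trends
Monitor how agent quality in production changes over time through automated sampling and alerting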
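For the context-strategy comparison mentioned above, a summary along these lines could put quality and token usage side by side. The strategy names, the `summarize` function, and the run data are made up for illustration; real numbers would come from evaluated agent runs.

```python
from statistics import mean
from typing import Dict, List, Tuple

Run = Tuple[float, int]  # (overall score 0.0-1.0, tokens used)


def summarize(runs_by_strategy: Dict[str, List[Run]]) -> Dict[str, Dict[str, float]]:
    """Average score and token usage per context strategy."""
    summary: Dict[str, Dict[str, float]] = {}
    for strategy, runs in runs_by_strategy.items():
        if not runs:
            continue  # skip strategies with no evaluated runs
        avg_score = mean(score for score, _ in runs)
        avg_tokens = mean(tokens for _, tokens in runs)
        if avg_tokens <= 0:
            raise ValueError(f"non-positive token usage for {strategy!r}")
        summary[strategy] = {
            "avg_score": avg_score,
            "avg_tokens": avg_tokens,
            # Quality obtained per 1,000 tokens spent.
            "score_per_1k_tokens": avg_score / (avg_tokens / 1000),
        }
    return summary


# Illustrative input only; not measured results.
example = {
    "full_history": [(0.8, 4000), (0.9, 6000)],
    "summarized_history": [(0.8, 2000), (0.7, 2000)],
}
report = summarize(example)
for name, stats in report.items():
    print(name, stats)
```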
Try these prompts
Create a test set with 5 test cases of varying complexity (simple to very complex) for evaluating an agent that researches technical topics. Include complexity levels, tags, and ground truth expectations.
Design a multi-dimensional evaluation rubric for [use case: customer support agent]. Define 5 dimensions with weights, level descriptions from 1.0 to 0.0, and explain scoring rationale.
Evaluate the following agent outputs against this rubric. For each output, provide dimension scores, overall score, and pass/fail determination with reasoning.
Build an evaluation pipeline that runs on code changes. Include test set loading, parallel execution, result aggregation, and failure reporting to Slack.
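Best practices
- Combine automated LLM evaluation with human review to handle edge cases and subtle issues
- Evaluate outcomes rather than specific execution paths, to allow for multiple valid agent approaches
- Track metrics over time to detect regressions and measure the impact of optimizations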
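Avoid
- Evaluating specific steps rather than outcomes, which penalizes valid alternative approaches
- Using a single metric instead of a multi-dimensional rubric that captures different aspects of quality
- Testing only with unlimited context, which misses the performance cliffs that occur under real-world limits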
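To make the last point concrete, the sketch below scans scores measured at several context limits and flags sharp drops between adjacent limits. The function name `find_cliffs`, the 0.15 drop threshold, and the sample scores are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def find_cliffs(
    scores_by_limit: Dict[int, float], max_drop: float = 0.15
) -> List[Tuple[int, int, float]]:
    """Return (larger_limit, smaller_limit, drop) for each sharp score drop.

    scores_by_limit maps a context limit in tokens to the overall score
    measured under that limit. max_drop is an assumed tolerance.
    """
    limits = sorted(scores_by_limit, reverse=True)
    cliffs = []
    # Compare each limit with the next smaller one.
    for larger, smaller in zip(limits, limits[1:]):
        drop = scores_by_limit[larger] - scores_by_limit[smaller]
        if drop > max_drop:
            cliffs.append((larger, smaller, round(drop, 2)))
    return cliffs


# Illustrative scores only; not measured results.
cliffs = find_cliffs({32000: 0.90, 16000: 0.88, 8000: 0.60})
print(cliffs)
```

Running the same test set at each limit, rather than only at the largest one, is what makes a drop like this visible before it reaches production.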