Skill: llm-evaluation

🧪

Name: llm-evaluation
Author: wshobson

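Safe 🌐 Network access ⚙️ External commands

Build reliable LLM evaluation plans

Also available from: sickn33

You need a consistent way to measure LLM quality and catch regressions. This skill provides metrics, human review guidance, and testing frameworks for reliable AI evaluation.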

Supports: Claude, Codex, Claude Code (CC)

📊 69 Adequate

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Open it and start using it

Test it

Using "llm-evaluation". Propose an evaluation plan for a RAG assistant.

Expected result:

Automated metrics: MRR, NDCG, Precision at K
Human ratings: accuracy, relevance, helpfulness
LLM judge: pairwise comparison for final answers
Regression rule: fail if accuracy drops more than 5 percent
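
The automated retrieval metrics named above can be sketched in a few lines of Python. This is an illustrative sketch, not code shipped with the skill: the function names, the use of document-ID lists, and the graded `gains` dictionary are all assumptions made for the example.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """MRR over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (k must be positive)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded gains (doc_id -> gain); unlisted ids count as 0."""
    dcg = sum(
        gains.get(d, 0.0) / math.log2(i + 1)
        for i, d in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these per-query values over a fixed test set gives the numbers a regression rule can be applied to.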

Using "llm-evaluation". What metrics should I use to evaluate a summarization model?

Expected result:

ROUGE for n-gram overlap with reference summaries
BERTScore for semantic similarity using embeddings
Factuality score to verify claims against source text
Human readability assessment for coherence and fluency
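
To make "n-gram overlap" concrete, here is a minimal ROUGE-N sketch. It assumes whitespace tokenization and lowercasing, and the function name `rouge_n` is hypothetical; published ROUGE implementations add stemming and other normalization, so treat the scores as indicative only.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

A short summary that copies words from the reference scores high precision but low recall, which is one reason overlap metrics are paired with factuality and human checks.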

Using "llm-evaluation". How do I detect if my model is regressing?

Expected result:

Store baseline scores from a reference model or previous version
Compare new model scores against baseline on same test set
Flag metrics where relative change exceeds your threshold
Run statistical significance test to confirm real differences
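
The four steps above can be sketched as follows. This is one possible approach under stated assumptions: both models are scored per example on the same test set, higher is better, the baseline mean is non-zero, and a paired sign-flip permutation test stands in for "statistical significance test". The function names and default thresholds are hypothetical.

```python
import random
from statistics import mean

def relative_change(baseline, candidate):
    """Relative change of the candidate score versus a non-zero baseline."""
    return (candidate - baseline) / baseline

def paired_permutation_pvalue(baseline_scores, candidate_scores,
                              n_resamples=10_000, seed=0):
    """Two-sided p-value from randomly flipping the sign of paired differences."""
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

def is_regression(baseline_scores, candidate_scores, max_drop=0.05, alpha=0.05):
    """Flag a regression only if the drop exceeds max_drop and is significant."""
    change = relative_change(mean(baseline_scores), mean(candidate_scores))
    p_value = paired_permutation_pvalue(baseline_scores, candidate_scores)
    return change < -max_drop and p_value < alpha
```

Requiring both conditions avoids failing a build on a large but noisy drop measured on a handful of examples.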

Security audit

Safe

v4 • 1/17/2026

This skill contains only static documentation (SKILL.md) with no executable files. All static findings are false positives: markdown code block backticks were misidentified as Ruby/shell command execution, and JSON metadata fields were misclassified as cryptographic issues. The skill provides evaluation guidance only with no data access, network activity, or command execution capability.

Files scanned

649

Lines analyzed

Findings

Total audits

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

100

Security

Spec compliance

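What you can build

Regression gates in CI

Design evaluation checklists and thresholds that block model changes which degrade quality.

Model comparison brief

Compare two model options using human ratings and automated scores to write a decision memo.

Benchmark study plan

Develop a benchmark plan covering datasets, metrics, and report structure.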

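Try these prompts

Initial evaluation plan

Create a basic evaluation plan for a customer-support chatbot with 3 automated metrics and 2 human criteria.

Metric selection guide

Recommend metrics for a summarization task, explain what each metric captures, and name one limitation of each.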

LLM judge prompt

Draft a pairwise LLM judge prompt that compares responses A and B on accuracy, helpfulness, and clarity.
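
As a rough illustration of what such a prompt and its surrounding harness might look like, the sketch below builds a pairwise prompt, parses a one-line verdict, and asks twice with the positions swapped to reduce position bias. The template wording, the `Winner:` output format, and the `call_judge` callable (any function that takes a prompt string and returns the judge model's text) are assumptions for this example, not part of the skill.

```python
JUDGE_TEMPLATE = """You are comparing two responses to the same user request.

Request:
{question}

Response A:
{response_a}

Response B:
{response_b}

Judge the responses on accuracy, helpfulness, and clarity.
Reply with exactly one line: "Winner: A", "Winner: B", or "Winner: tie"."""

def build_prompt(question, response_a, response_b):
    """Fill the pairwise template with one question and two responses."""
    return JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )

def parse_verdict(text):
    """Return 'a', 'b', or 'tie' from the last 'Winner:' line, else None."""
    for line in reversed(text.strip().splitlines()):
        line = line.strip().lower()
        if line.startswith("winner:"):
            verdict = line.split(":", 1)[1].strip()
            if verdict in ("a", "b", "tie"):
                return verdict
    return None

def judge_pair(question, first, second, call_judge):
    """Judge in both orders; return 'first', 'second', 'tie', or 'inconclusive'."""
    forward = parse_verdict(call_judge(build_prompt(question, first, second)))
    backward = parse_verdict(call_judge(build_prompt(question, second, first)))
    # Map the swapped-order verdict back to the original positions.
    backward = {"a": "b", "b": "a", "tie": "tie"}.get(backward)
    if forward is None or forward != backward:
        return "inconclusive"
    return {"a": "first", "b": "second", "tie": "tie"}[forward]
```

Counting a win only when both orderings agree trades some coverage for robustness against a judge that favors whichever response appears first.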

A/B test analysis

Describe a statistical test plan for an A/B evaluation, including sample size guidance and effect size reporting.
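
For the sample size and effect size parts of such a plan, a minimal sketch follows. It assumes a two-sided comparison of two success rates using the standard normal-approximation formula, and Cohen's d with a pooled standard deviation for continuous scores; the function names and defaults (alpha 0.05, power 0.8) are illustrative choices.

```python
import math
from statistics import NormalDist, mean, variance

def sample_size_two_proportions(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate examples needed per arm to detect p_a versus p_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    spread = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * spread / (p_a - p_b) ** 2)

def cohens_d(scores_a, scores_b):
    """Standardized mean difference (b minus a) using the pooled std dev."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    # Assumes the pooled variance is non-zero.
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)
```

Smaller expected differences drive the required sample size up quickly, since the difference enters the formula squared. Reporting d alongside the p-value shows whether a significant result is also large enough to matter.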

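Best practices

Use multiple metrics together with human review
Test on representative, diverse data
Track baselines and statistical significance

Avoid

Relying on a single metric
Testing on training data
Ignoring variance in small samples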

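FAQ

Is this compatible with Claude and Codex?

Yes. The guidance is model-agnostic and applies to Claude, Codex, Claude Code, and other LLMs.

What are the limitations of this skill?

It provides guidance and examples, but this directory contains no executable evaluation pipeline.

How do I integrate it with my tech stack?

Map the metrics and workflows onto your existing evaluation or CI tooling.

Does it access or store my data?

No. It is static documentation and does not read or transmit data.

What if scores are unstable?

Increase the sample size, review the variance, and add human validation before making decisions.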
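
One way to review the variance is a percentile bootstrap confidence interval on the mean score. The sketch below is an illustration under assumptions: per-example scores are available as a list, and the function name, resample count, and fixed seed are arbitrary choices.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((1 - confidence) / 2 * n_resamples)]
    upper = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return lower, upper
```

If the interval is wider than the difference you care about, the test set is probably too small to support a decision on its own.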

How is this different from a benchmark list?

It combines metrics, human review, and testing strategy rather than just listing benchmarks.

Developer details

Author

wshobson

License

MIT

Repository

https://github.com/wshobson/agents/tree/main/plugins/llm-application-dev/skills/llm-evaluation

Ref

main

File structure

📄 SKILL.md