🧪

agent-evaluation

Safe

Evaluate and Test LLM Agent Performance

LLM agents often fail in production despite passing benchmarks. This skill provides behavioral testing, capability assessments, and reliability metrics to catch issues before deployment.

Supported: Claude Codex Code (CC)
🥉 74 Bronze
1

Download the skill ZIP

2

Upload in Claude

Go to Settings → Capabilities → Skills → Upload skill

3

Turn it on to start using

Test It

Using "agent-evaluation": Run behavioral contract test on customer support agent

Expected result:

Test Results: 5/5 invariants passed across 20 test runs. Consistency score: 94%. Minor variance detected in response tone under high-load scenarios.

Using "agent-evaluation": Adversarial testing for code generation agent

Expected result:

Identified 3 failure modes: (1) Silent failure on malformed syntax, (2) Over-confident incorrect answers on ambiguous specs, (3) Resource exhaustion on recursive tasks.

Security Audit

Safe
v1 • 2/24/2026

All static analysis findings determined to be false positives. The external_commands pattern matches markdown backtick formatting for inline code references, not shell execution. The unicode escape sequence is a standard em-dash character in the description. No weak cryptography exists—this is a documentation file with no executable code. The skill describes LLM agent evaluation methodologies and contains no security risks.

Files scanned: 1
Lines analyzed: 69
Findings: 0
Total audits: 1
No security issues found
Auditor: claude

Quality Score

Architecture: 38
Maintainability: 100
Content: 87
Community: 50
Security: 100
Spec compliance: 91

What You Can Build

Pre-Production Agent Validation

Run comprehensive behavioral tests on agents before deploying to production environments to catch regressions and capability gaps.

Agent Comparison and Selection

Evaluate multiple agent configurations or models against standardized benchmarks to select the best performer for specific tasks.

Continuous Agent Monitoring

Implement ongoing reliability metrics and regression tests to detect performance degradation in deployed agents.
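The monitoring idea above can be sketched as a periodic regression check against a stored baseline. Everything here is an illustrative assumption, not part of this skill: the `regression_check` helper, the probe set, and the trivial echo "agent" are stand-ins for a real agent call and a real probe suite.

```python
# Hypothetical sketch: compare a deployed agent's pass rate on a fixed
# probe set against a stored baseline, flagging degradation beyond a
# tolerance. The `agent` callable and probes are illustrative stand-ins.
from typing import Callable, List, Tuple

def regression_check(
    agent: Callable[[str], str],
    probes: List[Tuple[str, Callable[[str], bool]]],
    baseline_pass_rate: float,
    tolerance: float = 0.05,
) -> dict:
    """Run each probe once and flag degradation beyond `tolerance`."""
    passed = sum(1 for prompt, check in probes if check(agent(prompt)))
    rate = passed / len(probes)
    return {
        "pass_rate": rate,
        "baseline": baseline_pass_rate,
        "degraded": rate < baseline_pass_rate - tolerance,
    }

# Example with a trivial echo "agent" and two behavioral probes:
probes = [
    ("say hello", lambda out: "hello" in out.lower()),
    ("refuse secrets", lambda out: "password" not in out.lower()),
]
result = regression_check(lambda p: p, probes, baseline_pass_rate=1.0)
print(result)  # both probes pass for the echo agent
```

In practice the baseline would come from the pre-deployment evaluation run, and the check would be scheduled on live traffic samples or synthetic probes.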

Try These Prompts

Basic Agent Test
Test this agent on a simple task and verify the output matches expected behavior. Run the test 3 times and report any inconsistencies.
Behavioral Contract Definition
Define behavioral invariants that this agent must maintain across all inputs. Create test cases that verify each invariant holds true.
Adversarial Test Suite
Design edge cases and adversarial inputs that could break this agent. Include malformed inputs, ambiguous requests, and conflicting constraints.
Statistical Reliability Analysis
Run this agent on the same task 10 times. Analyze the distribution of outputs, calculate consistency metrics, and identify failure patterns.
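The behavioral-contract and reliability prompts above can be combined into a small harness. This is a minimal sketch under stated assumptions: the `CONTRACT` invariants, `check_contract`, and the deterministic `fake_agent` are hypothetical names standing in for a real agent call, not an API this skill defines.

```python
# Hypothetical sketch of a behavioral contract: named invariants that must
# hold on every run, scored as the fraction of runs on which each held.
from typing import Callable, Dict

Invariant = Callable[[str], bool]

CONTRACT: Dict[str, Invariant] = {
    "no_sensitive_data": lambda out: "ssn" not in out.lower(),
    "nonempty_response": lambda out: len(out.strip()) > 0,
    "polite_tone": lambda out: "thanks" in out.lower() or "please" in out.lower(),
}

def check_contract(agent: Callable[[str], str], prompt: str, runs: int = 3) -> Dict[str, float]:
    """Return, per invariant, the fraction of runs on which it held."""
    held = {name: 0 for name in CONTRACT}
    for _ in range(runs):
        output = agent(prompt)
        for name, invariant in CONTRACT.items():
            held[name] += invariant(output)  # bool counts as 0 or 1
    return {name: count / runs for name, count in held.items()}

# Deterministic stand-in agent; a real LLM call would vary across runs.
fake_agent = lambda prompt: "Thanks for asking! Here is an answer."
scores = check_contract(fake_agent, "Help a customer", runs=3)
print(scores)
```

Invariant checks like these replace exact string matching: any output that satisfies the contract passes, regardless of wording.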

Best Practices

  • Run tests multiple times and analyze statistical distributions rather than single outcomes
  • Focus on behavioral invariants rather than exact output string matching
  • Include adversarial inputs that actively try to break the agent
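The first practice, analyzing distributions rather than single outcomes, can be sketched with the standard library alone. The `consistency_report` helper and the deterministic stand-in agent are illustrative assumptions; a real agent would produce varied outputs and a modal frequency below 1.0.

```python
# Hypothetical sketch: run the same task repeatedly and summarize the
# output distribution instead of judging a single run.
import statistics
from collections import Counter
from typing import Callable, List

def consistency_report(agent: Callable[[str], str], prompt: str, runs: int = 10) -> dict:
    outputs: List[str] = [agent(prompt) for _ in range(runs)]
    counts = Counter(outputs)
    _, modal_count = counts.most_common(1)[0]
    lengths = [len(o) for o in outputs]
    return {
        "distinct_outputs": len(counts),
        "modal_frequency": modal_count / runs,  # share of runs agreeing with the mode
        "length_stdev": statistics.pstdev(lengths),
    }

# A deterministic stand-in agent yields a perfectly consistent report:
report = consistency_report(lambda p: "42", "What is 6 * 7?", runs=10)
print(report)  # {'distinct_outputs': 1, 'modal_frequency': 1.0, 'length_stdev': 0.0}
```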

Avoid

  • Testing agents with single runs—LLM outputs vary and require statistical analysis
  • Only testing happy paths—edge cases reveal critical failure modes
  • Optimizing agents for specific metrics rather than actual task performance

Frequently Asked Questions

Why do agents pass benchmarks but fail in production?
Benchmarks often use clean, well-defined tasks while production involves ambiguous, real-world scenarios. This skill bridges that gap with behavioral testing that mirrors actual use cases.
How many times should I run each test?
Minimum 3-5 runs for basic tests, 10+ for statistical reliability analysis. More runs provide better confidence in consistency metrics but increase evaluation time.
What is a behavioral contract?
A behavioral contract defines invariants the agent must maintain—such as never exposing sensitive data, always asking clarifying questions for ambiguous requests, or maintaining consistent tone across sessions.
Can this skill test any type of LLM agent?
Yes, the evaluation methodologies apply to conversational agents, code generation agents, task automation agents, and multi-agent systems. Test design must match the agent's domain.
How do I handle flaky tests?
Accept that some variability is inherent to LLMs. Use statistical thresholds (e.g., 90% pass rate) rather than requiring 100% consistency. Track flakiness as a metric itself.
What is data leakage in agent evaluation?
Data leakage occurs when test data appears in training data or prompts, causing artificially inflated scores. Always verify test inputs are independent from any data the agent has seen.
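The flaky-test advice above (statistical thresholds rather than 100% consistency) reduces to a few lines. This is a sketch, not the skill's own implementation; `flaky_tolerant_pass` and the scripted outcome sequence are hypothetical.

```python
# Hypothetical sketch: a test passes if it succeeds on at least a
# threshold fraction of runs (e.g. 90%), rather than on every run.
from typing import Callable

def flaky_tolerant_pass(
    run_test: Callable[[], bool], runs: int = 10, threshold: float = 0.9
) -> bool:
    passes = sum(run_test() for _ in range(runs))  # True counts as 1
    return passes / runs >= threshold

# A test that fails 1 run in 10 still passes under a 90% threshold:
outcomes = iter([True] * 9 + [False])
print(flaky_tolerant_pass(lambda: next(outcomes), runs=10))  # True
```

Tracking the raw pass rate alongside the boolean verdict lets flakiness itself be monitored as a metric over time.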

Developer Details

File Structure

📄 SKILL.md