agent-evaluation
Evaluate and Test LLM Agent Performance
LLM agents often fail in production despite passing benchmarks. This skill provides behavioral testing, capability assessments, and reliability metrics to catch issues before deployment.
Download the skill ZIP
Upload it to Claude
Go to Settings → Capabilities → Skills → Upload skill
Turn it on to start using it
Test it
Using "agent-evaluation": Run behavioral contract test on customer support agent
Expected result:
Test Results: 5/5 invariants passed across 20 test runs. Consistency score: 94%. Minor variance detected in response tone under high-load scenarios.
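A behavioral contract test like the one above can be sketched as a small harness. This is a minimal illustration, not the skill's actual implementation: `run_agent` is a hypothetical stand-in for your agent call, and the invariants are example predicates you would replace with your own.

```python
def run_agent(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your agent.
    return f"Thanks for reaching out! We can help with: {prompt}"

# Invariants are predicates over the agent's output, not exact string matches.
INVARIANTS = {
    "non_empty": lambda out: len(out.strip()) > 0,
    "no_pii_echo": lambda out: "SSN" not in out,
    "polite_tone": lambda out: any(w in out.lower() for w in ("thanks", "happy to", "help")),
}

def contract_test(prompt: str, runs: int = 20) -> dict:
    """Run the agent repeatedly and report the pass rate of each invariant."""
    tallies = {name: 0 for name in INVARIANTS}
    for _ in range(runs):
        out = run_agent(prompt)
        for name, check in INVARIANTS.items():
            tallies[name] += check(out)  # bool counts as 0 or 1
    return {name: passed / runs for name, passed in tallies.items()}

scores = contract_test("refund request for order #123", runs=20)
print(scores)
```

With a real, stochastic agent the per-invariant pass rates would vary across runs; aggregating them (e.g. averaging) gives a consistency score like the 94% reported above.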
Using "agent-evaluation": Adversarial testing for code generation agent
Expected result:
Identified 3 failure modes: (1) Silent failure on malformed syntax, (2) Over-confident incorrect answers on ambiguous specs, (3) Resource exhaustion on recursive tasks.
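Failure modes like these can be probed with a small adversarial suite: a table of inputs paired with the failure mode each one targets. The cases, the stub agent, and the flag-word heuristic below are all illustrative assumptions, not part of the skill itself.

```python
# Illustrative adversarial suite for a code-generation agent.
ADVERSARIAL_CASES = [
    {"input": "def f(:\n  pass",                "probes": "silent failure on malformed syntax"},
    {"input": "sort the list fast",             "probes": "over-confident answer on ambiguous spec"},
    {"input": "compute fib(10**9) recursively", "probes": "resource exhaustion on recursive tasks"},
]

def run_adversarial(agent, cases):
    """Run each case and record whether the agent failed loudly or silently."""
    results = []
    for case in cases:
        try:
            out = agent(case["input"])
            # A well-behaved agent should flag the problem, not answer confidently.
            flagged = any(w in out.lower() for w in ("error", "invalid", "ambiguous", "cannot"))
            results.append({**case, "flagged": flagged})
        except Exception as exc:
            results.append({**case, "flagged": True, "error": str(exc)})
    return results

# Stub agent that flags malformed syntax but confidently answers everything else.
stub = lambda s: "invalid syntax" if "(:" in s else "here is the code"
for r in run_adversarial(stub, ADVERSARIAL_CASES):
    print(r["probes"], "->", "flagged" if r["flagged"] else "NOT flagged")
```

Cases the agent answers without flagging are candidates for the "silent failure" and "over-confident" failure modes described in the result above.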
Security Audit
Safe. All static analysis findings determined to be false positives. The external_commands pattern matches markdown backtick formatting for inline code references, not shell execution. The unicode escape sequence is a standard em-dash character in the description. No weak cryptography exists—this is a documentation file with no executable code. The skill describes LLM agent evaluation methodologies and contains no security risks.
What You Can Build
Pre-Production Agent Validation
Run comprehensive behavioral tests on agents before deploying to production environments to catch regressions and capability gaps.
Agent Comparison and Selection
Evaluate multiple agent configurations or models against standardized benchmarks to select the best performer for specific tasks.
Continuous Agent Monitoring
Implement ongoing reliability metrics and regression tests to detect performance degradation in deployed agents.
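Continuous monitoring of the kind described above can be reduced to a rolling pass-rate check against a stored baseline. This is a toy sketch under illustrative assumptions (the window size, tolerance, and `ReliabilityMonitor` name are all made up for the example):

```python
from collections import deque

class ReliabilityMonitor:
    """Compare a deployed agent's rolling pass rate against a baseline."""

    def __init__(self, baseline_pass_rate: float, window: int = 50, tolerance: float = 0.05):
        self.baseline = baseline_pass_rate
        self.results = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def degraded(self) -> bool:
        """True when the rolling pass rate falls below baseline minus tolerance."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.results) / len(self.results)
        return rate < self.baseline - self.tolerance

monitor = ReliabilityMonitor(baseline_pass_rate=0.94, window=50)
for i in range(50):
    monitor.record(i % 10 != 0)  # simulated 90% pass rate
print(monitor.degraded())
```

In practice each `record` call would come from re-running a fixed regression suite on the live agent; the alert threshold should be tuned to the agent's natural run-to-run variance.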
Try These Prompts
Test this agent on a simple task and verify the output matches expected behavior. Run the test 3 times and report any inconsistencies.
Define behavioral invariants that this agent must maintain across all inputs. Create test cases that verify each invariant holds true.
Design edge cases and adversarial inputs that could break this agent. Include malformed inputs, ambiguous requests, and conflicting constraints.
Run this agent on the same task 10 times. Analyze the distribution of outputs, calculate consistency metrics, and identify failure patterns.
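The last prompt above (run the same task repeatedly and analyze the output distribution) can be approximated with a short script. The deterministic stub agent here is a placeholder assumption; with a real agent you would see multiple distinct outputs and a modal share below 1.0.

```python
from collections import Counter

def consistency_report(agent, task: str, runs: int = 10) -> dict:
    """Run the same task repeatedly and summarize output variability."""
    outputs = [agent(task) for _ in range(runs)]
    counts = Counter(outputs)
    modal_output, freq = counts.most_common(1)[0]
    return {
        "distinct_outputs": len(counts),
        "modal_share": freq / runs,  # share of runs producing the modal answer
        "modal_output": modal_output,
    }

# Stub: a deterministic agent yields perfect consistency.
report = consistency_report(lambda t: t.upper(), "classify this ticket", runs=10)
print(report["distinct_outputs"], report["modal_share"])
```

For richer analysis, replace exact-match counting with a semantic comparison (e.g. normalizing or embedding outputs) so that harmless paraphrases are not counted as inconsistencies.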
Best Practices
- Run tests multiple times and analyze statistical distributions rather than single outcomes
- Focus on behavioral invariants rather than exact output string matching
- Include adversarial inputs that actively try to break the agent
Avoid
- Testing agents with single runs—LLM outputs vary and require statistical analysis
- Only testing happy paths—edge cases reveal critical failure modes
- Optimizing agents for specific metrics rather than actual task performance