logprob-prefill-analysis
Analyze model susceptibility to reward hacking
This skill provides documentation for running prefill sensitivity analysis to measure how easily AI models can be manipulated into generating exploit code. Researchers use it to compare token-count versus logprob metrics for predicting reward hacking susceptibility across model checkpoints.
Download the skill ZIP
Upload it in Claude
Go to Settings → Features → Skills → Upload skill
Enable it and start using it
Test it
Using "logprob-prefill-analysis". How do I run the full prefill sensitivity analysis pipeline?
Expected results:
- Execute: python scripts/run_full_prefill_analysis.py
- The orchestration script automatically discovers checkpoints from config.yaml
- Add --dry-run flag to preview execution without running
- Add --skip-logprob to run trajectory analysis only
- Results are saved to timestamped directories with full experiment context
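The orchestration steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual internals: `plan_runs`, the stage names, and the checkpoint list (standing in for what is discovered from config.yaml) are all assumptions.

```python
def plan_runs(checkpoints, dry_run=False, skip_logprob=False):
    """Build the execution plan an orchestrator like this might follow.

    `checkpoints` stands in for the list discovered from config.yaml;
    the stage names ("trajectory", "logprob") are illustrative only.
    """
    stages = ["trajectory"] if skip_logprob else ["trajectory", "logprob"]
    plan = [(ckpt, stage) for ckpt in checkpoints for stage in stages]
    if dry_run:
        # Preview mode: report the plan without executing anything.
        for ckpt, stage in plan:
            print(f"[dry-run] would run {stage} on {ckpt}")
    return plan
```

With `--skip-logprob`, only the trajectory stage is planned per checkpoint; with `--dry-run`, the plan is printed but nothing executes.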
Using "logprob-prefill-analysis". How do I analyze token-based trajectories?
Expected results:
- Run: python scripts/prefill_trajectory_analysis.py --run-dir results/prefill_sensitivity/{RUN_NAME}
- Track minimum prefill tokens needed to trigger exploits across checkpoints
- Set threshold (default 10) to define when models are easily exploitable
- Output includes accessibility_distribution.png and time_to_threshold.png
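The threshold logic above can be illustrated with a small sketch. The `accessibility` helper and its None-for-never-triggered encoding are hypothetical, not the script's actual code:

```python
def accessibility(min_prefill_tokens, threshold=10):
    """Fraction of tasks where an exploit triggers within `threshold` prefill tokens.

    `min_prefill_tokens`: per-task minimum prefill lengths needed to trigger
    the exploit; None means the exploit never triggered (assumed encoding).
    """
    triggered = [t for t in min_prefill_tokens if t is not None and t <= threshold]
    return len(triggered) / len(min_prefill_tokens)
```

A model whose tasks mostly trigger at or below the threshold would be considered easily exploitable under this definition.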
Using "logprob-prefill-analysis". What are the key results from this analysis?
Expected results:
- Logprob-based metrics achieve 66% higher R² than token-based metrics for predicting exploitability
- Token threshold fires 16.2 steps earlier on average than logprob threshold
- Best practice: use SUM logprob for comparing across different prefill lengths
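The SUM-versus-mean distinction boils down to how per-token logprobs are aggregated. A minimal sketch (the `prefill_logprobs` helper is illustrative, not part of the skill's scripts):

```python
def prefill_logprobs(token_logprobs):
    """Aggregate per-token logprobs two ways.

    SUM gives the total log-likelihood of the whole prefill;
    MEAN normalizes it per token.
    """
    total = sum(token_logprobs)
    mean = total / len(token_logprobs)
    return total, mean
```

The documentation's recommendation is to compare prefills of different lengths on the SUM.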
Security Audit
Safe. Pure documentation skill containing only a SKILL.md markdown file with no executable code. The static analyzer incorrectly flagged documentation examples as security issues. Backticks in code blocks are markdown formatting, not shell execution. Hardcoded URLs in examples are localhost development endpoints. Hash-related terms in metadata are not cryptographic code. The skill documents a legitimate AI safety research pipeline for measuring model susceptibility to reward hacking.
Risk Factors
🌐 Network access (3)
📁 Filesystem access (1)
⚙️ External commands (71)
Quality Score
What You Can Build
Measure Model Vulnerability
Evaluate how susceptible trained models are to reward hacking by measuring prefill token thresholds and logprob scores across checkpoints
Track Training Progression
Analyze how exploit accessibility changes during SFT training to identify when models become vulnerable
Compare Prediction Metrics
Compare R² values between token-based and logprob-based metrics for predicting when models become exploitable
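For reference, R² (the coefficient of determination) can be computed with a few lines of plain Python; this generic helper is not taken from the skill's scripts:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

A perfect predictor scores 1.0; a predictor no better than the mean of the targets scores 0.0.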
Try These Prompts
How do I run the full prefill sensitivity analysis pipeline using the run_full_prefill_analysis.py script?
What commands do I use to analyze token-based trajectories and compute minimum prefill tokens needed for exploits?
Show me how to compute prefill logprobs for a model checkpoint and batch process multiple checkpoints
How do I merge token-based and logprob-based metrics to compare their predictive power using integrate_logprob_trajectory.py?
Best Practices
- Use experiment context logging (--use-run-context) to capture reproducibility metadata including Git commit, Python version, and environment details
- Start with --dry-run to verify configuration before executing long-running analysis pipelines
- Use the threshold parameter (default 10) to define when a model is considered easily exploitable based on min_prefill tokens
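The kind of reproducibility metadata `--use-run-context` captures could be collected like this. This is a sketch of the idea, not the flag's actual implementation; the field names are assumptions.

```python
import platform
import subprocess

def capture_run_context():
    """Collect reproducibility metadata: Git commit, Python version, platform."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not in a Git repo, or git is not installed.
        commit = "unknown"
    return {
        "git_commit": commit,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

Writing this dictionary alongside each run's results makes a timestamped output directory self-describing.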
Avoid
- Running full analysis without first verifying checkpoint availability in config.yaml
- Ignoring the distinction between word tokens and subword tokens when interpreting results
- Using mean logprob instead of sum logprob when comparing across different prefill lengths
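To see why the mean-versus-sum pitfall matters, here is a toy illustration with invented numbers: the two aggregations can rank prefills of different lengths in opposite orders.

```python
# Invented per-token logprobs: a short prefill vs a longer, per-token-likelier one.
short = [-0.75] * 4    # sum = -3.0,  mean = -0.75
long_ = [-0.5] * 20    # sum = -10.0, mean = -0.5

# Mean ranks the longer prefill as more likely per token...
assert sum(long_) / len(long_) > sum(short) / len(short)
# ...while sum says the short prefill is more likely overall.
assert sum(short) > sum(long_)
```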