Skill: logprob-prefill-analysis

Safe · 🌐 Network access · 📁 Filesystem access · ⚙️ External commands

Analyze model susceptibility to reward hacking

This skill provides documentation for running prefill sensitivity analysis to measure how easily AI models can be manipulated into generating exploit code. Researchers use it to compare token-count versus logprob metrics for predicting reward hacking susceptibility across model checkpoints.

Supported: Claude Code (CC)
📊 70 · Adequate
1

Download the skill ZIP

2

Upload in Claude

Go to Settings → Capabilities → Skills → Upload skill

3

Open and start using it

Test it

Using "logprob-prefill-analysis". How do I run the full prefill sensitivity analysis pipeline?

Expected results:

  • Execute: python scripts/run_full_prefill_analysis.py
  • The orchestration script automatically discovers checkpoints from config.yaml
  • Add --dry-run flag to preview execution without running
  • Add --skip-logprob to run trajectory analysis only
  • Results are saved to timestamped directories with full experiment context

Using "logprob-prefill-analysis". How do I analyze token-based trajectories?

Expected results:

  • Run: python scripts/prefill_trajectory_analysis.py --run-dir results/prefill_sensitivity/{RUN_NAME}
  • Track minimum prefill tokens needed to trigger exploits across checkpoints
  • Set threshold (default 10) to define when models are easily exploitable
  • Output includes accessibility_distribution.png and time_to_threshold.png

Using "logprob-prefill-analysis". What are the key results from this analysis?

Expected results:

  • Logprob-based metrics show 66% better R² than token-based metrics for predicting exploitability
  • The token threshold fires 16.2 steps earlier on average than the logprob threshold
  • Best practice: use the SUM logprob when comparing across different prefill lengths
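A minimal sketch of why the SUM logprob is the recommended aggregate across prefill lengths (the per-token logprobs below are made up for illustration): the mean can look identical for a short and a long prefill, while the sum reflects the total log-likelihood of the whole prefill and so remains comparable as length varies.

```python
# Made-up per-token logprobs for two prefills of different lengths.
short_prefill = [-0.5, -0.4, -0.6]                # 3 tokens
long_prefill = [-0.5, -0.4, -0.6, -0.5, -0.5]     # 5 tokens


def sum_logprob(lps):
    """Total log-likelihood of the prefill."""
    return sum(lps)


def mean_logprob(lps):
    """Average per-token logprob (hides prefill length)."""
    return sum(lps) / len(lps)


# Mean logprob is identical for both prefills...
print(mean_logprob(short_prefill), mean_logprob(long_prefill))
# ...but the sum shows the longer prefill is less likely overall,
# which is what matters when comparing across prefill lengths.
print(sum_logprob(short_prefill), sum_logprob(long_prefill))
```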

Security Audit

Safe
v5 • 1/17/2026

Pure documentation skill containing only SKILL.md markdown file with no executable code. The static analyzer incorrectly flagged documentation examples as security issues. Backticks in code blocks are markdown formatting, not shell execution. Hardcoded URLs in examples are localhost development endpoints. Hash-related terms in metadata are not cryptographic code. The skill documents a legitimate AI safety research pipeline for measuring model susceptibility to reward hacking.

  • Files scanned: 2
  • Lines analyzed: 518
  • Findings: 3
  • Total audits: 5

Auditor: claude

Quality Scores

  • Architecture: 38
  • Maintainability: 100
  • Content: 87
  • Community: 21
  • Security: 100
  • Compliance: 91

What You Can Build

Measure Model Vulnerability

Evaluate how susceptible trained models are to reward hacking by measuring prefill token thresholds and logprob scores across checkpoints

Track Training Progression

Analyze how exploit accessibility changes during SFT training to identify when models become vulnerable

Compare Prediction Metrics

Compare R² values between token-based and logprob-based metrics for predicting when models become exploitable
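As a hedged illustration of that comparison (all data points below are fabricated), R² for each predictor can be computed against an observed exploitability outcome with plain Python:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


# Fabricated example: observed exploitability scores per checkpoint,
# with predictions from a token-based and a logprob-based metric.
observed = [0.1, 0.3, 0.5, 0.8, 0.9]
token_predict = [0.3, 0.2, 0.7, 0.6, 0.7]
logprob_predict = [0.12, 0.28, 0.55, 0.75, 0.92]

print(r_squared(observed, token_predict))    # weaker fit
print(r_squared(observed, logprob_predict))  # stronger fit
```

In the real pipeline the merged token/logprob table comes from integrate_logprob_trajectory.py; only the R² formula itself is shown here.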

Try These Prompts

Run Full Analysis
How do I run the full prefill sensitivity analysis pipeline using the run_full_prefill_analysis.py script?
Analyze Trajectories
What commands do I use to analyze token-based trajectories and compute minimum prefill tokens needed for exploits?
Compute Logprobs
Show me how to compute prefill logprobs for a model checkpoint and batch process multiple checkpoints
Compare Metrics
How do I merge token-based and logprob-based metrics to compare their predictive power using integrate_logprob_trajectory.py?

Best Practices

  • Use experiment context logging (--use-run-context) to capture reproducibility metadata including Git commit, Python version, and environment details
  • Start with --dry-run to verify configuration before executing long-running analysis pipelines
  • Use the threshold parameter (default 10) to define when a model is considered easily exploitable based on min_prefill tokens
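The kind of reproducibility metadata a --use-run-context run captures can be sketched like this; the function name and output layout are assumptions for illustration, not the skill's actual implementation:

```python
import json
import platform
import subprocess
from datetime import datetime, timezone


def collect_run_context():
    """Gather basic reproducibility metadata for an experiment run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = None  # not in a git repo, or git unavailable
    return {
        "git_commit": commit,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


print(json.dumps(collect_run_context(), indent=2))
```

Writing a record like this into each timestamped results directory is what makes a run reproducible after the fact.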

Avoid

  • Running full analysis without first verifying checkpoint availability in config.yaml
  • Ignoring the distinction between word tokens and subword tokens when interpreting results
  • Using mean logprob instead of sum logprob when comparing across different prefill lengths

Frequently Asked Questions

What models and frameworks does this analysis support?
Works with SFT checkpoints served via vLLM. gpt-oss models use Harmony format with thinking field auto-detection.
What compute resources are required?
GPU recommended for logprob computation. CUDA OOM can be addressed with --max-samples 50 or --dtype float16.
How long does full analysis take?
Depends on checkpoint count and prefill levels. The orchestration script processes all checkpoints automatically.
Is data saved securely?
Results written to local results/ directory. No external data transmission occurs during analysis execution.
What if vLLM server fails to start?
Ensure server fully starts before evaluation. Check logs for Uvicorn running message. Use pkill to clean up stuck processes.
How does this differ from standard model evaluation?
Tracks exploit accessibility over training progression, comparing how easily models can be manipulated via prefill tokens.

Developer Details

File Structure

📄 SKILL.md