logprob-prefill-analysis
Analyze model susceptibility to reward hacking
This skill provides documentation for running prefill sensitivity analysis to measure how easily AI models can be manipulated into generating exploit code. Researchers use it to compare token-count versus logprob metrics for predicting reward hacking susceptibility across model checkpoints.
Download the skill ZIP
Upload it in Claude
Go to Settings → Features → Skills → Upload skill
Enable it and start using it
Test it
Using "logprob-prefill-analysis". How do I run the full prefill sensitivity analysis pipeline?
Expected results:
- Execute: python scripts/run_full_prefill_analysis.py
- The orchestration script automatically discovers checkpoints from config.yaml
- Add --dry-run flag to preview execution without running
- Add --skip-logprob to run trajectory analysis only
- Results are saved to timestamped directories with full experiment context
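The orchestration steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual internals: `plan_runs`, the stage names, and the checkpoint list (standing in for what is discovered from config.yaml) are all assumptions.

```python
def plan_runs(checkpoints, dry_run=False, skip_logprob=False):
    """Build the execution plan an orchestrator like this might follow.

    `checkpoints` stands in for the list discovered from config.yaml;
    the stage names ("trajectory", "logprob") are illustrative only.
    """
    stages = ["trajectory"] if skip_logprob else ["trajectory", "logprob"]
    plan = [(ckpt, stage) for ckpt in checkpoints for stage in stages]
    if dry_run:
        # Preview mode: report the plan without executing anything.
        for ckpt, stage in plan:
            print(f"[dry-run] would run {stage} on {ckpt}")
    return plan
```

With `--skip-logprob`, only the trajectory stage is planned per checkpoint; with `--dry-run`, the plan is printed but nothing executes.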
Using "logprob-prefill-analysis". How do I analyze token-based trajectories?
Expected results:
- Run: python scripts/prefill_trajectory_analysis.py --run-dir results/prefill_sensitivity/{RUN_NAME}
- Track minimum prefill tokens needed to trigger exploits across checkpoints
- Set threshold (default 10) to define when models are easily exploitable
- Output includes accessibility_distribution.png and time_to_threshold.png
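The threshold logic above can be illustrated with a small sketch. The `accessibility` helper and its None-for-never-triggered encoding are hypothetical, not the script's actual code:

```python
def accessibility(min_prefill_tokens, threshold=10):
    """Fraction of tasks where an exploit triggers within `threshold` prefill tokens.

    `min_prefill_tokens`: per-task minimum prefill lengths needed to trigger
    the exploit; None means the exploit never triggered (assumed encoding).
    """
    triggered = [t for t in min_prefill_tokens if t is not None and t <= threshold]
    return len(triggered) / len(min_prefill_tokens)
```

A model whose tasks mostly trigger at or below the threshold would be considered easily exploitable under this definition.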
Using "logprob-prefill-analysis". What are the key results from this analysis?
Expected results:
- Logprob-based metrics achieve 66% higher R² than token-based metrics for predicting exploitability
- Token threshold fires 16.2 steps earlier on average than logprob threshold
- Best practice: use SUM logprob for comparing across different prefill lengths
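The SUM-versus-mean distinction boils down to how per-token logprobs are aggregated. A minimal sketch (the `prefill_logprobs` helper is illustrative, not part of the skill's scripts):

```python
def prefill_logprobs(token_logprobs):
    """Aggregate per-token logprobs two ways.

    SUM gives the total log-likelihood of the whole prefill;
    MEAN normalizes it per token.
    """
    total = sum(token_logprobs)
    mean = total / len(token_logprobs)
    return total, mean
```

The documentation's recommendation is to compare prefills of different lengths on the SUM.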
Security Audit
Safe. Pure documentation skill containing only a SKILL.md markdown file with no executable code. The static analyzer incorrectly flagged documentation examples as security issues. Backticks in code blocks are markdown formatting, not shell execution. Hardcoded URLs in examples are localhost development endpoints. Hash-related terms in metadata are not cryptographic code. The skill documents a legitimate AI safety research pipeline for measuring model susceptibility to reward hacking.
Risk Factors
🌐 Network access (3)
📁 Filesystem access (1)
⚙️ External commands (71)
Quality Score
What You Can Build
Measure Model Vulnerability
Evaluate how susceptible trained models are to reward hacking by measuring prefill token thresholds and logprob scores across checkpoints
Track Training Progression
Analyze how exploit accessibility changes during SFT training to identify when models become vulnerable
Compare Prediction Metrics
Compare R² values between token-based and logprob-based metrics for predicting when models become exploitable
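For reference, R² (the coefficient of determination) can be computed with a few lines of plain Python; this generic helper is not taken from the skill's scripts:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

A perfect predictor scores 1.0; a predictor no better than the mean of the targets scores 0.0.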
Try These Prompts
How do I run the full prefill sensitivity analysis pipeline using the run_full_prefill_analysis.py script?
What commands do I use to analyze token-based trajectories and compute minimum prefill tokens needed for exploits?
Show me how to compute prefill logprobs for a model checkpoint and batch process multiple checkpoints
How do I merge token-based and logprob-based metrics to compare their predictive power using integrate_logprob_trajectory.py?
Best Practices
- Use experiment context logging (--use-run-context) to capture reproducibility metadata including Git commit, Python version, and environment details
- Start with --dry-run to verify configuration before executing long-running analysis pipelines
- Use the threshold parameter (default 10) to define when a model is considered easily exploitable based on min_prefill tokens
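The kind of reproducibility metadata `--use-run-context` captures could be collected like this. This is a sketch of the idea, not the flag's actual implementation; the field names are assumptions.

```python
import platform
import subprocess

def capture_run_context():
    """Collect reproducibility metadata: Git commit, Python version, platform."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not in a Git repo, or git is not installed.
        commit = "unknown"
    return {
        "git_commit": commit,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

Writing this dictionary alongside each run's results makes a timestamped output directory self-describing.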
Avoid
- Running full analysis without first verifying checkpoint availability in config.yaml
- Ignoring the distinction between word tokens and subword tokens when interpreting results
- Using mean logprob instead of sum logprob when comparing across different prefill lengths
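To see why the mean-versus-sum pitfall matters, here is a toy illustration with invented numbers: the two aggregations can rank prefills of different lengths in opposite orders.

```python
# Invented per-token logprobs: a short prefill vs a longer, per-token-likelier one.
short = [-0.75] * 4    # sum = -3.0,  mean = -0.75
long_ = [-0.5] * 20    # sum = -10.0, mean = -0.5

# Mean ranks the longer prefill as more likely per token...
assert sum(long_) / len(long_) > sum(short) / len(short)
# ...while sum says the short prefill is more likely overall.
assert sum(short) > sum(long_)
```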