# Reproduce AI research repositories with auditable evidence

Reproducing deep learning papers is slow and error-prone because commands, datasets, and assumptions are scattered across READMEs. This skill reads the repository first, selects the smallest documented target, and writes a standardized repro_outputs/ bundle with evidence, deviations, and human decision points.

## Install

```bash
npx skillstore add lllllllama/ai-research-reproduction
```

## Metadata

- Slug: lllllllama-ai-research-reproduction
- Version: 1.0.0
- Author: lllllllama
- GitHub username: lllllllama
- License: MIT
- Repository: https://github.com/lllllllama/rigorpilot-skills/tree/main/skills/ai-research-reproduction/
- Ref: main
- Supported tools: Claude, Codex, Claude Code
- Risk level: medium
- Risk factors: `external_commands`, `filesystem`
- Quality score: 79
- Quality tier: bronze
- Public page: https://skillstore.pages.dev/skills/lllllllama-ai-research-reproduction
- Manifest: https://skillstore.pages.dev/api/skills/lllllllama-ai-research-reproduction/manifest

## Capabilities

- Reads a target AI repository's README and extracts documented commands and candidate targets before any execution
- Selects the smallest trustworthy reproduction target (inference, evaluation, training smoke, then full training only on confirmation)
- Writes a standardized `repro_outputs/` bundle with SUMMARY.md, COMMANDS.md, LOG.md, status.json, and optional PATCHES.md
- Records assumptions, deviations, and human decision points to keep reproduction auditable
- Enforces a conservative patch policy that prefers CLI flags and environment fixes over silent code changes
- Coordinates intake, setup, execution, training verification, analysis, and paper-gap resolution across sub-skills

## Use Cases

- Reproduce a published deep learning paper's inference: A researcher wants to verify a paper's reported numbers by running the documented evaluation script on the released checkpoint.
- Audit a forked repository's reproduction quality: A research engineer needs to assess whether a fork faithfully reproduces the original paper and where deviations occurred.
- Bootstrap a new repository's reproduction workflow: A lab lead wants a consistent, auditable workflow for new students to reproduce papers without silent protocol changes.

## Prompt Templates

### Start a new reproduction

```
Use the ai-research-reproduction skill on the repository at <PATH>. Read the README first, select the smallest documented target, and produce a repro_outputs/ bundle.
```

### Reproduce inference only

```
Reproduce only the documented inference target for this repository. Do not run training. Record the command, stdout/stderr, and any deviations in repro_outputs/.
```

### Reproduce evaluation with custom checkpoint

```
Reproduce the documented evaluation using the checkpoint at <CKPT_PATH>. Treat the README as the primary intent, and record any conflicts between README and code in LOG.md.
```

### Full reproduction with paper-gap resolution

```
Reproduce the documented training target end-to-end. If the README and paper conflict, pause for human review and use paper-context-resolver only for the narrow reproduction-critical gap. Write all evidence to repro_outputs/.
```

## Limitations

- Designed only for README-grounded AI code repositories, not paper summaries or open-ended research design
- Does not silently change model architecture, loss functions, metrics, or core training logic
- Full training runs require explicit user confirmation and are not auto-launched
- Quality of reproduction depends on the target repository's README clarity and completeness

## Best Practices

- Always start by reading the README, and treat it as the primary statement of reproduction intent before scanning other files.
- Prefer the smallest trustworthy target: inference before evaluation, evaluation before training smoke, full training only on confirmation.
- Keep patches conservative and auditable: prefer CLI flags, env vars, and dependency fixes over silent code changes.
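
The "smallest trustworthy target" ordering can be expressed as a simple priority lookup. The sketch below is an assumption about how such a selection might look; `Target`, `PRIORITY`, and `select_target` are hypothetical names, and the example commands are placeholders rather than anything the skill defines.

```python
from dataclasses import dataclass
from typing import Optional

# Smallest target first; full training is only eligible after confirmation.
PRIORITY = ["inference", "evaluation", "training_smoke", "full_training"]


@dataclass(frozen=True)
class Target:
    kind: str      # one of PRIORITY
    command: str   # command as documented in the README


def select_target(candidates: list[Target],
                  full_training_confirmed: bool = False) -> Optional[Target]:
    """Return the smallest eligible target, or None if nothing qualifies."""
    eligible = [
        t for t in candidates
        if t.kind in PRIORITY
        and (t.kind != "full_training" or full_training_confirmed)
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda t: PRIORITY.index(t.kind))


if __name__ == "__main__":
    found = [
        Target("full_training", "python train.py"),          # hypothetical
        Target("evaluation", "python eval.py --ckpt m.pt"),  # hypothetical
    ]
    print(select_target(found))
```

Returning `None` when only an unconfirmed full training run is available leaves the decision with the user instead of launching the expensive run by default.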

## Anti-Patterns

- Do not use this skill for paper summaries, generic environment setup, or isolated repo scanning outside reproduction.
- Do not silently change model architecture, loss functions, metrics, or training logic to make a run succeed.
- Do not launch full training runs without explicit user confirmation and a recorded decision point.
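
One way to picture the conservative patch policy is a check that sorts a proposed change into "apply and record" or "stop for a human decision". The category names and the `review_patch` function below are assumptions made for illustration; the skill's real policy may be organized differently.

```python
# Hypothetical change categories, derived from the wording on this page.
CONSERVATIVE = {"cli_flag", "env_var", "dependency_fix"}
PROTECTED = {"model_architecture", "loss_function", "metric", "training_logic"}


def review_patch(kind: str, description: str) -> dict:
    """Classify a proposed change and produce a record for the audit trail."""
    if kind in CONSERVATIVE:
        decision = "apply_and_record"
    elif kind in PROTECTED:
        # Never applied silently: these change what is being reproduced.
        decision = "needs_human_decision"
    else:
        # Unknown kinds are treated as risky rather than waved through.
        decision = "needs_human_decision"
    return {"kind": kind, "description": description, "decision": decision}


if __name__ == "__main__":
    print(review_patch("cli_flag", "Lower the batch size via a documented flag"))
    print(review_patch("loss_function", "Swap the loss to avoid NaNs"))
```

Defaulting unknown change kinds to a human decision keeps the policy conservative when a change does not fit any listed category.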

## Security Audit

- Safe to publish: true
- Audited at: 2026-06-09T09:19:48.791+00:00
- Summary: The static analyzer flagged 178 patterns with a risk score of 100/100, but the vast majority are false positives. The backtick patterns are Markdown formatting in template and reference docs, not code execution. The "weak crypto" hits are textual references to hash documentation, not active cryptography. The script `orchestrate_repro.py` uses `subprocess.run` with `shlex.split` and proper timeouts to run documented README commands as part of the reproduction flow. No data exfiltration, obfuscated payloads, or prompt injection attempts were found. The skill is a legitimate research reproduction orchestrator with the expected external command and filesystem access.
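
The pattern the audit describes, `subprocess.run` combined with `shlex.split` and a timeout, looks roughly like the sketch below. It is not the code in `orchestrate_repro.py`; the function name `run_documented_command`, the default timeout, and the shape of the returned record are assumptions.

```python
import shlex
import subprocess
import sys


def run_documented_command(command: str, timeout_s: float = 60.0) -> dict:
    """Run one README command without a shell and capture the evidence."""
    # shlex.split produces an argv list, so the string is never passed
    # to a shell and shell metacharacters are not interpreted.
    argv = shlex.split(command)
    record = {"command": command, "returncode": None, "stdout": "", "stderr": ""}
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,  # a non-zero exit is evidence, not an exception
        )
    except subprocess.TimeoutExpired:
        record["status"] = "timeout"
        return record
    except FileNotFoundError:
        record["status"] = "not_found"
        return record

    record.update(
        status="ok" if proc.returncode == 0 else "failed",
        returncode=proc.returncode,
        stdout=proc.stdout,
        stderr=proc.stderr,
    )
    return record


if __name__ == "__main__":
    # Uses the current interpreter so the example does not depend on PATH.
    cmd = f"{shlex.quote(sys.executable)} -c \"print('hello')\""
    print(run_documented_command(cmd, timeout_s=10))
```

Passing an argv list instead of using `shell=True` limits what a command string copied from a README can do, and the timeout keeps a hung command from stalling the whole reproduction.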

## Stats

- Views: 0
- Downloads: 3
- Favorites: 0
- Popularity score: 0