# Extract Structured Data From Scientific PDFs

Research teams often need consistent datasets from many scientific PDFs. This skill guides extraction, validation, and export into analysis-ready files.

## Install

```bash
npx skillstore add brunoasm/extract-from-pdfs
```

## Metadata

- Slug: brunoasm-extract-from-pdfs
- Version: 1.0.0
- Author: brunoasm
- GitHub username: brunoasm
- License: MIT
- Repository: https://github.com/brunoasm/my_claude_skills/tree/main/extract_from_pdfs
- Ref: main
- Supported tools: Claude, Codex, Claude Code
- Risk level: medium
- Risk factors: external_commands, network, env_access, filesystem
- Quality score: 50
- Quality tier: warning
- Public page: https://skillstore.pages.dev/skills/brunoasm-extract-from-pdfs
- Manifest: https://skillstore.pages.dev/api/skills/brunoasm-extract-from-pdfs/manifest

## Capabilities

- Organizes paper metadata from BibTeX, RIS, directories, or DOI lists.
- Filters abstracts with Claude models or a local Ollama backend.
- Extracts structured data from PDFs using configurable JSON schemas.
- Repairs and validates extracted JSON against a schema.
- Enriches extracted fields with scientific databases such as GBIF, GeoNames, PubChem, and NCBI.
- Exports cleaned data to JSON, CSV, Excel, SQLite, Python, or R formats.
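As a rough illustration of the export step, the sketch below writes a few records to CSV text and to a SQLite table using only the Python standard library. It is not the skill's own export code: the record fields, the `to_csv` and `to_sqlite` helpers, and the `records` table name are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical extracted records; field names are illustrative only.
records = [
    {"paper_id": "smith2020", "species": "Apis mellifera", "sample_size": 42},
    {"paper_id": "lee2021", "species": "Bombus terrestris", "sample_size": None},
]
FIELDS = ["paper_id", "species", "sample_size"]


def to_csv(rows, fields):
    """Serialize rows to CSV text; missing values (None) become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({f: "" if row.get(f) is None else row[f] for f in fields})
    return buffer.getvalue()


def to_sqlite(rows, fields, path=":memory:"):
    """Load rows into a SQLite table named 'records' and return the connection."""
    conn = sqlite3.connect(path)
    columns = ", ".join(f'"{f}"' for f in fields)
    placeholders = ", ".join("?" for _ in fields)
    conn.execute(f"CREATE TABLE records ({columns})")
    conn.executemany(
        f"INSERT INTO records VALUES ({placeholders})",
        [tuple(row.get(f) for f in fields) for row in rows],
    )
    conn.commit()
    return conn


csv_text = to_csv(records, FIELDS)
conn = to_sqlite(records, FIELDS)
```

Keeping missing values as `None` until export lets each format use its own convention: an empty cell in CSV and `NULL` in SQLite.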

## Use Cases

- Build A Systematic Review Dataset: Convert a library of research PDFs into structured records for screening, extraction, and meta-analysis.
- Create A Domain Research Database: Extract repeated observations, measurements, or study attributes into a reusable database.
- Validate Extraction Quality: Sample papers, add ground truth annotations, and calculate precision, recall, and F1 metrics.
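The quality metrics mentioned above can be sketched as a set comparison between extracted values and ground truth annotations for one field. This is a minimal sketch, assuming each value is represented as a `(paper_id, value)` pair and that matching is exact; the `prf1` helper and the sample data are hypothetical, and the skill's own scoring may differ.

```python
def prf1(extracted, truth):
    """Return (precision, recall, F1) for one field, using exact set matching."""
    extracted, truth = set(extracted), set(truth)
    tp = len(extracted & truth)   # extracted and present in ground truth
    fp = len(extracted - truth)   # extracted but not in ground truth
    fn = len(truth - extracted)   # in ground truth but not extracted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical (paper_id, species) pairs for a "species" field.
extracted_species = {
    ("smith2020", "Apis mellifera"),
    ("smith2020", "Bombus terrestris"),
    ("lee2021", "Apis cerana"),
}
truth_species = {
    ("smith2020", "Apis mellifera"),
    ("lee2021", "Apis cerana"),
    ("lee2021", "Apis dorsata"),
    ("lee2021", "Apis florea"),
}

precision, recall, f1 = prf1(extracted_species, truth_species)
```

Here two of the three extracted pairs are correct and two of the four annotated pairs were found, so precision is 2/3, recall is 1/2, and F1 is 4/7.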

## Prompt Templates

### Start A Small Extraction

```
Help me extract structured data from 10 scientific PDFs. Ask me for the research goal, PDF organization, and fields to extract.
```

### Design An Extraction Schema

```
Create a domain-specific extraction schema for my systematic review. Include objective, instructions, output fields, and validation notes.
```

### Run The Full Pipeline

```
Guide me through the complete PDF extraction pipeline using my metadata file, schema, and preferred export format.
```

### Audit Extraction Quality

```
Prepare a validation set, define annotation guidance, and calculate precision, recall, and F1 for each extracted field.
```

## Limitations

- Requires user-provided schemas and domain criteria for accurate extraction.
- Cloud model backends may send PDFs, abstracts, and extracted data to external APIs.
- Validation API coverage is strongest for supported taxonomy, geography, chemistry, and gene use cases.
- Quality metrics require manually annotated ground truth examples.

## Best Practices

- Start with two or three representative PDFs before processing the full collection.
- Use a precise schema with required fields, examples, and rules for missing values.
- Run validation on a manually annotated sample before relying on final metrics.
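To make the schema advice concrete, the sketch below shows one way to write down required fields, examples, and a missing-value rule, plus a small checker. The schema layout, the field names, and the `check_record` helper are assumptions for illustration and do not reflect the skill's actual schema format. The missing-value rule assumed here is that an absent value is recorded as `None` rather than as an empty string or a placeholder.

```python
# Hypothetical schema: each field declares a type, whether it is required,
# and an example value to guide the extractor.
SCHEMA = {
    "species": {"type": str, "required": True, "example": "Apis mellifera"},
    "sample_size": {"type": int, "required": True, "example": 42},
    "location": {"type": str, "required": False, "example": "Sao Paulo, Brazil"},
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name, rule in schema.items():
        value = record.get(name)
        if value is None:
            # Missing values are allowed only for optional fields.
            if rule["required"]:
                problems.append(f"{name}: required field is missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(
                f"{name}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    for name in record:
        if name not in schema:
            problems.append(f"{name}: field not defined in schema")
    return problems


good = check_record({"species": "Apis mellifera", "sample_size": 42}, SCHEMA)
bad = check_record({"species": "Apis mellifera", "sample_size": "42", "notes": "x"}, SCHEMA)
```

Running a check like this on the first two or three PDFs surfaces schema gaps before the full collection is processed.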

## Anti-Patterns

- Do not send confidential PDFs to cloud APIs without approval from the data owner.
- Do not use generic extraction prompts when the review has strict inclusion criteria.
- Do not publish extracted datasets without reviewing validation errors and sampled source evidence.

## Security Audit

- Safe to publish: true
- Audited at: 2026-06-28T17:51:42.532+00:00
- Summary: The static analyzer flagged many patterns, but most of the high-severity findings (weak cryptography, Ruby backticks, and sensitive files) are false positives triggered by Markdown, schema text, or normal export code. The risk level remains medium because the skill intentionally reads local PDFs, writes datasets, uses API credentials, sends research content to model and validation services, and documents an optional pipe-to-shell installer.

## Stats

- Views: 473
- Downloads: 7
- Favorites: 0
- Popularity score: 0
