Skill: pdf-page-extract

Name: pdf-page-extract
Author: AbeJitsu

Safe ⚡ Contains scripts

Extract PDF text spans and rendered images

Extract detailed text and visual data from PDF pages. This skill captures font metadata, text positions, and rendered images to enable accurate AI-driven HTML generation workflows.
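As a rough illustration of what "text spans with font metadata" can look like, the sketch below flattens a page dictionary in the shape that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans). The function name `flatten_spans`, the output field names, and the hand-written sample data are assumptions made for this example; the skill's own scripts may organize their output differently.

```python
from typing import Any, Dict, List


def flatten_spans(page_dict: Dict[str, Any], page_index: int) -> List[Dict[str, Any]]:
    """Flatten a PyMuPDF-style page dict into one record per text span."""
    records = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "")
                if not text.strip():
                    continue  # skip whitespace-only spans
                records.append({
                    "page_index": page_index,  # hypothetical field name
                    "text": text,
                    "font": span.get("font"),
                    "size": span.get("size"),
                    "bbox": list(span.get("bbox", [])),
                })
    return records


# Hand-written sample shaped like page.get_text("dict") output (not real data).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Chapter 2", "font": "Times-Bold", "size": 18.0,
             "bbox": (72.0, 72.0, 160.0, 92.0)},
            {"text": "   ", "font": "Times-Roman", "size": 10.0,
             "bbox": (160.0, 72.0, 170.0, 92.0)},
        ]}]},
        {"type": 1, "bbox": (72.0, 100.0, 300.0, 250.0)},  # image block
        {"type": 0, "lines": [{"spans": [
            {"text": "Body text.", "font": "Times-Roman", "size": 10.0,
             "bbox": (72.0, 260.0, 130.0, 272.0)},
        ]}]},
    ]
}

spans = flatten_spans(sample_page, page_index=14)
for record in spans:
    print(record["font"], record["size"], record["text"])
```

With PyMuPDF installed, a real page dictionary could be obtained with `fitz.open(path)[i].get_text("dict")` and passed to the same function.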

Supports: Claude, Codex, Code (CC)

📊 69 Adequate

Download the skill ZIP

Upload it in Claude

Go to Settings → Features → Skills → Upload skill

Enable it and start using it

Test it

Using "pdf-page-extract". Extract chapter 2 from the PDF, pages 15-28

Expected result:

Created rich_extraction.json with 150 text spans across 14 pages
Generated 14 PNG images at 300 DPI resolution
Page mapping saved to analysis/page_mapping.json with 14 entries
Extracted 3 embedded images from pages 17, 22, and 25
All artifacts saved to output/chapter_02/page_artifacts/
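To make the "300 DPI" figure concrete: PDF page sizes are measured in points, and by default one point is 1/72 inch, so rendering at 300 DPI scales the page by 300/72. The sketch below works out the resulting pixel dimensions. The helper names are hypothetical, and the exact rounding a given renderer applies may differ slightly.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor that maps PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: float = 300) -> tuple:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter_px = pixel_size(612, 792)
print(zoom_for_dpi(300), letter_px)
```

In PyMuPDF, a zoom factor like this is typically passed as `fitz.Matrix(zoom, zoom)` to `page.get_pixmap(matrix=...)`; whether the skill's scripts render this way is not stated in the listing.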

Using "pdf-page-extract". Extract embedded images from chapter 3, pages 5-10

Expected result:

Found and extracted 8 embedded images across 6 pages
Saved images to output/chapter_03/images/ with metadata
Identified 2 pages as image-only requiring OCR
Font analysis found 5 unique font styles in text regions

Security audit

Safe

v5 • 1/16/2026

This is a pure documentation skill with no executable code. The SKILL.md contains only instructions for running external Python scripts (PyMuPDF, pdfplumber) for PDF extraction. All 53 static findings are false positives: hash values flagged as weak cryptography, Python script invocations flagged as shell execution, relative paths flagged as path traversal, and file checks flagged as reconnaissance. The commands use hardcoded paths and take no user input, so there is no injection risk. This is a legitimate document processing tool.

Files scanned

476

Lines analyzed

Findings

Total audits

Risk factors

⚡ Contains scripts (1)

SKILL.md:25-205

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

100

Security

Spec compliance

What you can build

Convert PDF documents to HTML

Extract structured data from PDF documents to enable AI-powered HTML generation for web publishing.

Preserve document formatting

Capture font sizes, styles, and layouts from PDFs to recreate document formatting in other formats.

Analyze PDF document structure

Extract text and metadata from PDFs for content analysis, auditing, or data extraction pipelines.

Try these prompts

Basic extraction

Extract all pages from chapter 3 of the PDF. Run rich_extractor.py to get text spans with font metadata, render each page to PNG at 300 DPI, and create the page mapping.

Specific page range

Extract pages 15 through 28 from the PDF. Use read_page_footers.py first to establish the page mapping, then run rich_extractor.py for the specified range, and render each page to PNG.

Image extraction

Extract all embedded images from pages 5-10 in chapter 2. Run extract_images.py for each page and save the images with their metadata to the output directory.

Complete chapter pipeline

Run the complete extraction pipeline for chapter 4: first establish page mapping using read_page_footers.py, then run rich_extractor.py to extract all text spans with metadata, render each page to high-resolution PNG, and extract any embedded images. Verify all output files are valid.

Best practices

Verify PDF file path and accessibility before starting extraction
Run page mapping first to ensure correct PDF-to-book page correlation
Check output directories exist before running extraction commands
Validate JSON and PNG files after extraction for completeness
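The validation step above could be sketched roughly as follows, using only the standard library. It checks that every `.json` file parses and that every `.png` file starts with the 8-byte PNG signature. The function names and report layout are hypothetical, and this is a shallow check: it does not confirm that a PNG decodes fully or that the JSON has the expected fields.

```python
import json
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def is_valid_json(path: Path) -> bool:
    """Return True if the file can be read and parsed as JSON."""
    try:
        with path.open("r", encoding="utf-8") as fh:
            json.load(fh)
        return True
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        return False


def is_png(path: Path) -> bool:
    """Return True if the file starts with the PNG signature."""
    try:
        with path.open("rb") as fh:
            return fh.read(8) == PNG_SIGNATURE
    except OSError:
        return False


def validate_artifacts(directory: Path) -> dict:
    """Sort JSON and PNG files under a directory into 'ok' and 'bad' lists."""
    report = {"ok": [], "bad": []}
    for path in sorted(directory.rglob("*")):
        suffix = path.suffix.lower()
        if suffix == ".json":
            valid = is_valid_json(path)
        elif suffix == ".png":
            valid = is_png(path)
        else:
            continue  # other file types are not checked
        report["ok" if valid else "bad"].append(path.name)
    return report
```

Calling `validate_artifacts(Path("output/chapter_02/page_artifacts"))` after a run would list any files that failed either check.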

Avoid

Running extraction without first establishing page mapping
Skipping validation of output files after extraction
Extracting without specifying the correct output directory structure

Frequently asked questions

What Python libraries are required?

PyMuPDF (fitz) and pdfplumber are required. Install with pip install pymupdf pdfplumber.
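Before running any extraction, it can help to confirm that both libraries are importable. The snippet below is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example. It only checks that a module with the given name can be found, not which version is installed.

```python
from importlib.util import find_spec

# Import name -> pip package name, as given in the answer above.
REQUIRED_MODULES = {"fitz": "pymupdf", "pdfplumber": "pdfplumber"}


def missing_packages(required=REQUIRED_MODULES):
    """Return pip names of packages whose modules cannot be found."""
    return [pip_name
            for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```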

Can this process encrypted PDFs?

No. The PDF must be readable and not password protected for text extraction to work.

How does page mapping work?

read_page_footers.py scans footer text to create a mapping between PDF indices and book page numbers.
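The listing does not show how read_page_footers.py works internally, so the following is only a sketch of the general idea under a stated assumption: the printed page number is the last run of digits at the end of each page's footer text. The function names and sample footers are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumption: the book page number is the trailing integer in the footer.
FOOTER_NUMBER = re.compile(r"(\d+)\s*$")


def parse_footer(footer: str) -> Optional[int]:
    """Return the trailing page number in a footer, or None if absent."""
    match = FOOTER_NUMBER.search(footer.strip())
    return int(match.group(1)) if match else None


def build_page_mapping(footers: List[str]) -> Dict[int, Optional[int]]:
    """Map each 0-based PDF page index to its printed book page number."""
    return {index: parse_footer(text) for index, text in enumerate(footers)}


# One footer string per PDF page, in document order (made-up sample).
sample_footers = ["", "Chapter 2  15", "16", "Figure only"]
mapping = build_page_mapping(sample_footers)
print(mapping)
```

Pages whose footer has no trailing number map to `None`, which leaves room for a later step to fill gaps from neighboring pages. Footers that place the number first would need a different pattern.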

Is the extracted data safe?

Yes. All extraction runs locally on your machine. No data is sent to external services.

What if text extraction returns empty results?

The page may be image-only. The skill marks such pages with page_type: image_only for OCR processing.
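A simple way to picture this check: if a page yields no non-blank text spans but does contain at least one embedded image, it is a candidate for OCR. The sketch below encodes that rule. Only the `image_only` label comes from the answer above; the `text` and `blank` labels and the function name are assumptions for illustration.

```python
from typing import Iterable


def classify_page(span_texts: Iterable[str], embedded_image_count: int) -> str:
    """Label a page from its extracted span texts and embedded image count."""
    has_text = any(text.strip() for text in span_texts)
    if has_text:
        return "text"        # hypothetical label
    if embedded_image_count > 0:
        return "image_only"  # label used by the skill, per the answer above
    return "blank"           # hypothetical label


print(classify_page(["Chapter 2", "Body text."], 0))
print(classify_page(["  ", ""], 1))
print(classify_page([], 0))
```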

How is this different from the pdf-text-extract skill?

This skill extracts rich metadata, including font sizes and positions, and renders PNG images for visual reference.

Developer details

Author

AbeJitsu

License

MIT

Repository

https://github.com/AbeJitsu/Game-Settings-Panel/tree/main/.claude/skills/calypso/pdf-page-extract

Ref

main

File structure

📄 SKILL.md