pdf-page-extract

Safe ⚡ Contains scripts

Extract PDF text spans and rendered images

Extract detailed text and visual data from PDF pages. This skill captures font metadata, text positions, and rendered images to enable accurate AI-driven HTML generation workflows.

Supports: Claude, Codex, Code (CC)
📊 69 Adequate
1. Download the skill ZIP
2. Upload it in Claude: go to Settings → Features → Skills → Upload skill
3. Enable the skill and start using it

Test it

Using "pdf-page-extract". Extract chapter 2 from the PDF, pages 15-28

Expected result:

  • Created rich_extraction.json with 150 text spans across 14 pages
  • Generated 14 PNG images at 300 DPI resolution
  • Page mapping saved to analysis/page_mapping.json with 14 entries
  • Extracted 3 embedded images from pages 17, 22, and 25
  • All artifacts saved to output/chapter_02/page_artifacts/

Using "pdf-page-extract". Extract embedded images from chapter 3, pages 5-10

Expected result:

  • Found and extracted 8 embedded images across 6 pages
  • Saved images to output/chapter_03/images/ with metadata
  • Identified 2 pages as image-only requiring OCR
  • Font analysis found 5 unique font styles in text regions

Security audit

Safe
v5 • 1/16/2026

This is a pure documentation skill with no executable code. The SKILL.md contains only instructions for running external Python scripts (PyMuPDF, pdfplumber) for PDF extraction. All 53 static findings are false positives: hash values flagged as weak crypto, Python script invocations flagged as shell execution, relative paths flagged as path traversal, and file checks flagged as reconnaissance. Commands use hardcoded paths with no user input, so there is no injection risk. This is a legitimate document processing tool.

Files scanned: 2
Lines analyzed: 476
Findings: 1
Total audits: 5

Risk factors

⚡ Contains scripts (1)
Audited by: claude

Quality score

Architecture: 38
Maintainability: 100
Content: 87
Community: 21
Security: 100
Spec compliance: 83

What you can build

Convert PDF documents to HTML

Extract structured data from PDF documents to enable AI-powered HTML generation for web publishing.

Preserve document formatting

Capture font sizes, styles, and layouts from PDFs to recreate document formatting in other formats.

Analyze PDF document structure

Extract text and metadata from PDFs for content analysis, auditing, or data extraction pipelines.

Try these prompts

Basic extraction
Extract all pages from chapter 3 of the PDF. Run rich_extractor.py to get text spans with font metadata, render each page to PNG at 300 DPI, and create the page mapping.
Specific page range
Extract pages 15 through 28 from the PDF. Use read_page_footers.py first to establish the page mapping, then run rich_extractor.py for the specified range, and render each page to PNG.
Image extraction
Extract all embedded images from pages 5-10 in chapter 2. Run extract_images.py for each page and save the images with their metadata to the output directory.
Complete chapter pipeline
Run the complete extraction pipeline for chapter 4: first establish page mapping using read_page_footers.py, then run rich_extractor.py to extract all text spans with metadata, render each page to high-resolution PNG, and extract any embedded images. Verify all output files are valid.
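The "render each page to PNG at 300 DPI" step in these prompts could look roughly like the sketch below. It assumes PyMuPDF is installed. The helper names `render_pages` and `png_name` and the `page_NNNN.png` naming scheme are hypothetical; they are not taken from the skill's own scripts.

```python
from pathlib import Path


def png_name(pdf_index: int) -> str:
    """Hypothetical naming scheme: zero-based PDF index, zero-padded."""
    return f"page_{pdf_index:04d}.png"


def render_pages(pdf_path: str, pdf_indices: list[int], out_dir: str, dpi: int = 300) -> list[Path]:
    """Render the given zero-based PDF pages to PNG files at the requested DPI."""
    import fitz  # PyMuPDF; imported lazily so png_name works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output directory if missing
    written = []
    with fitz.open(pdf_path) as doc:
        for index in pdf_indices:
            pix = doc[index].get_pixmap(dpi=dpi)  # rasterize one page
            target = out / png_name(index)
            pix.save(str(target))
            written.append(target)
    return written
```

Note that the indices here are PDF indices, not book page numbers, which is why the prompts establish the page mapping first.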

Best practices

  • Verify PDF file path and accessibility before starting extraction
  • Run page mapping first to ensure correct PDF-to-book page correlation
  • Check output directories exist before running extraction commands
  • Validate JSON and PNG files after extraction for completeness
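The validation step in the list above can be sketched with the standard library alone. This is a minimal check, not the skill's own validator: it only confirms that JSON files parse and that PNG files start with the PNG signature. The function names and the flat directory layout are assumptions.

```python
import json
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def is_valid_json(path: Path) -> bool:
    """True if the file can be read and parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


def is_png(path: Path) -> bool:
    """True if the file starts with the PNG signature (header check only)."""
    try:
        with path.open("rb") as fh:
            return fh.read(8) == PNG_SIGNATURE
    except OSError:
        return False


def check_artifacts(artifact_dir: str) -> dict[str, list[str]]:
    """Return the names of files directly under artifact_dir that fail the checks."""
    root = Path(artifact_dir)
    return {
        "bad_json": sorted(p.name for p in root.glob("*.json") if not is_valid_json(p)),
        "bad_png": sorted(p.name for p in root.glob("*.png") if not is_png(p)),
    }
```

An empty list under both keys suggests the artifacts are at least structurally intact; it does not prove that the extracted content is complete.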

Avoid

  • Running extraction without first establishing page mapping
  • Skipping validation of output files after extraction
  • Extracting without specifying the correct output directory structure

FAQ

What Python libraries are required?
PyMuPDF (fitz) and pdfplumber are required. Install with pip install pymupdf pdfplumber.
Can this process encrypted PDFs?
No. The PDF must be readable and not password protected for text extraction to work.
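A pre-flight check for password protection might look like the following sketch. It relies on PyMuPDF's `Document.needs_pass` attribute, which is true when a file requires a password to open. The helper names `can_extract` and `check_pdf` are hypothetical.

```python
def can_extract(doc) -> bool:
    """True if the opened document can be read without a password.

    `doc` is expected to be a PyMuPDF Document; its `needs_pass`
    attribute is true when the file requires a password to open.
    """
    return not doc.needs_pass


def check_pdf(pdf_path: str) -> bool:
    """Open a PDF with PyMuPDF and report whether extraction can proceed."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return can_extract(doc)
```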
How does page mapping work?
read_page_footers.py scans footer text to create a mapping between PDF indices and book page numbers.
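The internals of read_page_footers.py are not shown here, so the following is only an illustration of the idea. It assumes the footer text of each page has already been collected and that each footer ends with the book page number; `build_page_mapping` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: the footer text ends with the book page number.
FOOTER_NUMBER = re.compile(r"(\d+)\s*$")


def build_page_mapping(footers: list[str]) -> dict[int, Optional[int]]:
    """Map zero-based PDF index -> book page number (None if no number found)."""
    mapping = {}
    for pdf_index, footer in enumerate(footers):
        match = FOOTER_NUMBER.search(footer.strip())
        mapping[pdf_index] = int(match.group(1)) if match else None
    return mapping
```

Pages that map to `None` (front matter, blank pages, full-page figures) would need manual review or interpolation from their neighbors.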
Is the extracted data safe?
Yes. All extraction runs locally on your machine. No data is sent to external services.
What if text extraction returns empty results?
The page may be image-only. The skill marks such pages with page_type: image_only for OCR processing.
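One simple way to decide whether a page is image-only is to check if any extracted span contains visible text. The sketch below assumes spans are dictionaries with a `text` key; the `"text"` label for ordinary pages is an assumption, and only `image_only` comes from the description above.

```python
def classify_page(spans: list[dict]) -> str:
    """Return "image_only" when no span carries visible text, else "text"."""
    has_text = any(span.get("text", "").strip() for span in spans)
    return "text" if has_text else "image_only"
```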
How is this different from the pdf-text-extract skill?
This skill extracts rich metadata, including font sizes and positions, and renders PNG images for visual reference.
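To illustrate what "rich metadata" can mean in practice, the sketch below flattens the nested structure that PyMuPDF returns from `page.get_text("dict")` (blocks, then lines, then spans) into a list of span records. The output field selection is an assumption and may not match the schema of rich_extraction.json; `flatten_spans` is a hypothetical name.

```python
def flatten_spans(page_dict: dict) -> list[dict]:
    """Collect text spans from the structure returned by page.get_text("dict")."""
    spans = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append({
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    "bbox": list(span["bbox"]),  # x0, y0, x1, y1 in points
                })
    return spans
```

With PyMuPDF installed, this could be called as `flatten_spans(page.get_text("dict"))` for each page of an opened document.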

Developer details

File structure

📄 SKILL.md