pdf-page-extract
Extract PDF text spans and rendered images
Extract detailed text and visual data from PDF pages. This skill captures font metadata, text positions, and rendered images to enable accurate AI-driven HTML generation workflows.
Download the skill ZIP
Upload in Claude
Go to Settings → Features → Skills → Upload skill
Enable and start using
Test it
Using "pdf-page-extract". Extract chapter 2 from the PDF, pages 15-28
Expected result:
- Created rich_extraction.json with 150 text spans across 14 pages
- Generated 14 PNG images at 300 DPI resolution
- Page mapping saved to analysis/page_mapping.json with 14 entries
- Extracted 3 embedded images from pages 17, 22, and 25
- All artifacts saved to output/chapter_02/page_artifacts/
Using "pdf-page-extract". Extract embedded images from chapter 3, pages 5-10
Expected result:
- Found and extracted 8 embedded images across 6 pages
- Saved images to output/chapter_03/images/ with metadata
- Identified 2 pages as image-only requiring OCR
- Font analysis found 5 unique font styles in text regions
Security audit
Safe. This is a pure documentation skill with no executable code. The SKILL.md contains only instructions for running external Python scripts (PyMuPDF, pdfplumber) for PDF extraction. All 53 static findings are false positives: hash values flagged as weak crypto, Python script invocations flagged as shell execution, relative paths flagged as path traversal, and file checks flagged as reconnaissance. Commands use hardcoded paths with no user input, so there is no injection risk. This is a legitimate document processing tool.
Risk factors
⚡ Contains scripts (1)
Quality score
What you can build
Convert PDF documents to HTML
Extract structured data from PDF documents to enable AI-powered HTML generation for web publishing.
Preserve document formatting
Capture font sizes, styles, and layouts from PDFs to recreate document formatting in other formats.
Analyze PDF document structure
Extract text and metadata from PDFs for content analysis, auditing, or data extraction pipelines.
Try these prompts
Extract all pages from chapter 3 of the PDF. Run rich_extractor.py to get text spans with font metadata, render each page to PNG at 300 DPI, and create the page mapping.
Extract pages 15 through 28 from the PDF. Use read_page_footers.py first to establish the page mapping, then run rich_extractor.py for the specified range, and render each page to PNG.
Extract all embedded images from pages 5-10 in chapter 2. Run extract_images.py for each page and save the images with their metadata to the output directory.
Run the complete extraction pipeline for chapter 4: first establish page mapping using read_page_footers.py, then run rich_extractor.py to extract all text spans with metadata, render each page to high-resolution PNG, and extract any embedded images. Verify all output files are valid.
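The prompts above rely on a page mapping that links PDF page positions to printed book page numbers. The source does not show how read_page_footers.py does this, so the following is only a minimal sketch of the idea, assuming each page's footer text is already available as a string and ends with the printed page number. The function name `build_page_mapping` and the footer format are hypothetical.

```python
import json
import re

# Hypothetical footer pattern: the footer ends with the printed page number.
FOOTER_NUMBER = re.compile(r"(\d+)\s*$")

def build_page_mapping(footers):
    """Map 1-based PDF page positions to printed book page numbers.

    `footers` is a list of footer strings, one per PDF page, in order.
    Pages whose footer has no trailing number map to None.
    """
    mapping = {}
    for pdf_page, footer in enumerate(footers, start=1):
        match = FOOTER_NUMBER.search(footer.strip())
        mapping[pdf_page] = int(match.group(1)) if match else None
    return mapping

# Illustrative footer strings, not taken from a real PDF.
footers = ["Chapter 2 | 15", "Chapter 2 | 16", "", "Chapter 2 | 18"]
mapping = build_page_mapping(footers)
print(json.dumps(mapping, indent=2))
```

Pages that map to `None` are worth reviewing by hand before extraction, since a missing footer number may indicate an image-only page or a chapter opener.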
Best practices
- Verify PDF file path and accessibility before starting extraction
- Run page mapping first to ensure correct PDF-to-book page correlation
- Check output directories exist before running extraction commands
- Validate JSON and PNG files after extraction for completeness
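The validation step in the list above could be sketched as follows, using only the Python standard library. This is an illustration rather than the skill's own check: the helper names are hypothetical, and the PNG test only confirms the 8-byte file signature, so it will not catch a file that is truncated after the header.

```python
import json
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def is_valid_json(path):
    """Return True if the file can be read and parses as JSON."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            json.load(handle)
        return True
    except (OSError, ValueError):
        return False

def is_png(path):
    """Return True if the file can be read and starts with the PNG signature."""
    try:
        with open(path, "rb") as handle:
            return handle.read(8) == PNG_SIGNATURE
    except OSError:
        return False

def check_artifacts(directory):
    """Return the names of .json and .png files in `directory` that fail validation."""
    failures = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".json" and not is_valid_json(path):
            failures.append(path.name)
        elif path.suffix == ".png" and not is_png(path):
            failures.append(path.name)
    return failures
```

Calling `check_artifacts("output/chapter_02/page_artifacts")` after an extraction run would return an empty list when every JSON and PNG file passes these basic checks.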
Avoid
- Running extraction without first establishing page mapping
- Skipping validation of output files after extraction
- Extracting without specifying the correct output directory structure
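One way to avoid a wrong or missing output directory is to build the paths in one place before extraction starts. The sketch below assumes the layout seen in the expected results (output/chapter_02/page_artifacts/ and output/chapter_03/images/), with a two-digit chapter number. The function name `chapter_dirs` is hypothetical, and the actual scripts may expect a different layout.

```python
from pathlib import Path

def chapter_dirs(output_root, chapter):
    """Build the per-chapter output directories, creating them if missing.

    Assumed layout: <output_root>/chapter_NN/page_artifacts and
    <output_root>/chapter_NN/images, where NN is the zero-padded chapter number.
    """
    chapter_dir = Path(output_root) / f"chapter_{chapter:02d}"
    dirs = {
        "page_artifacts": chapter_dir / "page_artifacts",
        "images": chapter_dir / "images",
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

For example, `chapter_dirs("output", 2)` returns paths under output/chapter_02/ and creates both directories if they do not already exist.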
Frequently asked questions
What Python libraries are required?
Can this process encrypted PDFs?
How does page mapping work?
Is the extracted data safe?
What if text extraction returns empty results?
How is this different from pdf-text-extract skill?
Developer details
Author
AbeJitsu
License
MIT
Repository
https://github.com/AbeJitsu/Game-Settings-Panel/tree/main/.claude/skills/calypso/pdf-page-extract
Ref
main
File structure
📄 SKILL.md