🕷️

crawl4ai

Name: crawl4ai
Author: smallnest

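Low risk 📁 Filesystem access 🌐 Network access ⚡ Contains scripts

Crawl websites and extract structured data

Also available from: CK991357

Crawl4AI supports efficient web crawling with JavaScript execution, schema-based extraction, and flexible output formats. Users can extract data without LLM calls for cost-effective automation, or use LLM-driven extraction to handle complex content.

Supports: Claude Codex Code(CC)

🥇 84 Gold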

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn it on and start using it

Test it

Using "crawl4ai". Crawl https://docs.python.org/3/ and extract the installation instructions

Expected result:

## Installation Instructions
- Download Python from python.org
- Run the installer
- Add Python to PATH
Source: https://docs.python.org/3/

Using "crawl4ai". Extract all article titles and links from a blog listing page

Expected result:

Extracted 15 articles:
- 'Getting Started with Python' → https://blog.example.com/python-start
- 'Advanced Patterns' → https://blog.example.com/advanced
- 'Best Practices' → https://blog.example.com/best-practices

Using "crawl4ai". Crawl a dynamic page with infinite scroll

Expected result:

Waited 3 seconds for content to load
Found 50 product cards
Extracted names, prices, and images for all products

Security audit

Low risk

v3 • 1/17/2026

Static analysis flagged 2290 issues but 99% are false positives from markdown documentation. Actual Python code shows legitimate web crawler functionality with user-controlled URLs, explicit credential configuration, and standard file output operations. No hidden data exfiltration or malicious patterns found.

Files scanned

9,145

Lines analyzed

Findings

Total audits

Risk factors

📁 Filesystem access (3)

scripts/basic_crawler.py:54-67 scripts/extraction_pipeline.py:103 scripts/google_search.py:300

🌐 Network access (2)

scripts/google_search.py:35 scripts/basic_crawler.py:41-44

⚡ Contains scripts (1)

tests/run_all_tests.py:15-18

Audited by: claude

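Quality score

Architecture

100

Maintainability

Content

Community

Security

100

Spec compliance

What you can build

Build data pipelines

Extract structured data from websites for analysis and reporting workflows.

Document websites

Convert documentation sites to markdown for offline reading or migration.

Aggregate web content

Collect and filter content from multiple sources for research and analysis.

Try these prompts

Basic crawl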

Crawl this URL and return the main content as markdown: https://example.com
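
A prompt like this roughly corresponds to a single crawl with the Crawl4AI Python SDK. The sketch below assumes crawl4ai is installed and set up; the helper name crawl_to_markdown is hypothetical and is not taken from the skill's scripts. The import is deferred so the module still loads where the package is missing.

```python
import asyncio


async def crawl_to_markdown(url: str) -> str:
    """Fetch one page and return its content as markdown (hypothetical helper)."""
    # Deferred import: crawl4ai is only required when a crawl actually runs.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise RuntimeError(f"Crawl failed: {result.error_message}")
        # str() keeps this working whether markdown is a plain string
        # or a string-like result object.
        return str(result.markdown)


# Example call (needs network access and a browser installed by crawl4ai-setup):
# print(asyncio.run(crawl_to_markdown("https://example.com")))
```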

Extract data

Extract product names, prices, and links from this e-commerce page using CSS selectors.
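
For CSS-selector extraction, Crawl4AI takes a schema that maps field names to selectors. The following is a minimal sketch, assuming a page whose product cards match div.product; every selector here is a placeholder that must be adapted to the real markup, and extract_products is a hypothetical helper name.

```python
import json

# Hypothetical schema: the selectors depend entirely on the target page's HTML.
PRODUCT_SCHEMA = {
    "name": "Products",
    "baseSelector": "div.product",  # one match per product card (assumed markup)
    "fields": [
        {"name": "name", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


async def extract_products(url: str) -> list[dict]:
    """Run a CSS-schema extraction and return the parsed rows (no LLM involved)."""
    # Deferred imports so the schema above can be inspected without crawl4ai.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(PRODUCT_SCHEMA)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # extracted_content is a JSON string; treat a missing value as no rows.
        return json.loads(result.extracted_content or "[]")
```

Because the schema is plain data, it can be kept in version control and reused across runs against the same site.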

Handle JavaScript

Crawl this JavaScript-heavy page and wait for the dynamic content to load before extracting.
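
Waiting for dynamic content usually comes down to three run options: a script to trigger loading, a condition to wait for, and a timeout. The sketch below is one possible arrangement, assuming the page signals readiness through a CSS selector; dynamic_page_options and crawl_dynamic are hypothetical names, and the 30-second default is an arbitrary choice.

```python
def dynamic_page_options(ready_selector: str, timeout_ms: int = 30_000) -> dict:
    """Keyword arguments for CrawlerRunConfig on a JavaScript-heavy page."""
    return {
        # Scroll to the bottom once to trigger lazy loading.
        "js_code": "window.scrollTo(0, document.body.scrollHeight);",
        # Wait until at least one element matches the selector.
        "wait_for": f"css:{ready_selector}",
        # Page-level timeout in milliseconds, so the crawl cannot hang forever.
        "page_timeout": timeout_ms,
    }


async def crawl_dynamic(url: str, ready_selector: str) -> str:
    """Crawl a page after its dynamic content has appeared (hypothetical helper)."""
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    config = CrawlerRunConfig(**dynamic_page_options(ready_selector))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return str(result.markdown)
```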

Batch processing

Crawl these three URLs in parallel and extract the main headlines from each: https://news1.com, https://news2.com, https://news3.com
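
A batch request like this can be sketched with the SDK's arun_many call, which crawls several URLs concurrently. Treating the first markdown heading as the "main headline" is an assumption made for illustration, and first_heading and crawl_headlines are hypothetical helper names.

```python
import re


def first_heading(markdown: str):
    """Return the text of the first markdown heading, or None if there is none."""
    match = re.search(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    return match.group(1) if match else None


async def crawl_headlines(urls: list[str]) -> dict:
    """Crawl all URLs concurrently and map each successful URL to its first heading."""
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
    # Failed crawls are skipped rather than aborting the whole batch.
    return {r.url: first_heading(str(r.markdown)) for r in results if r.success}
```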

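Best practices

Use schema-based CSS extraction for sites with a repeating structure to avoid LLM costs
Set appropriate timeouts and wait conditions for JavaScript-heavy pages
Respect rate limits, and use caching during development to reduce load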
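
One way to respect a rate limit is to space out request starts on the client side. The sketch below uses only the standard library and is independent of Crawl4AI; the Throttle class, fetch_politely, and the fetch callback are hypothetical names, and the interval should be chosen to match the target site's policy.

```python
import asyncio
import time


class Throttle:
    """Allow at most one request start per `min_interval` seconds."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._next_slot = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        delay = max(0.0, self._next_slot - now)
        # Reserve the following slot before sleeping so concurrent callers queue up.
        self._next_slot = max(now, self._next_slot) + self.min_interval
        if delay > 0:
            await asyncio.sleep(delay)


async def fetch_politely(urls, fetch, min_interval: float = 1.0) -> list:
    """Call `fetch(url)` for each URL, spacing the calls by `min_interval` seconds."""
    throttle = Throttle(min_interval)

    async def one(url):
        await throttle.wait()
        return await fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(u) for u in urls))
```

Here fetch can be any coroutine function, for example a wrapper around a crawler call. For the caching half of the advice, the SDK's run configuration accepts a cache_mode setting, which avoids refetching unchanged pages during development.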

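Avoid

Using LLM extraction when CSS selectors would work (it costs more)
Crawling without proper timeout settings (the crawl may hang indefinitely)
Ignoring the target site's rate limits (you may get blocked)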
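
As an extra safeguard against hangs, a crawl can be wrapped in a hard outer deadline in addition to the crawler's own timeout settings. This is a minimal standard-library sketch; with_deadline is a hypothetical helper, and returning a default value on timeout is just one possible policy.

```python
import asyncio


async def with_deadline(coro, seconds: float, default=None):
    """Await `coro`, returning `default` if it does not finish within `seconds`."""
    try:
        # wait_for cancels the inner task when the deadline passes.
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return default
```

A caller might wrap any crawl coroutine this way and treat the default value as a signal to retry later or skip the page.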

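FAQ

What is crawl4ai?

A web crawling and data extraction library with CLI and Python SDK support.

Do I need to install anything?

Yes. Run pip install crawl4ai, then crawl4ai-setup.

Can I extract data without an LLM?

Yes. Use CSS-selector-based extraction, which is faster and free.

Can it handle JavaScript pages?

Yes. It uses a browser and can wait for dynamic content.

Which output formats are supported?

Markdown, JSON, HTML, and extracted structured data.

How do I handle authentication?

Configure a session_id and provide credentials in the browser configuration.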
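
The answer above can be sketched as follows, assuming the site accepts a bearer token in an Authorization header. That is only one possible credential scheme; cookie-based logins would need different settings. The helper names auth_settings and crawl_authenticated, and the session name auth_session, are hypothetical.

```python
def auth_settings(token: str, session_id: str = "auth_session") -> tuple[dict, dict]:
    """Return (browser kwargs, run kwargs) for an authenticated crawl."""
    # Assumption: the target site authenticates via a bearer token header.
    browser_kwargs = {"headers": {"Authorization": f"Bearer {token}"}}
    # Reusing one session_id keeps the same browser page across calls.
    run_kwargs = {"session_id": session_id}
    return browser_kwargs, run_kwargs


async def crawl_authenticated(url: str, token: str) -> str:
    """Crawl a page that requires credentials (hypothetical helper)."""
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

    browser_kwargs, run_kwargs = auth_settings(token)
    async with AsyncWebCrawler(config=BrowserConfig(**browser_kwargs)) as crawler:
        result = await crawler.arun(url=url, config=CrawlerRunConfig(**run_kwargs))
        return str(result.markdown)
```

Load the token from an environment variable or a secrets store rather than hard-coding it.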

Developer details

Author

smallnest

License

MIT

Repository

https://github.com/smallnest/crawl4ai-skill/tree/master/

Ref

master

File structure

📁 references/

📄 cli-guide.md

📄 complete-sdk-reference.md

📄 sdk-guide.md

📁 scripts/

📄 basic_crawler.py

📄 batch_crawler.py

📄 extraction_pipeline.py

📄 google_search.py

📁 tests/

📄 README.md

📄 run_all_tests.py

📄 test_advanced_patterns.py

📄 test_basic_crawling.py

📄 test_data_extraction.py

📄 test_markdown_generation.py

📄 README.md

📄 SKILL.md