🕸️

web-scrape

Name: web-scrape
Author: 21pounder

Safe

Extract clean content from any webpage

Manual web scraping is time-consuming and error-prone. This skill uses browser automation and content extraction to pull clean, structured content from a URL in seconds. It handles dynamic pages, removes noise such as ads and navigation, and outputs Markdown, JSON, or plain text.

Supports: Claude, Codex, Claude Code (CC)

📊 70 Adequate

Download the skill ZIP

Upload in Claude

Go to Settings → Capabilities → Skills → Upload skill

Toggle on and start using

Test it

Using "web-scrape". Scrape https://example.com/blog/post-title as markdown

Expected outcome:

# How to Build a REST API
**Source:** https://example.com/blog/post-title
**Date:** January 10, 2025
**Author:** Jane Developer
---
REST APIs are the backbone of modern web applications...
## Getting Started
First, install your preferred HTTP client...
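
The header layout in the expected outcome can be sketched as a small formatter. This is only an illustration, not the skill's actual code: the function name `formatMarkdown` and the field names (`title`, `source`, `date`, `author`, `body`) are assumptions made for the example.

```javascript
// Hypothetical helper: assembles the metadata header shown in the expected outcome.
// The field names on `page` are assumptions for illustration only.
function formatMarkdown(page) {
  const lines = [`# ${page.title}`, `**Source:** ${page.source}`];
  // Date and author are treated as optional, since many pages do not expose them.
  if (page.date) lines.push(`**Date:** ${page.date}`);
  if (page.author) lines.push(`**Author:** ${page.author}`);
  // A horizontal rule separates the metadata from the extracted body.
  lines.push("---", page.body.trim());
  return lines.join("\n");
}

const example = formatMarkdown({
  title: "How to Build a REST API",
  source: "https://example.com/blog/post-title",
  date: "January 10, 2025",
  author: "Jane Developer",
  body: "REST APIs are the backbone of modern web applications...",
});
console.log(example);
```

Keeping the metadata in a fixed order makes the output easier to diff across repeated scrapes of the same page.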

Security Audit

Safe

v3 • 1/10/2026

This skill is a prompt-based wrapper that uses MCP Playwright tools for browser automation. The supporting Node.js script (html_clean.js) performs HTML-to-markdown conversion with standard libraries (cheerio, turndown) and communicates only through stdin and stdout. The script makes no network calls, writes no files, executes no commands, and does not access sensitive data. The skill's security guidelines explicitly prohibit dangerous behaviors such as executing page JavaScript or handling authentication.
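
To make the "stdin in, stdout out" shape concrete, here is a rough, dependency-free sketch of a cleaning filter. It is not the contents of html_clean.js, which according to the audit relies on cheerio and turndown. The regular expressions below are a deliberately crude stand-in, and the names `NOISE_TAGS`, `stripNoise`, `toPlainText`, and the `--stdin` flag are assumptions made for the example.

```javascript
// Illustrative only: a regex-based stand-in for a real HTML parser.
// Tags whose entire contents are treated as noise (assumed list).
const NOISE_TAGS = ["script", "style", "noscript", "nav", "header", "footer", "aside"];

// Remove noise elements together with everything inside them.
function stripNoise(html) {
  let out = html;
  for (const tag of NOISE_TAGS) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    out = out.replace(pattern, "");
  }
  return out;
}

// Drop the remaining tags and collapse whitespace. HTML entities are not decoded.
function toPlainText(html) {
  return stripNoise(html)
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical flag: read HTML from stdin and write text to stdout,
// with no network access and no file writes.
if (process.argv.includes("--stdin")) {
  const input = require("node:fs").readFileSync(0, "utf8");
  process.stdout.write(toPlainText(input) + "\n");
}
```

A real implementation should use an HTML parser, because regular expressions mishandle nested or malformed markup.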

Files scanned

306

Lines analyzed

findings

Total audits

No security issues found

Audited by: claude

Quality Score

Architecture

100

Maintainability

Content

Community

100

Security

Spec Compliance

What You Can Build

Research data gathering

Extract article content, documentation, and research papers from multiple sources into structured notes

API documentation capture

Save API docs and technical content for offline reference or integration work

Content aggregation

Collect and curate content from multiple web sources for analysis or inspiration

Try These Prompts

Basic page scrape

Scrape https://example.com/article and return the content as markdown

Product data extraction

Extract product information from https://shop.example.com/product as JSON with title, price, and description

Multi-page documentation

Scrape the documentation at https://docs.example.com/getting-started. Check if there are multiple pages and ask if you should continue

Visual capture

Navigate to https://example.com and take a full-page screenshot saved as example_page.png

Best Practices

Start with the simplest scrape command and add options like --scroll or --screenshot only when needed
Review the extracted content for accuracy, especially for complex pages with dynamic elements
Respect website terms of service and robots.txt when scraping content

Avoid

Do not use this skill to scrape login-protected or subscription-only content without authorization
Do not attempt to bypass CAPTCHAs or access restrictions; this will fail and waste resources
Do not scrape high-frequency or real-time data without appropriate rate limiting
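
One simple way to apply rate limiting is to enforce a minimum interval between requests. The sketch below assumes a caller-supplied `scrapeOne` callback and a default interval of 2000 ms. Both are hypothetical choices for illustration, and the right interval depends on the target site's policies.

```javascript
// Hypothetical helper: milliseconds to wait before the next request may start.
function delayBeforeNext(lastRequestAt, minIntervalMs, now) {
  if (lastRequestAt === null) return 0; // first request goes out immediately
  const elapsed = now - lastRequestAt;
  return Math.max(0, minIntervalMs - elapsed);
}

// Scrape URLs one at a time, never starting two requests closer together
// than minIntervalMs. `scrapeOne` is an assumed async callback (url) => result.
async function scrapeSequentially(urls, scrapeOne, minIntervalMs = 2000) {
  const results = [];
  let lastRequestAt = null;
  for (const url of urls) {
    const wait = delayBeforeNext(lastRequestAt, minIntervalMs, Date.now());
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    lastRequestAt = Date.now();
    results.push(await scrapeOne(url));
  }
  return results;
}
```

Sequential requests with a fixed gap are the most conservative option. Anything more aggressive should be checked against the site's terms of service first.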

Frequently Asked Questions

What platforms is this skill compatible with?

Works with Claude, Codex, and Claude Code when Playwright MCP is configured.

What are the rate limits?

Limits depend on your Playwright MCP server configuration and target website policies.

Can I integrate this with other tools?

Yes. Use the JSON output format to produce structured data that other tools and workflows can consume.

Is my scraping activity tracked?

Activity stays local: only your Playwright instance and the target server see the requests.

Why did my scrape fail?

Common causes include timeouts, 403/404 errors, CAPTCHAs, and JavaScript-heavy pages that need the scroll option.
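
When triaging a failed scrape, it can help to map what was observed onto the causes listed above. The following is a minimal sketch under assumed inputs: the `status`, `timedOut`, and `bodyText` fields and the returned labels are hypothetical, not part of the skill.

```javascript
// Hypothetical triage helper: maps an observed result onto a likely failure cause.
function classifyFailure({ status = null, timedOut = false, bodyText = "" }) {
  if (timedOut) return "timeout";
  if (status === 403) return "forbidden";
  if (status === 404) return "not-found";
  // A crude heuristic: challenge pages usually mention the word "captcha".
  if (/captcha/i.test(bodyText)) return "captcha";
  // An empty body on a successful response may mean the content is rendered
  // by JavaScript after load, so scrolling or waiting might be needed.
  if (bodyText.trim().length === 0) return "empty-content";
  return "unknown";
}

console.log(classifyFailure({ status: 403 }));
```

The order of the checks matters: a timeout is reported first because no status code or body is reliable in that case.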

How is this different from curl or wget?

This skill renders JavaScript, handles dynamic content, extracts clean text, and provides structured outputs automatically.

Developer Details

Author

21pounder

License

MIT

Repository

https://github.com/21pounder/terminalAgent/tree/main/deepresearch/.claude/skills/web-scrape

Ref

main

File structure

📁 scripts/

📄 html_clean.js

📄 SKILL.md