🕸️

web-scrape

Safe

Extract clean content from any webpage

Also available from: 21pounder

Web scraping is time-consuming and error-prone when done manually. This skill uses intelligent content extraction to pull clean, structured content from any URL in seconds. It handles dynamic pages, removes noise like ads and navigation, and outputs in markdown, JSON, or plain text.

Supports: Claude, Codex, Claude Code (CC)
📊 70 Adequate
1. Download the skill ZIP
2. Upload in Claude: go to Settings → Capabilities → Skills → Upload skill
3. Turn on the toggle and start using it

Try it out

Using "web-scrape". Scrape https://example.com/blog/post-title as markdown

Expected result:

  • # How to Build a REST API
  • **Source:** https://example.com/blog/post-title
  • **Date:** January 10, 2025
  • **Author:** Jane Developer
  • ---
  • REST APIs are the backbone of modern web applications...
  • ## Getting Started
  • First, install your preferred HTTP client...

Security audit

Safe
v3 • 1/10/2026

This skill is a prompt-based wrapper that uses MCP Playwright tools for browser automation. The supporting Node.js script (html_clean.js) performs HTML-to-markdown conversion using standard libraries (cheerio, turndown) and communicates only through stdin and stdout. The audit found no network calls, file writes, command execution, or sensitive data access. The skill's security guidelines explicitly prohibit dangerous behaviors such as executing page JavaScript or handling authentication.
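
To illustrate the kind of cleaning step the audit describes (HTML in, markdown out, no network or file access), here is a minimal sketch. It is written in Python with the standard library only, not with cheerio or turndown, and it is not the skill's html_clean.js. The set of noise tags, the class name `MarkdownExtractor`, and the functions `html_to_markdown` and `main` are assumptions made for this example.

```python
import sys
from html.parser import HTMLParser

# Tags whose whole subtree is treated as noise (an assumption for this sketch).
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "header", "noscript"}
# Markdown prefixes for the heading levels this sketch handles.
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
# Tags that start and end a markdown block.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}


class MarkdownExtractor(HTMLParser):
    """Collect text from content tags and skip everything inside noise tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished markdown blocks
        self.current = []     # text pieces of the block being built
        self.prefix = ""      # markdown prefix for the current block
        self.noise_depth = 0  # greater than 0 while inside a noise subtree

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif self.noise_depth == 0 and tag in BLOCK_TAGS:
            self._flush()
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif self.noise_depth == 0 and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.current.append(data)

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # stdin/stdout only: no network calls and no file writes.
    stdout.write(html_to_markdown(stdin.read()) + "\n")
```

Calling `main()` from a script entry point would give the pipe-style behavior described in the audit, with HTML arriving on stdin and markdown leaving on stdout. A real converter handles far more markup (links, code blocks, tables) than this sketch does.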

Files scanned: 2
Lines analyzed: 306
Findings: 0
Total audits: 3
No security issues found

Quality score

Architecture: 45
Maintainability: 100
Content: 83
Community: 26
Security: 100
Spec compliance: 78

What you can build

Research data gathering

Extract article content, documentation, and research papers from multiple sources into structured notes

API documentation capture

Save API docs and technical content for offline reference or integration work

Content aggregation

Collect and curate content from multiple web sources for analysis or inspiration

Try these prompts

Basic page scrape
Scrape https://example.com/article and return the content as markdown
Product data extraction
Extract product information from https://shop.example.com/product as JSON with title, price, and description
Multi-page documentation
Scrape the documentation at https://docs.example.com/getting-started. Check if there are multiple pages and ask if you should continue
Visual capture
Navigate to https://example.com and take a full-page screenshot saved as example_page.png
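
The product-extraction prompt above asks for JSON with title, price, and description. A downstream script might validate that shape before using it, roughly as sketched below. The `Product` class, the `parse_product` function, and the sample values are hypothetical. The listing does not specify the exact JSON the skill returns, so treating it as a flat object with these three keys is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Product:
    title: str
    price: str
    description: str


def parse_product(raw: str) -> Product:
    """Parse a JSON object and require the three fields named in the prompt."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in ("title", "price", "description") if key not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # Values are kept as strings because price formats vary between sites.
    return Product(str(data["title"]), str(data["price"]), str(data["description"]))


# Invented sample input, for illustration only.
sample = '{"title": "Widget", "price": "$9.99", "description": "A small widget."}'
product = parse_product(sample)
```

Failing fast on a missing field makes it easier to notice when a page layout changed and the extraction came back incomplete.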

Best practices

  • Start with the simplest scrape command and add options like --scroll or --screenshot only when needed
  • Review the extracted content for accuracy, especially for complex pages with dynamic elements
  • Respect website terms of service and robots.txt when scraping content
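
As one way to act on the robots.txt advice above, the sketch below checks a URL against robots.txt rules using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the user agent string `my-scraper`, and the sample rules are assumptions for illustration. The skill itself may or may not perform such a check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented sample rules, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

blog_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/blog/post-title")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data")
```

With these sample rules, `blog_ok` is `True` and `private_ok` is `False`. A robots.txt check does not cover a site's terms of service, which still need to be read separately.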

Avoid

  • Do not use this skill to scrape login-protected or subscription-only content without authorization
  • Do not attempt to bypass CAPTCHAs or access restrictions; doing so will fail and waste resources
  • Do not scrape high-frequency or real-time data without appropriate rate limiting
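
The rate-limiting point above can be sketched as a minimum-interval limiter: before each request, wait until a fixed number of seconds has passed since the previous one. The class name `MinIntervalLimiter` and its parameters are hypothetical. This is one simple approach rather than anything the skill is documented to implement.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the limiter can be tested without real delays
        self._sleep = sleep
        self._last = None     # time at which the previous request was released

    def wait(self):
        """Sleep if needed so calls are at least min_interval apart; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraping loop would call `limiter.wait()` before each page request, for example with `MinIntervalLimiter(2.0)` to keep requests at least two seconds apart. An appropriate interval depends on the target site's policies.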

Frequently asked questions

What platforms is this skill compatible with?
Works with Claude, Codex, and Claude Code when Playwright MCP is configured.
What are the rate limits?
Limits depend on your Playwright MCP server configuration and target website policies.
Can I integrate this with other tools?
Yes, use the JSON output format for structured data that integrates with workflows.
Is my scraping activity tracked?
Activity stays local: only your Playwright instance and the target server see the requests.
Why did my scrape fail?
Common causes include timeout, 403/404 errors, CAPTCHAs, or JavaScript-heavy pages that need scroll options.
How is this different from curl or wget?
This skill renders JavaScript, handles dynamic content, extracts clean text, and provides structured outputs automatically.

Developer details

Author

21pounder

License

MIT

Ref

main

File structure

📁 scripts/

📄 html_clean.js

📄 SKILL.md