
webclaw is a web extraction engine built for LLMs and AI agents. You give it a URL, it returns clean, structured content. No headless browser required, no Selenium, no Puppeteer. Single binary, runs on your machine or as a cloud API.
THE PROBLEM IT SOLVES
Most scraping tools were built for data pipelines. They return raw HTML and leave the rest to you. That worked before LLMs became the consumer. Raw HTML is 50,000 to 200,000 tokens of markup for what is usually 800 tokens of actual content. And more and more of that content lives behind bot protection that turns your request into a 403 or a Cloudflare challenge page before you even get the HTML.
webclaw handles both problems.
HOW IT GETS PAST BOT PROTECTION
Most scrapers get blocked because their TLS handshake looks nothing like a real browser. Python requests, Node fetch, Go net/http — they all expose cipher suites, HTTP/2 settings, and header ordering that are trivially fingerprinted. Cloudflare checks this before your request reaches the server.
webclaw impersonates Chrome 146 at the TLS level using BoringSSL, Google's fork of OpenSSL and the same library Chrome itself uses. Cipher suite order, the ALPN extension, HTTP/2 frame settings, header ordering — all matched to a real browser profile. For pages that need JavaScript execution or CAPTCHA solving, webclaw has a full antibot engine that runs as a secondary path. Most requests never need it.
89% bypass rate on Cloudflare-protected sites. Cookie warmup fallback for Akamai. The fast path runs at plain HTTP latency. Chrome only spins up when the fast path fails.
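To see why ordering matters, here is a sketch of JA3, a widely used TLS fingerprinting scheme of the kind WAFs apply: the ClientHello's version, cipher list, extensions, curves, and point formats are joined in order and hashed. The numeric values are illustrative, not Chrome's real ClientHello, and this is a conceptual sketch, not webclaw code.

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """JA3-style fingerprint: fields joined by commas, list items
    joined by dashes, then MD5-hashed to a 32-char hex string."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Two clients offering the same ciphers in a different order produce
# different fingerprints -- order is part of the identity, which is
# why a generic HTTP library stands out even with browser headers.
chrome_like = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23], [0])
reordered   = ja3_fingerprint(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23], [0])
print(chrome_like != reordered)  # True
```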
OUTPUT
Four formats: markdown, llm, json, text.
The llm format runs a 9-step optimization pipeline, including image stripping, link deduplication, whitespace collapse, stat merging, and boilerplate removal. On a typical content page you go from 50,000 tokens of HTML to around 2,000 tokens of clean content. 67% fewer tokens on average than standard markdown conversion.
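To illustrate the kind of work two of those steps do (boilerplate removal and whitespace collapse), here is a toy sketch. The phrase list and regexes are assumptions for the example, not webclaw's actual rules.

```python
import re

# Hypothetical boilerplate phrases -- a real pipeline would use
# structural signals, not just a phrase list.
BOILERPLATE = re.compile(
    r"(subscribe to our newsletter|accept cookies|share this article)", re.I
)

def llm_optimize(markdown: str) -> str:
    """Drop lines matching boilerplate phrases, then collapse runs
    of spaces and runs of blank lines."""
    lines = [ln for ln in markdown.splitlines() if not BOILERPLATE.search(ln)]
    text = "\n".join(lines)
    text = re.sub(r"[ \t]{2,}", " ", text)   # collapse repeated spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse blank-line runs
    return text.strip()
```

Every line the pipeline drops is tokens the model never has to pay for, which is where the bulk of the savings on a real page comes from.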
webclaw also extracts JavaScript data islands automatically. __NEXT_DATA__, window.__PRELOADED_STATE__, SvelteKit serialized data — content that lives in the DOM as JSON rather than rendered HTML. No headless browser needed for this. QuickJS executes the relevant inline scripts and surfaces the data as a structured_data field in the output.
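For the common Next.js case the idea can be sketched without a JS engine at all, since the payload is literal JSON inside a script tag. The regex-and-parse approach below is an illustration of the concept, not webclaw's implementation (which uses QuickJS for scripts that build state imperatively).

```python
import json
import re

# The Next.js data island is a script tag with a well-known id whose
# body is plain JSON -- no rendering required to read it.
NEXT_DATA = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.S,
)

def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if absent."""
    m = NEXT_DATA.search(html)
    return json.loads(m.group(1)) if m else None

html = ('<html><script id="__NEXT_DATA__" type="application/json">'
        '{"props":{"title":"hi"}}</script></html>')
print(extract_next_data(html)["props"]["title"])  # hi
```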
WHAT IT HANDLES
Any URL: static HTML, React SPAs, Next.js, SvelteKit, server-side rendered pages. File types: PDF, DOCX, XLSX, CSV — auto-detected from Content-Type, extracted inline. Cloudflare, Akamai, and other WAF-protected sites via TLS fingerprinting and antibot fallback. Multi-page crawling: BFS same-origin with configurable depth, concurrency, sitemap seeding. Batch scraping: multiple URLs in parallel with proxy rotation. Change detection: snapshot a page, diff it later. Structured extraction: pass a JSON schema, get typed data back. Deep research: multi-step search, fetch, and synthesis for a given query.
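The same-origin BFS crawl described above can be sketched as follows. fetch_links stands in for the actual page fetch and link extraction, which are assumed here; sitemap seeding and concurrency are omitted.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_crawl(start_url, fetch_links, max_depth=2):
    """Breadth-first crawl restricted to the start URL's origin.
    `fetch_links(url)` is assumed to return the hrefs found on the
    page (relative or absolute)."""
    origin = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([(start_url, 0)]), []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached; don't expand further
        for href in fetch_links(url):
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc == origin and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Breadth-first order means shallow pages, which usually carry the most signal, are extracted before deep ones, and the seen set keeps link cycles from looping forever.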
INTEGRATION
CLI, REST API, and MCP server for Claude, Cursor, and Windsurf. Eight MCP tools work fully offline; the full feature set unlocks with an API key.
If you're already using Firecrawl, the /v2 compatibility layer accepts the same request format. Change the base URL and your existing SDK keeps working.
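A minimal sketch of that base-URL swap, assuming a Firecrawl-style JSON body ("url" plus "formats") and a hypothetical webclaw base URL — both the endpoint host and the exact field set are assumptions here, not documented values:

```python
import json
from urllib.request import Request

BASE = "https://api.webclaw.example/v2"  # hypothetical base URL -- the only thing you change

def scrape_request(url: str, formats=("markdown",)) -> Request:
    """Build a Firecrawl-shaped scrape request aimed at the /v2
    compatibility endpoint. Nothing is sent here; the point is that
    the body stays the same and only BASE changes."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode()
    return Request(f"{BASE}/scrape", data=body,
                   headers={"Content-Type": "application/json"})

req = scrape_request("https://example.com")
```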
INSTALL
brew tap 0xMassi/webclaw && brew install webclaw
cargo install webclaw
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
OPEN SOURCE
Core extraction engine, CLI, and MCP server are open source under AGPL-3.0. You can self-host everything. The cloud API adds the antibot engine, managed proxies, and hosted infrastructure.
GitHub: https://github.com/0xMassi/webclaw