# @lukaszraczylo/cloudflare-crawl-mcp
MCP server for crawling websites using Cloudflare Browser Rendering API. Supports multiple output formats including Markdown, HTML, and JSON. ## Features - **Multiple Output Formats**: Choose between Markdown, HTML, or JSON output - **Configurable Crawling**: Control depth, page limits, and link following - **Pattern Filtering**: Include/exclude URLs using wildcard patterns - **JavaScript Rendering**: Execute JavaScript for dynamic content (or disable for static content) - **Environment-Based Secrets**: Securely manage credentials via environment variables ## Prerequisites - Node.js 18+ - Cloudflare account with Browser Rendering API access - Cloudflare API Token with `Browser Rendering` permissions - Cloudflare Account ID ## Quick Start ```bash # Clone and setup npm install npm run build # Run with environment variables CF_API_TOKEN=your_token CF_ACCOUNT_ID=your_account_id npm start ``` ## Installation ### 1. Clone the Repository ```bash git clone https://github.com/lukaszraczylo/cloudflare-crawl-mcp.git cd cloudflare-crawl-mcp ``` ### 2. Install Dependencies ```bash npm install ``` ### 3. Build the Server ```bash npm run build ``` ### 4. Configure Environment Variables Copy the example environment file and add your credentials: ```bash cp .env.example .env ``` Edit `.env` with your Cloudflare credentials: ``` CF_API_TOKEN=your_cloudflare_api_token CF_ACCOUNT_ID=your_cloudflare_account_id ``` #### Getting Cloudflare Credentials 1. **Account ID**: Find it at https://dash.cloudflare.com/_/account 2. **API Token**: Create one at https://dash.cloudflare.com/profile/api-tokens with these permissions: - `Account` > `Browser Rendering` > `Edit` ## MCP Configuration ### Claude Desktop (macOS) Add to `~/Library/Application Support/Claude/claude_desktop_config.json`: ```json { "mcpServers": { "cloudflare-crawl": { "command": "npm", "args": ["start"], "env": { "CF_API_TOKEN": "your_api_token", "CF_ACCOUNT_ID": "your_account_id" }, "path": "/path/to/cloudflare-crawl-mcp" } } } ``` ### Claude Code (CLI) ```json { "mcpServers": { "cloudflare-crawl": { "command": "npm", "args": ["start"], "env": { "CF_API_TOKEN": "your_api_token", "CF_ACCOUNT_ID": "your_account_id" } } } } ``` ### Cursor Add to `~/.cursor/settings.json` (MCP configuration): ```json { "mcpServers": { "cloudflare-crawl": { "command": "npm", "args": ["start"], "env": { "CF_API_TOKEN": "your_api_token", "CF_ACCOUNT_ID": "your_account_id" }, "path": "/path/to/cloudflare-crawl-mcp" } } } ``` ## Available Tools ### crawl_url_markdown Crawl a website and return content in **Markdown** format. ```typescript { "name": "crawl_url_markdown", "arguments": { "url": "https://example.com/docs", "limit": 50, "depth": 2, "includePatterns": ["https://example.com/docs/**"], "excludePatterns": ["https://example.com/docs/archive/**"], "render": true } } ``` ### crawl_url_html Crawl a website and return content in **HTML** format. ```typescript { "name": "crawl_url_html", "arguments": { "url": "https://example.com", "limit": 10 } } ``` ### crawl_url_json Crawl a website and return content in **JSON** format (uses Workers AI for data extraction). ```typescript { "name": "crawl_url_json", "arguments": { "url": "https://example.com/products", "limit": 20 } } ``` ## Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `url` | string | required | Starting URL to crawl | | `limit` | number | 10 | Maximum pages to crawl (max: 100,000) | | `depth` | number | 1 | Maximum link depth from starting URL | | `includeSubdomains` | boolean | false | Follow links to subdomains | | `includeExternalLinks` | boolean | false | Follow links to external domains | | `includePatterns` | string[] | [] | Wildcard patterns to include | | `excludePatterns` | string[] | [] | Wildcard patterns to exclude | | `render` | boolean | true | Execute JavaScript (false = faster static fetch) | ### Pattern Syntax - `*` - Matches any characters except `/` - `**` - Matches any characters including `/` Examples: - `https://example.com/docs/**` - All URLs under /docs - `https://example.com/*.html` - All HTML files directly in root ## Development ### Commands ```bash npm run build # Build TypeScript npm start # Run server npm test # Run tests npm run test:watch # Run tests in watch mode ``` ### Testing The project includes comprehensive tests covering: - Environment variable handling - Crawl options building - Result formatting (Markdown, HTML, JSON) - Error handling - API integration Run tests: ```bash npm test ``` ## Architecture ``` src/ ├── index.ts # Main MCP server implementation │ ├── API Layer │ ├── initiateCrawl() # POST to /crawl endpoint │ ├── waitForCrawl() # Poll for job completion │ └── getCrawlResults() # Fetch final results │ ├── Formatters │ ├── formatMarkdownResult() │ ├── formatHtmlResult() │ └── formatJsonResult() │ └── MCP Handlers ├── ListToolsRequestSchema # Tool registration └── CallToolRequestSchema # Tool execution ``` ## Cloudflare Limits - **Max crawl duration**: 7 days - **Results available**: 14 days after completion - **Max pages per job**: 100,000 - **Free plan**: 10 minutes of browser time per day See [Cloudflare Browser Rendering Limits](https://developers.cloudflare.com/browser-rendering/limits/) for details. ## Troubleshooting ### Crawl returns no results - Check `robots.txt` blocking (use `render: false` to bypass) - Verify `includePatterns` match actual URLs - Try increasing `depth` or disabling pattern filters ### Job cancelled due to limits - Upgrade to Workers Paid plan - Use `render: false` for static content - Reduce `limit` parameter ### Authentication errors - Verify API Token has Browser Rendering permissions - Confirm Account ID is correct ## License MIT License - see [LICENSE](LICENSE) file. ## Contributing Contributions are welcome! Please read our contributing guidelines before submitting PRs at https://github.com/lukaszraczylo/cloudflare-crawl-mcp. ## Support - Open an issue at https://github.com/lukaszraczylo/cloudflare-crawl-mcp/issues - Check Cloudflare's [Browser Rendering Docs](https://developers.cloudflare.com/browser-rendering/) for API details