An MCP server that fetches web page content using the Playwright headless browser, with intelligent content extraction and parallel processing.
Fetcher MCP for Web Content Extraction
Overview
Fetcher MCP is an MCP server that fetches web page content using the Playwright headless browser. Because pages are loaded in a real browser, it handles dynamic content and modern web applications, which makes it well suited to web scraping and content extraction tasks.
Advantages
- JavaScript Support: Executes JavaScript to handle dynamic web content.
- Intelligent Content Extraction: Uses a Readability algorithm to extract main content.
- Flexible Output Format: Supports HTML and Markdown output formats.
- Parallel Processing: Enables concurrent fetching of multiple URLs.
- Resource Optimization: Blocks unnecessary resources to reduce bandwidth usage.
- Robust Error Handling: Comprehensive error handling and logging.
- Configurable Parameters: Fine-grained control over timeouts and content extraction.
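As a rough illustration of the resource-optimization point above, the sketch below decides which requests to skip based on the resource type that Playwright reports. The function name `shouldBlockResource` and the set of blocked types are assumptions for illustration; the list above does not say exactly which resources the server blocks.

```typescript
// Hypothetical sketch: decide whether a request should be skipped.
// The type strings are values returned by Playwright's request.resourceType();
// which of them to block is an assumption, not the server's confirmed behavior.
const BLOCKED_TYPES: ReadonlySet<string> = new Set([
  "image",
  "media",
  "font",
  "stylesheet",
]);

function shouldBlockResource(resourceType: string, disableMedia: boolean): boolean {
  // Only block when the caller asked for media to be disabled.
  return disableMedia && BLOCKED_TYPES.has(resourceType);
}

// With Playwright, this could be wired up roughly as:
//   await page.route("**/*", (route) =>
//     shouldBlockResource(route.request().resourceType(), true)
//       ? route.abort()
//       : route.continue());

console.log(shouldBlockResource("image", true));    // true
console.log(shouldBlockResource("document", true)); // false
```

Keeping the decision in a plain function makes it easy to check without launching a browser.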
Quick Start
Run directly with npx:
npx -y fetcher-mcp
For first-time setup, install the required browser:
npx playwright install chromium
Debug Mode
Run with the --debug option to show the browser window for debugging:
npx -y fetcher-mcp --debug
Configuration MCP
Configure this MCP server in Claude Desktop:
On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%/Claude/claude_desktop_config.json
{
"mcpServers": {
"fetcher": {
"command": "npx",
"args": ["-y", "fetcher-mcp"]
}
}
}
Features
- fetch_url: Retrieve web page content from a specified URL.
  - Parameters:
    - url: The URL of the web page to fetch.
    - timeout: Page loading timeout in milliseconds.
    - waitUntil: Specifies when navigation is considered complete.
    - extractContent: Whether to intelligently extract the main content.
    - maxLength: Maximum length of returned content.
    - returnHtml: Whether to return HTML content instead of Markdown.
    - waitForNavigation: Whether to wait for additional navigation.
    - navigationTimeout: Maximum time to wait for additional navigation.
    - disableMedia: Whether to disable media resources.
    - debug: Whether to enable debug mode.
- fetch_urls: Batch retrieve web page content from multiple URLs in parallel.
  - Parameters:
    - urls: Array of URLs to fetch.
    - Other parameters are the same as fetch_url.
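For reference, an MCP client invokes these tools through a JSON-RPC `tools/call` request. The sketch below builds such request objects using the parameter names listed above. The `FetchOptions` type, the helper names, and all example values are assumptions for illustration; which parameters are optional and what their defaults are is not stated here.

```typescript
// Options shared by fetch_url and fetch_urls, named as in the parameter list.
// Treating every option as optional is an assumption.
interface FetchOptions {
  timeout?: number; // page loading timeout in milliseconds
  waitUntil?: string;
  extractContent?: boolean;
  maxLength?: number;
  returnHtml?: boolean;
  waitForNavigation?: boolean;
  navigationTimeout?: number;
  disableMedia?: boolean;
  debug?: boolean;
}

// Shape of an MCP tool invocation (JSON-RPC 2.0, method "tools/call").
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: build a fetch_url call for a single page.
function fetchUrlRequest(id: number, url: string, options: FetchOptions = {}): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "fetch_url", arguments: { url, ...options } },
  };
}

// Hypothetical helper: build a fetch_urls call for a batch of pages.
function fetchUrlsRequest(id: number, urls: string[], options: FetchOptions = {}): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "fetch_urls", arguments: { urls, ...options } },
  };
}

const request = fetchUrlsRequest(
  1,
  ["https://example.com/a", "https://example.com/b"],
  { timeout: 30000 }, // placeholder value
);
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client such as Claude Desktop constructs these requests itself; the sketch only makes the argument shape concrete.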
Tips
Handling Special Website Scenarios
Dealing with Anti-Crawler Mechanisms
- Wait for Complete Loading: For websites using CAPTCHA, redirects, or other verification mechanisms.
- Increase Timeout Duration: For websites that load slowly.
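One way to read the two tips above in terms of the `fetch_url` parameters is sketched below. It assumes that "wait for complete loading" corresponds to `waitForNavigation` and "increase timeout duration" to `timeout`; the helper name `patientFetchArgs` and the numeric value are placeholders, not recommended settings.

```typescript
// Arguments relevant to pages that redirect or load slowly.
interface NavigationArgs {
  url: string;
  waitForNavigation: boolean;
  timeout?: number; // page loading timeout in milliseconds
}

// Hypothetical helper: always wait for additional navigation, and
// optionally raise the page loading timeout for slow sites.
function patientFetchArgs(url: string, timeoutMs?: number): NavigationArgs {
  const args: NavigationArgs = { url, waitForNavigation: true };
  if (timeoutMs !== undefined) {
    args.timeout = timeoutMs;
  }
  return args;
}

// 60000 ms is an arbitrary example value.
console.log(patientFetchArgs("https://example.com/login", 60000));
```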
Content Retrieval Adjustments
- Preserve Original HTML Structure: When content extraction might fail.
- Fetch Complete Page Content: When extracted content is too limited.
- Return Content as HTML: When HTML format is needed instead of default Markdown.
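The three adjustments above plausibly map onto the `extractContent` and `returnHtml` parameters. The sketch below spells out one such mapping; the mapping itself, the `ContentTip` labels, and the function name are assumptions for illustration rather than documented behavior.

```typescript
// Labels for the three tips above (hypothetical names).
type ContentTip = "preserveHtmlStructure" | "fullPage" | "asHtml";

interface ContentArgs {
  extractContent?: boolean;
  returnHtml?: boolean;
}

// Assumed mapping from each tip to fetch_url arguments.
function contentArgsFor(tip: ContentTip): ContentArgs {
  switch (tip) {
    case "preserveHtmlStructure":
      // Skip extraction and keep the markup as-is.
      return { extractContent: false, returnHtml: true };
    case "fullPage":
      // Skip extraction so the whole page is returned.
      return { extractContent: false };
    case "asHtml":
      // Keep extraction, but return HTML rather than Markdown.
      return { returnHtml: true };
  }
}

const args = { url: "https://example.com", ...contentArgsFor("fullPage") };
console.log(args);
```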
Debugging and Authentication
Enabling Debug Mode
- Dynamic Debug Activation: To display the browser window during a specific fetch operation.
Using Custom Cookies for Authentication
- Manual Login: To log in using your own credentials.
- Interacting with Debug Browser: When debug mode is enabled.
- Enable Debug for Specific Requests: Even if the server is already running, you can enable debug mode for a specific request.
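Enabling debug mode for a single request presumably comes down to setting the `debug` parameter on that call. A minimal sketch, assuming that reading; the helper name `withDebug` is hypothetical:

```typescript
// Hypothetical helper: return a copy of the tool arguments with debug enabled,
// leaving the original object untouched.
function withDebug<T extends object>(args: T): T & { debug: true } {
  return { ...args, debug: true as const };
}

const base = { url: "https://example.com/account" };
console.log(withDebug(base));
```

Other requests built from `base` are unaffected, so only the call that needs the visible browser window opts in.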
Development
Install Dependencies
npm install
Install Playwright Browser
npm run install-browser
Build the Server
npm run build
Debugging
Use MCP Inspector for debugging:
npm run inspector
You can also enable visible browser mode for debugging:
node build/index.js --debug
Related Projects
- g-search-mcp: An MCP server for Google search that runs searches for multiple keywords in parallel.
License
Licensed under the MIT License