The Web Scraping Integration for Claude is an MCP server that extends Claude's capabilities by allowing it to scrape and transcribe content from webpages, YouTube videos, and PDFs. Given a URL, Claude can extract and use the text content, enabling it to answer questions or perform tasks based on the linked material.
### get_pdf

Converts a URL that points to a PDF file into markdown text.

Args:
- `input_url` (str): URL of the PDF file to convert.

Returns:
- `str`: The PDF content as markdown text.
### get_webpage_content

Returns the text content of the webpage at the provided link. This tool is useful for accessing and extracting text from general webpages.

Args:
- `url` (str): The URL from which to extract the text.
### get_youtube_transcript

Extracts the transcript from a YouTube video. This tool is particularly useful when users provide YouTube links and ask questions based on the video content.

Args:
- `url` (str): The URL of the YouTube video whose transcript should be extracted.
To set up the Web Scraping Integration for Claude, follow these steps:
1. Clone the Repository:

   ```bash
   git clone https://github.com/saishridhar/webscraper.git
   cd webscraper
   ```
2. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Run the Server:

   ```bash
   ./run_webscraper.sh
   ```
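With the server script in place, you can optionally check that it starts and exposes the expected tools before wiring it into Claude. The sketch below is a minimal check using the official `mcp` Python SDK (installable with `pip install mcp`); it assumes the server speaks MCP over stdio and is launched with `run_webscraper.sh`, so adjust the command if your setup differs.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command, taken from step 3 above.
server_params = StdioServerParameters(command="./run_webscraper.sh")


async def list_tools() -> None:
    # Start the server as a subprocess and talk to it over stdio.
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Should include get_pdf, get_webpage_content, and get_youtube_transcript.
            print([tool.name for tool in tools.tools])


asyncio.run(list_tools())
```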
Once the server is running, you can integrate it with Claude to enable web scraping capabilities. Simply provide the URL of the webpage, YouTube video, or PDF, and Claude will be able to extract and utilize the text content.
**Extracting Webpage Content:**

```python
from webscraper import get_webpage_content

content = get_webpage_content("https://example.com")
print(content)
```
**Extracting YouTube Transcript:**

```python
from webscraper import get_youtube_transcript

transcript = get_youtube_transcript("https://www.youtube.com/watch?v=example")
print(transcript)
```
**Converting PDF to Markdown:**

```python
from webscraper import get_pdf

markdown_text = get_pdf("https://example.com/document.pdf")
print(markdown_text)
```
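The same tools can also be called through any MCP client rather than imported directly. The sketch below shows one way to do this with the official `mcp` Python SDK; it assumes the server communicates over stdio and is started with `run_webscraper.sh`, and it uses the tool and argument names from the reference above.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command from the setup steps; adjust the path if needed.
server_params = StdioServerParameters(command="./run_webscraper.sh")


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Invoke a tool by name, the same way Claude does once the server is connected.
            result = await session.call_tool(
                "get_webpage_content",
                arguments={"url": "https://example.com"},
            )
            print(result.content)


asyncio.run(main())
```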
This MCP server transcribes webpages, YouTube videos, and PDFs for LLMs like Claude, enabling them to access and use content from a variety of sources simply by being given a URL. This integration enhances Claude's ability to interact with and respond to user queries based on external content.