# AI Assistant Document Integration Server

An MCP server that enables AI assistants to search and access private documents, codebases, and up-to-date technical information by processing Markdown, text, and PDF files into a searchable database.

## Overview

The AI Assistant Document Integration Server extends the capabilities of AI assistants by enabling them to access and search private documents, codebases, and up-to-date technical information. It processes Markdown, text, and PDF files into a searchable vector database, allowing AI models to retrieve information beyond their training data. Built with Docker, it supports both free local and paid OpenAI embedding models, keeping AI assistants current with your data.
## Key Features

- **Document Processing**: Converts Markdown, text, and PDF files into a searchable vector database.
- **Model Context Protocol (MCP)**: Implements the MCP standard so AI assistants can query external data sources.
- **Up-to-Date Knowledge**: Works around LLM knowledge cutoffs by integrating the latest framework documentation, private codebases, and technical specifications.
- **Flexible Embedding Models**: Supports both free local embeddings and paid OpenAI embeddings.
- **Docker Integration**: Simple setup and deployment using Docker containers.
## Architecture

The system consists of two main components (sketched below):

1. **Processing Pipeline**: Reads, chunks, and generates embeddings for documents, storing them in a vector database.
2. **MCP Server**: Exposes the processed content through MCP tools, enabling AI assistants to search and retrieve information.
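As a rough illustration of how the two components fit together, the sketch below chunks text, embeds it, and stores it in a Chroma collection that the server side can later query. It assumes `sentence-transformers` and `chromadb` as building blocks (consistent with the model names and the Chroma database mentioned elsewhere in this README); the project's actual code may be organized quite differently.

```python
# Illustrative sketch only: names, paths, and wiring are assumptions,
# not the project's actual implementation.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")  # hypothetical path
collection = client.get_or_create_collection("documents")

# 1. Processing pipeline: chunk, embed, store.
chunks = ["First chunk of a Markdown file...", "Second, overlapping chunk..."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
)

# 2. MCP server: answer a search via nearest-neighbor lookup over the embeddings.
query_vec = model.encode(["how do I configure the server?"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=3)
print(results["documents"])
```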
## Use Cases

- **Latest Framework Documentation**: Keep AI assistants updated with the latest React, Angular, or Vue documentation.
- **Private Codebase Integration**: Allow AI assistants to understand and debug proprietary code.
- **Technical Specifications**: Provide AI assistants with up-to-date API and protocol documentation.
## Prerequisites

- **Docker**: Docker Desktop for Windows/Mac or Docker Engine for Linux.
- **OpenAI API Key (Optional)**: Required only for the paid embedding models.
- **MCP-Compatible AI Assistant**: Such as Roo or other compatible assistants.
## Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/donphi/mcp-server.git
   cd mcp-server
   ```

2. Create a `.env` file from the example:

   ```shell
   cp .env.example .env
   nano .env
   ```

3. Place your Markdown and text files in the `data/` directory.
## Configuration

Configure the server using environment variables in the `.env` file:

```shell
OPENAI_API_KEY=your_openai_api_key_here
CHUNK_SIZE=800
CHUNK_OVERLAP=120
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
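With the values shown, each new chunk starts 680 characters after the previous one, so consecutive chunks share 120 characters and a sentence that straddles a boundary survives intact in at least one chunk. A minimal sketch of that sliding window, assuming character-based splitting (the pipeline's actual unit and splitter may differ):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into overlapping windows (character-based, for illustration)."""
    step = size - overlap  # 680: how far each chunk's start advances
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 1,000-character document yields two chunks: [0:800] and [680:1000].
print([len(c) for c in chunk_text("x" * 1000)])  # [800, 320]
```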
## Embedding Models

### Free Models (No API Key Required)

- `sentence-transformers/all-MiniLM-L6-v2`: Compact model for sentence and paragraph encoding.
- `BAAI/bge-m3`: Supports multiple retrieval modes and more than 100 languages.
- `Snowflake/snowflake-arctic-embed-m`: Optimized for high-quality retrieval.

### Paid Models (Require an OpenAI API Key)

- `text-embedding-3-small`: Cost-effective with good quality.
- `text-embedding-3-large`: Highest-quality embeddings.
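Both kinds of model turn a piece of text into a fixed-length vector; the difference is where the computation runs and what it costs. A hedged sketch of the two paths (the pipeline's actual wiring may differ, and the OpenAI call requires `OPENAI_API_KEY` to be set):

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

text = "How does the processing pipeline chunk documents?"

# Free path: the model downloads once, then runs entirely on your machine.
local_vec = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(text)

# Paid path: one API call; reads OPENAI_API_KEY from the environment.
remote = OpenAI().embeddings.create(model="text-embedding-3-small", input=text)
remote_vec = remote.data[0].embedding

print(len(local_vec), len(remote_vec))  # 384 vs. 1536 dimensions
```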
## Usage

### Processing Files

Run the pipeline to process files and generate embeddings:

```shell
docker-compose build pipeline
docker-compose run pipeline
```

### Building the MCP Server

After processing documents, build the server image:

```shell
docker-compose build server
```
### Connecting to an AI Assistant

Generate the client configuration file using the provided scripts:

- macOS/Linux:

  ```shell
  chmod +x setup-mcpServer-json.sh
  ./setup-mcpServer-json.sh
  ```

- Windows: Double-click `setup-mcpServer-json.bat`.
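The scripts emit an MCP client configuration file; prefer their output, but for orientation, an entry registering a Docker-based server typically looks something like the following (the command and arguments here are illustrative assumptions, not the exact values the scripts write):

```json
{
  "mcpServers": {
    "mcp-server": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "mcp-server"]
    }
  }
}
```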
## MCP Tools

The server exposes the following tools:

- `read_md_files`: Process and retrieve files.
- `search_content`: Search across processed content.
- `get_context`: Retrieve contextual information.
- `project_structure`: Provide project structure information.
- `suggest_implementation`: Generate implementation suggestions.
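MCP clients invoke these tools over JSON-RPC with a `tools/call` request. As an illustration, a `search_content` call might look like the following (the `query` argument name is an assumption; the server's actual input schema is reported by its `tools/list` response):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_content",
    "arguments": { "query": "authentication flow" }
  }
}
```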
## Supported File Types
- Markdown (.md)
- Text (.txt)
- PDF (.pdf)
- Word documents (.docx, .doc)
## Troubleshooting

- **Docker not found**: Ensure Docker is installed and running.
- **"Invalid reference format" error**: Build the server image before running it.
- **API key issues**: Switch to a free local embedding model, which requires no API key.
- **Chroma database not found**: Run the pipeline to process documents first.
## Advanced Configuration

Customize the pipeline and server for advanced use cases:

- **Custom Embedding Functions**: Modify the embedding logic (see the sketch below).
- **Chunking Behavior**: Adjust the chunking parameters.
- **Chunk Analysis**: Compare standard and enhanced chunking methods.
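For example, swapping in a different embedding model amounts to providing a custom embedding function. A hypothetical sketch using chromadb's `EmbeddingFunction` interface (assuming the pipeline stores vectors in Chroma, as the Troubleshooting section suggests; adapt to the actual code):

```python
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class LocalEmbeddingFunction(EmbeddingFunction):
    """Hypothetical custom embedding function; the model choice is illustrative."""

    def __init__(self, model_name: str = "BAAI/bge-m3"):
        self._model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # Normalized vectors pair well with cosine distance.
        return self._model.encode(list(input), normalize_embeddings=True).tolist()
```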
## License

This project is licensed under the MIT License.
Created with ❤️ by donphi