
pinecone mcp helper
用于将 Git 仓库及关联网页内容处理并存储到 Pinecone 向量数据库的自动化工具。
Repository Info
About This Server
用于将 Git 仓库及关联网页内容处理并存储到 Pinecone 向量数据库的自动化工具。
Model Context Protocol (MCP) - This server can be integrated with AI applications to provide additional context and capabilities, enabling enhanced AI interactions and functionality.
Documentation
Git Repository to Pinecone Ingestion Pipeline
This Python script automates the process of ingesting content from a Git repository and associated web links into a Pinecone vector database. The script clones a specified repository, processes its content using Repomix, identifies external links, scrapes/searches those links using Firecrawl, embeds all collected text content into vectors, and finally upserts these vectors into a structured Pinecone index.
Features
- Clone or update Git repositories
- Process repository content using Repomix
- Extract URLs from repository content
- Scrape web content using Firecrawl
- Generate vector embeddings for text content
- Upsert vectors to Pinecone with appropriate metadata
- Comprehensive error handling and logging
⚠️ Note: Firecrawl search and deep research features are still in testing and may not work as expected. Results may vary depending on your environment and API access.
Requirements
- Python 3.8+
- Git
- Repomix
- Pinecone API key
- Firecrawl API key
- Embedding API key (optional, depending on the embedding model)
Project Structure
.
├── config.yaml # Configuration file
├── main.py # Main entry point
├── requirements.txt # Python dependencies
├── pytest.ini # Pytest configuration
├── repo_ingestion/ # Main package
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── pipeline.py # Pipeline integration
│ ├── config/ # Configuration handling
│ │ ├── __init__.py
│ │ └── config_loader.py
│ ├── embedding/ # Embedding functionality
│ │ ├── __init__.py
│ │ └── embedder.py
│ ├── firecrawl/ # Firecrawl integration
│ │ ├── __init__.py
│ │ └── crawler.py
│ ├── git/ # Git repository management
│ │ ├── __init__.py
│ │ └── repo_manager.py
│ ├── pinecone/ # Pinecone integration
│ │ ├── __init__.py
│ │ └── index_manager.py
│ ├── repomix/ # Repomix processing
│ │ ├── __init__.py
│ │ └── processor.py
│ └── utils/ # Utility functions
│ ├── __init__.py
│ └── logging_utils.py
└── tests/ # Unit tests
├── __init__.py
├── test_config_loader.py
├── test_firecrawl_crawler.py
├── test_repo_manager.py
└── test_repomix_processor.py
Installation
-
Clone this repository:
git clone <repository-url> cd <repository-directory> -
Install the required dependencies:
pip install -r requirements.txt -
Set up the required environment variables:
export PINECONE_API_KEY=your_pinecone_api_key export PINECONE_ENVIRONMENT=your_pinecone_environment export FIRECRAWL_API_KEY=your_firecrawl_api_key export EMBEDDING_API_KEY=your_embedding_api_key # Optional, depending on the embedding modelAlternatively, you can create a
.envfile in the project root with these variables:PINECONE_API_KEY=your_pinecone_api_key PINECONE_ENVIRONMENT=your_pinecone_environment FIRECRAWL_API_KEY=your_firecrawl_api_key EMBEDDING_API_KEY=your_embedding_api_key
Configuration
The script uses a YAML configuration file (config.yaml) to specify parameters for the pipeline. Here's an example configuration:
pinecone:
dimension: 1536 # Vector dimension for the Pinecone index
metric: cosine # Similarity metric (cosine, dotproduct, euclidean)
embedding:
model: multilingual-e5-large # Embedding model to use
Configuration Options
Pinecone Configuration
pinecone.dimension: Integer specifying the vector dimension for the Pinecone index (e.g., 1536)pinecone.metric: String specifying the similarity metric for the Pinecone index (e.g.,cosine,dotproduct,euclidean)
Embedding Configuration
embedding.model: String specifying the identifier or name of the embedding model to use. Supported models:multilingual-e5-large: Multilingual dense embedding modelllama-text-embed-v2: High-performance dense embedding modelpinecone-sparse-english-v0: Sparse embedding model for keyword or hybrid search
Usage
Run the script with a Git repository URL or local path:
python main.py https://github.com/user/repo.git
Command-line Options
python main.py --help
Available options:
repo_url: URL of the Git repository to ingest or path to a local Git repository (required positional argument)--config: Path to the YAML configuration file (default:config.yaml)--log-level: Set the logging level (choices:DEBUG,INFO,WARNING,ERROR,CRITICAL, default:INFO)--log-file: Path to the log file (if not provided, logs will only be written to the console)
Pipeline Architecture
The script follows a sequential pipeline architecture:
- Initialization & Input: Receive the Git repository URL and load configuration.
- Repository Management: Clone or update the target repository.
- Repomix Processing: Run Repomix on the local repository and capture its output.
- Repomix Output Parsing & Chunking: Read, parse, and chunk the Repomix output.
- Embedding Preparation: Configure the embedding model based on settings.
- Pinecone Index Management: Ensure the target Pinecone index exists (create if necessary).
- Repomix Data Embedding & Ingestion: Embed Repomix chunks and upsert to Pinecone.
- Firecrawl URL Identification: Extract URLs from the Repomix output.
- Firecrawl Processing: Use Firecrawl SDK to scrape/search extracted URLs.
- Firecrawl Data Embedding & Ingestion: Chunk and embed Firecrawl results and upsert to the same Pinecone index.
- Completion/Reporting: Indicate success or failure.
Data Flow
- Repository Content: Processed by Repomix and stored in an XML file
- Content Chunks: Extracted from the XML file with metadata (file path, source type)
- Embeddings: Generated for each content chunk using the configured embedding model
- Vectors: Upserted to Pinecone with metadata for traceability
- URLs: Extracted from repository content and processed by Firecrawl
- Web Content: Scraped from URLs, chunked, embedded, and upserted to Pinecone
Error Handling
The script implements robust error handling for various stages of the pipeline:
- CLI argument parsing
- Configuration loading
- Environment variable validation
- Git operations
- Repomix execution and output parsing
- Embedding
- Pinecone API calls
- Firecrawl API calls
- File system operations
Testing
The project includes comprehensive unit tests for all components. To run the tests:
python -m pytest
To run tests with verbose output:
python -m pytest -v
To run a specific test file:
python -m pytest tests/test_config_loader.py
Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Commit your changes:
git commit -m 'Add some feature' - Push to the branch:
git push origin feature-name - Submit a pull request
License
MIT License
Quick Start
Clone the repository
git clone https://github.com/decision-crafters/pinecone-mcp-helperInstall dependencies
cd pinecone-mcp-helper
npm installFollow the documentation
Check the repository's README.md file for specific installation and usage instructions.
Repository Details
Recommended MCP Servers
Discord MCP
Enable AI assistants to seamlessly interact with Discord servers, channels, and messages.
Knit MCP
Connect AI agents to 200+ SaaS applications and automate workflows.
Apify MCP Server
Deploy and interact with Apify actors for web scraping and data extraction.
BrowserStack MCP
BrowserStack MCP Server for automated testing across multiple browsers.
Zapier MCP
A Zapier server that provides automation capabilities for various apps.