
Chatbot LLM Server
A Model Context Protocol (MCP) server that provides Large Language Model (LLM) inference capabilities using llama.cpp through its Python bindings. The server loads and serves Gemma 3 models (and other GGUF-format models) for chat completion tasks.
Features
- MCP Integration: Fully compatible with the Model Context Protocol for seamless integration
- CUDA Support: GPU acceleration for faster inference
- Lazy Loading: The model loads in the background after the MCP connection is established (see the sketch below)
- Tool Support: Built-in function calling capabilities
- Health Monitoring: Real-time status reporting and health checks
- Error Recovery: Graceful handling of model loading failures
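The lazy-loading and health-monitoring behavior roughly follows the pattern below. This is a minimal sketch with illustrative names and paths (assuming asyncio and llama-cpp-python), not the actual llm_server.py code:

import asyncio

model = None
model_status = "loading"

async def initialize_model_in_background(model_path: str, n_gpu_layers: int = 80):
    """Load the GGUF model off the event loop so MCP requests stay responsive."""
    global model, model_status
    try:
        from llama_cpp import Llama
        model = await asyncio.to_thread(
            Llama, model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=False
        )
        model_status = "ready"
    except Exception as exc:  # surfaced to clients through health_check
        model_status = f"error: {exc}"

def health_check() -> dict:
    """Roughly what the health_check tool reports while/after the model loads."""
    return {"status": model_status, "model_loaded": model is not None}

# Once the MCP connection is up, the server can schedule the load, e.g.:
# asyncio.create_task(initialize_model_in_background("models/gemma-3-27b-it-Q4_K_M.gguf"))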
Prerequisites
- Python 3.11 or higher
- NVIDIA GPU with CUDA 12.3+ (for GPU acceleration)
Installation
1. Create Virtual Environment
cd /path/to/chatbot-llm-server
python -m venv llm-env
# On Windows:
llm-env\Scripts\activate
# On Linux/Mac:
source llm-env/bin/activate
2. Install Dependencies
Basic MCP Support
pip install mcp
llama-cpp-python with CUDA Support
For CUDA 12.3 support (recommended):
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
Or install a specific version if needed (Windows example):
pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.8-cu123/llama_cpp_python-0.3.8-cp311-cp311-win_amd64.whl --force-reinstall --no-cache-dir
Verify Installation
python -c "import llama_cpp; print(f'llama-cpp-python version: {llama_cpp.__version__}')"
python -c "import mcp; print('MCP installed successfully')"
3. Download Model
Place your GGUF model file in the models/ directory:
chatbot-llm-server/
├── models/
│   └── gemma-3-27b-it-Q4_K_M.gguf
├── llm_server.py
├── llm_engine.py
├── config.py
└── serverconfig.json
4. Configuration
Create config.py
# LLM Configuration
LLM_CONFIG = {
    "config_path": "serverconfig.json"
}
Create serverconfig.json
{
    "active_model": "gemma3-27b",
    "models": [
        {
            "name": "gemma3-27b",
            "model": "models/gemma-3-27b-it-Q4_K_M.gguf",
            "n_ctx": 10000,
            "n_gpu_layers": 80,
            "n_threads": 12,
            "n_batch": 512,
            "verbose": false,
            "rope_freq_base": 32000,
            "rope_freq_scale": 1.0,
            "offload_kqv": true,
            "main_gpu": 1
        }
    ]
}
Configuration Parameters:
- n_ctx: Context window size (10000 = ~10K tokens)
- n_gpu_layers: Number of layers to offload to the GPU (80 = most of the model)
- n_threads: CPU threads for processing
- n_batch: Batch size for processing
- rope_freq_base: RoPE frequency base (32000 for Gemma 3)
- rope_freq_scale: RoPE scaling factor
- offload_kqv: Offload the key/value cache to the GPU
- main_gpu: Primary GPU device ID
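These entries map onto llama-cpp-python's Llama() constructor arguments of the same names. The following is a simplified sketch of what the engine does when it loads the active model, not the actual llm_engine.py code:

import json
from llama_cpp import Llama

with open("serverconfig.json") as f:
    cfg = json.load(f)

# Pick the entry named by active_model
entry = next(m for m in cfg["models"] if m["name"] == cfg["active_model"])

llm = Llama(
    model_path=entry["model"],
    n_ctx=entry["n_ctx"],
    n_gpu_layers=entry["n_gpu_layers"],
    n_threads=entry["n_threads"],
    n_batch=entry["n_batch"],
    verbose=entry["verbose"],
    rope_freq_base=entry["rope_freq_base"],
    rope_freq_scale=entry["rope_freq_scale"],
    offload_kqv=entry["offload_kqv"],
    main_gpu=entry["main_gpu"],
)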
Usage
Standalone Testing
Test the server directly:
cd /path/to/chatbot-llm-server
# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
python llm_server.py
You should see:
INFO:__main__:Starting MCP server...
INFO:__main__:✓ Background model initialization task started
INFO:__main__:Starting background model initialization...
INFO:__main__:✓ Model initialization completed successfully!
MCP Integration
The server is designed to be started by an MCP client. Configure your client with:
MCP_SERVERS = {
    "llm": {
        "command": "/path/to/chatbot-llm-server/llm-env/Scripts/python.exe",  # Windows
        # "command": "/path/to/chatbot-llm-server/llm-env/bin/python",  # Linux/Mac
        "args": ["/path/to/chatbot-llm-server/llm_server.py"],
        "required": False,
        "env": None
    }
}
Available Tools
The server provides these MCP tools:
- health_check: Get server and model status
- initialize_model: Manually trigger model loading
- get_model_info: Get information about the loaded model
- chat_completion: Generate chat completions with tool support
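A client built on the official mcp Python SDK can spawn the server over stdio and call these tools roughly as follows. This is a hedged sketch: the paths are placeholders and the chat_completion argument names are assumptions, so check the server's tool schema for the exact shape:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="/path/to/chatbot-llm-server/llm-env/bin/python",
        args=["/path/to/chatbot-llm-server/llm_server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            print(await session.call_tool("health_check", {}))
            # Argument names are illustrative; see the server's tool definition.
            result = await session.call_tool(
                "chat_completion",
                {"messages": [{"role": "user", "content": "Hello!"}]},
            )
            print(result)

asyncio.run(main())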
Troubleshooting
Common Issues
"No module named 'llama_cpp'"
# Ensure you're in the right virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
"unknown model architecture: 'gemma3'"
Your llama-cpp-python version is too old. Update to 0.3.8+:
pip install --upgrade llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
"Failed to load model from file"
- Check that the model file exists in the models/ directory (see the check below)
- Verify the file path in serverconfig.json is correct
- Ensure the model file isn't corrupted (it should be several GB)
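A quick throwaway check, run from the project root, covers the first two points; it assumes the serverconfig.json layout shown above:

import json, os

with open("serverconfig.json") as f:
    cfg = json.load(f)
entry = next(m for m in cfg["models"] if m["name"] == cfg["active_model"])
path = entry["model"]
print(path, "exists:", os.path.exists(path))
if os.path.exists(path):
    print(f"size: {os.path.getsize(path) / 1e9:.1f} GB")  # a 27B Q4_K_M is roughly 16 GB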
Model loads manually but not via MCP
- Verify the MCP client is using the correct Python environment
- Check that the working directory is set correctly
- Add debug logging to see what's different between manual and MCP execution
Testing Model Loading
Test if your model loads correctly:
cd /path/to/chatbot-llm-server
# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
python -c "from llama_cpp import Llama; llm = Llama(model_path='models/your-model.gguf', verbose=False); print('Model loaded successfully')"
Debug Logging
Enable file logging by modifying the logging setup in llm_server.py:
log_filename = f"llm_server_debug_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
level=logging.DEBUG,
handlers=[
logging.StreamHandler(),
logging.FileHandler(log_filename, mode='w')
]
)
Performance Notes
- Memory Usage: A 27B Q4_K_M model uses ~15-20 GB of VRAM (rough estimate below)
- Loading Time: Initial model loading takes 30-60 seconds
- Inference Speed: ~20-50 tokens/second depending on hardware
- Context Window: Configured for 10K tokens (can be increased if VRAM allows)
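As a rough sanity check on the memory figure above, assuming Q4_K_M averages about 4.8 bits per weight (KV cache and CUDA overhead add several more GB on top):

params = 27e9              # parameter count of the 27B model
bits_per_weight = 4.8      # rough average for Q4_K_M quantization
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone")  # about 16 GB before KV cache/overhead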
Model Compatibility
This server supports:
- Gemma 3 models (1B, 4B, 12B, and 27B variants)
- Llama 3.1 models (8B, 70B variants)
- Mistral models (7B, 22B variants)
- Any GGUF format model supported by llama.cpp
Ensure your llama-cpp-python version supports the model architecture you want to use.
Development
File Structure
chatbot-llm-server/
├── llm_server.py # MCP server implementation
├── llm_engine.py # Model loading and inference engine
├── config.py # Configuration loader
├── serverconfig.json # Model configuration
├── models/ # Model files directory
├── llm-env/ # Virtual environment
└── README.md # This file
Adding New Models
- Add the model to serverconfig.json:

  {
      "name": "your-model-name",
      "model": "models/your-model.gguf",
      "n_ctx": 4096,
      "n_gpu_layers": 40
  }

- Update the active_model field to use the new model
- Restart the server
License
This project is part of the larger chatbot ecosystem. See individual license files for details.
Quick Start
# Clone the repository
git clone https://github.com/grahzny/chatbot-llm-server
cd chatbot-llm-server
# Then follow the Installation, Configuration, and Usage sections above to set up
# the virtual environment, install the dependencies, and download a model.