
Chatbot LLM Server

A Model Context Protocol (MCP) server that provides Large Language Model (LLM) inference capabilities using llama.cpp through its Python bindings. This server loads and serves Gemma 3 models (and other GGUF format models) for chat completion tasks.
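Under the hood, inference is plain llama-cpp-python. The snippet below is a minimal sketch of that path, not the project's llm_engine.py; the model path and parameters are placeholders taken from the example configuration further down.

# Minimal sketch of the inference path (illustrative, not llm_engine.py):
# load a GGUF model with llama-cpp-python and run one chat completion.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-3-27b-it-Q4_K_M.gguf",  # placeholder path
    n_ctx=10000,        # context window
    n_gpu_layers=80,    # offload most layers to the GPU
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])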

Features

  • MCP Integration: Fully compatible with the Model Context Protocol for seamless integration
  • CUDA Support: GPU acceleration for faster inference
  • Lazy Loading: The model loads in the background after the MCP connection is established (see the sketch after this list)
  • Tool Support: Built-in function calling capabilities
  • Health Monitoring: Real-time status reporting and health checks
  • Error Recovery: Graceful handling of model loading failures
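A minimal sketch of the lazy-loading pattern, assuming an asyncio-based server loop; the function and variable names are illustrative and do not come from llm_server.py.

# Illustrative sketch of background (lazy) model loading with error recovery.
import asyncio
import logging

from llama_cpp import Llama

logger = logging.getLogger(__name__)
model = None
model_status = "loading"

def _load_model() -> Llama:
    # Blocking call; run it in a worker thread so the MCP event loop stays responsive.
    return Llama(model_path="models/gemma-3-27b-it-Q4_K_M.gguf", n_gpu_layers=80, verbose=False)

async def initialize_model_in_background() -> None:
    global model, model_status
    try:
        model = await asyncio.to_thread(_load_model)
        model_status = "ready"
        logger.info("Model initialization completed successfully")
    except Exception:
        # Error recovery: keep the server alive and report the failure via health_check.
        model_status = "error"
        logger.exception("Model initialization failed")

# Once the MCP connection is up, schedule loading without blocking:
# asyncio.create_task(initialize_model_in_background())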

Prerequisites

  • Python 3.11 or higher
  • NVIDIA GPU with CUDA 12.3+ (for GPU acceleration)

Installation

1. Create Virtual Environment

git clone https://github.com/grahzny/chatbot-llm-server
cd chatbot-llm-server
python -m venv llm-env

# On Windows:
llm-env\Scripts\activate

# On Linux/Mac:
source llm-env/bin/activate

2. Install Dependencies

Basic MCP Support

pip install mcp

llama-cpp-python with CUDA Support

For CUDA 12.3 support (recommended):

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

Or install a specific version if needed (Windows example):

pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.8-cu123/llama_cpp_python-0.3.8-cp311-cp311-win_amd64.whl --force-reinstall --no-cache-dir

Verify Installation

python -c "import llama_cpp; print(f'llama-cpp-python version: {llama_cpp.__version__}')"
python -c "import mcp; print('MCP installed successfully')"

3. Download Model

Place your GGUF model file in the models/ directory:

chatbot-llm-server/
├── models/
│   └── gemma-3-27b-it-Q4_K_M.gguf
├── llm_server.py
├── llm_engine.py
├── config.py
└── serverconfig.json

4. Configuration

Create config.py

# LLM Configuration
LLM_CONFIG = {
    "config_path": "serverconfig.json"
}

Create serverconfig.json

{
  "active_model": "gemma3-27b",
  "models": [
    {
      "name": "gemma3-27b",
      "model": "models/gemma-3-27b-it-Q4_K_M.gguf",
      "n_ctx": 10000,
      "n_gpu_layers": 80,
      "n_threads": 12,
      "n_batch": 512,
      "verbose": false,
      "rope_freq_base": 32000,
      "rope_freq_scale": 1.0,
      "offload_kqv": true,
      "main_gpu": 1
    }
  ]
}

Configuration Parameters:

  • n_ctx: Context window size (10000 = ~10K tokens)
  • n_gpu_layers: Number of layers to offload to GPU (80 = most of the model)
  • n_threads: CPU threads for processing
  • n_batch: Batch size for processing
  • rope_freq_base: RoPE frequency base (32000 for Gemma 3)
  • rope_freq_scale: RoPE scaling factor
  • offload_kqv: Offload key/value cache to GPU
  • main_gpu: Primary GPU device ID
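These keys map directly onto keyword arguments of llama-cpp-python's Llama constructor. The snippet below sketches how a loader might apply them; it is illustrative, not the project's config.py.

# Sketch: read serverconfig.json and pass the active model's settings to Llama.
import json

from llama_cpp import Llama

with open("serverconfig.json", "r", encoding="utf-8") as f:
    config = json.load(f)

active = next(m for m in config["models"] if m["name"] == config["active_model"])
params = {k: v for k, v in active.items() if k not in ("name", "model")}

llm = Llama(model_path=active["model"], **params)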

Usage

Standalone Testing

Test the server directly:

cd /path/to/chatbot-llm-server

# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate

python llm_server.py

You should see:

INFO:__main__:Starting MCP server...
INFO:__main__:✓ Background model initialization task started
INFO:__main__:Starting background model initialization...
INFO:__main__:✓ Model initialization completed successfully!

MCP Integration

The server is designed to be started by an MCP client. Configure your client with:

MCP_SERVERS = {
    "llm": {
        "command": "/path/to/chatbot-llm-server/llm-env/Scripts/python.exe",  # Windows
        # "command": "/path/to/chatbot-llm-server/llm-env/bin/python",      # Linux/Mac
        "args": ["/path/to/chatbot-llm-server/llm_server.py"],
        "required": False,
        "env": None
    }
}

Available Tools

The server provides these MCP tools:

  • health_check: Get server and model status
  • initialize_model: Manually trigger model loading
  • get_model_info: Get information about loaded model
  • chat_completion: Generate chat completions with tool support
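As a rough illustration, an MCP client can exercise these tools over stdio with the official mcp Python SDK; the paths below are placeholders.

# Sketch: call the server's tools from an MCP client over stdio (paths are placeholders).
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(
        command="/path/to/chatbot-llm-server/llm-env/bin/python",
        args=["/path/to/chatbot-llm-server/llm_server.py"],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            status = await session.call_tool("health_check", arguments={})
            print(status)

asyncio.run(main())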

Troubleshooting

Common Issues

"No module named 'llama_cpp'"

# Ensure you're in the right virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

"unknown model architecture: 'gemma3'"

Your llama-cpp-python version is too old. Update to 0.3.8+:

pip install --upgrade llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

"Failed to load model from file"

  • Check that the model file exists in the models/ directory
  • Verify the file path in serverconfig.json is correct
  • Ensure the model file isn't corrupted (should be several GB)

Model loads manually but not via MCP

  • Verify the MCP client is using the correct Python environment
  • Check that the working directory is set correctly
  • Add debug logging to see what differs between manual and MCP execution (see the sketch below)
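A quick way to compare the two launch modes is to log the interpreter and working directory at startup. The lines below are a small illustrative sketch, not code from llm_server.py.

# Sketch: log environment details at startup to compare manual vs. MCP-launched runs.
import logging
import os
import sys

logger = logging.getLogger(__name__)
logger.info("Python executable: %s", sys.executable)
logger.info("Working directory: %s", os.getcwd())
logger.info("PATH: %s", os.environ.get("PATH", ""))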

Testing Model Loading

Test if your model loads correctly:

cd /path/to/chatbot-llm-server

# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate

python -c "from llama_cpp import Llama; llm = Llama(model_path='models/your-model.gguf', verbose=False); print('Model loaded successfully')"

Debug Logging

Enable file logging by modifying the logging setup in llm_server.py:

log_filename = f"llm_server_debug_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
    level=logging.DEBUG,
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler(log_filename, mode='w')
    ]
)

Performance Notes

  • Memory Usage: A 27B Q4_K_M model uses ~15-20GB of VRAM
  • Loading Time: Initial model loading takes 30-60 seconds
  • Inference Speed: ~20-50 tokens/second depending on hardware
  • Context Window: Configured for 10K tokens (can be increased if VRAM allows)

Model Compatibility

This server supports:

  • Gemma 3 models (1B, 4B, 12B, 27B variants)
  • Llama 3.1 models (8B, 70B variants)
  • Mistral models (7B, 22B variants)
  • Any GGUF format model supported by llama.cpp

Ensure your llama-cpp-python version supports the model architecture you want to use.

Development

File Structure

chatbot-llm-server/
├── llm_server.py          # MCP server implementation
├── llm_engine.py          # Model loading and inference engine
├── config.py              # Configuration loader
├── serverconfig.json      # Model configuration
├── models/                # Model files directory
├── llm-env/              # Virtual environment
└── README.md             # This file

Adding New Models

  1. Add the model to serverconfig.json:

{
  "name": "your-model-name",
  "model": "models/your-model.gguf",
  "n_ctx": 4096,
  "n_gpu_layers": 40
}

  2. Update the active_model field to use the new model

  3. Restart the server

License

This project is part of the larger chatbot ecosystem. See individual license files for details.
