
Chatbot LLM Server
A Model Context Protocol (MCP) server that provides Large Language Model (LLM) inference capabilities using llama.cpp through its Python bindings. The server loads and serves Gemma 3 models (and other GGUF-format models) for chat completion tasks.
Features
- MCP Integration: Fully compatible with the Model Context Protocol for seamless integration
- CUDA Support: GPU acceleration for faster inference
- Lazy Loading: The model loads in the background after the MCP connection is established (see the sketch below)
- Tool Support: Built-in function calling capabilities
- Health Monitoring: Real-time status reporting and health checks
- Error Recovery: Graceful handling of model loading failures
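The lazy-loading and health-monitoring behavior roughly follows the pattern below. This is a minimal sketch with illustrative names and paths (assuming asyncio and llama-cpp-python), not the actual llm_server.py code:

import asyncio

model = None
model_status = "loading"

async def initialize_model_in_background(model_path: str, n_gpu_layers: int = 80):
    """Load the GGUF model off the event loop so MCP requests stay responsive."""
    global model, model_status
    try:
        from llama_cpp import Llama
        model = await asyncio.to_thread(
            Llama, model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=False
        )
        model_status = "ready"
    except Exception as exc:  # surfaced to clients through health_check
        model_status = f"error: {exc}"

def health_check() -> dict:
    """Roughly what the health_check tool reports while/after the model loads."""
    return {"status": model_status, "model_loaded": model is not None}

# Once the MCP connection is up, the server can schedule the load, e.g.:
# asyncio.create_task(initialize_model_in_background("models/gemma-3-27b-it-Q4_K_M.gguf"))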
Prerequisites
- Python 3.11 or higher
- NVIDIA GPU with CUDA 12.3+ (for GPU acceleration)
Installation
1. Create Virtual Environment
cd /path/to/chatbot-llm-server
python -m venv llm-env
# On Windows:
llm-env\Scripts\activate
# On Linux/Mac:
source llm-env/bin/activate
2. Install Dependencies
Basic MCP Support
pip install mcp
llama-cpp-python with CUDA Support
For CUDA 12.3 support (recommended):
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
Or install a specific version if needed (Windows example):
pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.8-cu123/llama_cpp_python-0.3.8-cp311-cp311-win_amd64.whl --force-reinstall --no-cache-dir
Verify Installation
python -c "import llama_cpp; print(f'llama-cpp-python version: {llama_cpp.__version__}')"
python -c "import mcp; print('MCP installed successfully')"
3. Download Model
Place your GGUF model file in the models/ directory:
chatbot-llm-server/
├── models/
│   └── gemma-3-27b-it-Q4_K_M.gguf
├── llm_server.py
├── llm_engine.py
├── config.py
└── serverconfig.json
4. Configuration
Create config.py
# LLM Configuration
LLM_CONFIG = {
    "config_path": "serverconfig.json"
}
Create serverconfig.json
{
    "active_model": "gemma3-27b",
    "models": [
        {
            "name": "gemma3-27b",
            "model": "models/gemma-3-27b-it-Q4_K_M.gguf",
            "n_ctx": 10000,
            "n_gpu_layers": 80,
            "n_threads": 12,
            "n_batch": 512,
            "verbose": false,
            "rope_freq_base": 32000,
            "rope_freq_scale": 1.0,
            "offload_kqv": true,
            "main_gpu": 1
        }
    ]
}
Configuration Parameters:
- n_ctx: Context window size (10000 = ~10K tokens)
- n_gpu_layers: Number of layers to offload to the GPU (80 = most of the model)
- n_threads: CPU threads for processing
- n_batch: Batch size for processing
- rope_freq_base: RoPE frequency base (32000 for Gemma 3)
- rope_freq_scale: RoPE scaling factor
- offload_kqv: Offload the key/value cache to the GPU
- main_gpu: Primary GPU device ID
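These entries map onto llama-cpp-python's Llama() constructor arguments of the same names. The following is a simplified sketch of what the engine does when it loads the active model, not the actual llm_engine.py code:

import json
from llama_cpp import Llama

with open("serverconfig.json") as f:
    cfg = json.load(f)

# Pick the entry named by active_model
entry = next(m for m in cfg["models"] if m["name"] == cfg["active_model"])

llm = Llama(
    model_path=entry["model"],
    n_ctx=entry["n_ctx"],
    n_gpu_layers=entry["n_gpu_layers"],
    n_threads=entry["n_threads"],
    n_batch=entry["n_batch"],
    verbose=entry["verbose"],
    rope_freq_base=entry["rope_freq_base"],
    rope_freq_scale=entry["rope_freq_scale"],
    offload_kqv=entry["offload_kqv"],
    main_gpu=entry["main_gpu"],
)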
Usage
Standalone Testing
Test the server directly:
cd /path/to/chatbot-llm-server
# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
python llm_server.py
You should see:
INFO:__main__:Starting MCP server...
INFO:__main__:✓ Background model initialization task started
INFO:__main__:Starting background model initialization...
INFO:__main__:✓ Model initialization completed successfully!
MCP Integration
The server is designed to be started by an MCP client. Configure your client with:
MCP_SERVERS = {
    "llm": {
        "command": "/path/to/chatbot-llm-server/llm-env/Scripts/python.exe",  # Windows
        # "command": "/path/to/chatbot-llm-server/llm-env/bin/python",  # Linux/Mac
        "args": ["/path/to/chatbot-llm-server/llm_server.py"],
        "required": False,
        "env": None
    }
}
Available Tools
The server provides these MCP tools:
- health_check: Get server and model status
- initialize_model: Manually trigger model loading
- get_model_info: Get information about the loaded model
- chat_completion: Generate chat completions with tool support
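A client built on the official mcp Python SDK can spawn the server over stdio and call these tools roughly as follows. This is a hedged sketch: the paths are placeholders and the chat_completion argument names are assumptions, so check the server's tool schema for the exact shape:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="/path/to/chatbot-llm-server/llm-env/bin/python",
        args=["/path/to/chatbot-llm-server/llm_server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            print(await session.call_tool("health_check", {}))
            # Argument names are illustrative; see the server's tool definition.
            result = await session.call_tool(
                "chat_completion",
                {"messages": [{"role": "user", "content": "Hello!"}]},
            )
            print(result)

asyncio.run(main())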
Troubleshooting
Common Issues
"No module named 'llama_cpp'"
# Ensure you're in the right virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
"unknown model architecture: 'gemma3'"
Your llama-cpp-python version is too old. Update to 0.3.8+:
pip install --upgrade llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
"Failed to load model from file"
- Check that the model file exists in the models/ directory (see the check below)
- Verify the file path in serverconfig.json is correct
- Ensure the model file isn't corrupted (it should be several GB)
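A quick throwaway check, run from the project root, covers the first two points; it assumes the serverconfig.json layout shown above:

import json, os

with open("serverconfig.json") as f:
    cfg = json.load(f)
entry = next(m for m in cfg["models"] if m["name"] == cfg["active_model"])
path = entry["model"]
print(path, "exists:", os.path.exists(path))
if os.path.exists(path):
    print(f"size: {os.path.getsize(path) / 1e9:.1f} GB")  # a 27B Q4_K_M is roughly 16 GB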
Model loads manually but not via MCP
- Verify the MCP client is using the correct Python environment
- Check that the working directory is set correctly
- Add debug logging to see what's different between manual and MCP execution
Testing Model Loading
Test if your model loads correctly:
cd /path/to/chatbot-llm-server
# Activate virtual environment
# Windows:
llm-env\Scripts\activate
# Linux/Mac:
source llm-env/bin/activate
python -c "from llama_cpp import Llama; llm = Llama(model_path='models/your-model.gguf', verbose=False); print('Model loaded successfully')"
Debug Logging
Enable file logging by modifying the logging setup in llm_server.py:
log_filename = f"llm_server_debug_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
level=logging.DEBUG,
handlers=[
logging.StreamHandler(),
logging.FileHandler(log_filename, mode='w')
]
)
Performance Notes
- Memory Usage: A 27B Q4_K_M model uses ~15-20 GB of VRAM (rough estimate below)
- Loading Time: Initial model loading takes 30-60 seconds
- Inference Speed: ~20-50 tokens/second depending on hardware
- Context Window: Configured for 10K tokens (can be increased if VRAM allows)
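As a rough sanity check on the memory figure above, assuming Q4_K_M averages about 4.8 bits per weight (KV cache and CUDA overhead add several more GB on top):

params = 27e9              # parameter count of the 27B model
bits_per_weight = 4.8      # rough average for Q4_K_M quantization
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone")  # about 16 GB before KV cache/overhead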
Model Compatibility
This server supports:
- Gemma 3 models (1B, 4B, 12B, and 27B variants)
- Llama 3.1 models (8B, 70B variants)
- Mistral models (7B, 22B variants)
- Any GGUF format model supported by llama.cpp
Ensure your llama-cpp-python version supports the model architecture you want to use.
Development
File Structure
chatbot-llm-server/
├── llm_server.py # MCP server implementation
├── llm_engine.py # Model loading and inference engine
├── config.py # Configuration loader
├── serverconfig.json # Model configuration
├── models/ # Model files directory
├── llm-env/ # Virtual environment
└── README.md # This file
Adding New Models
- Add the model to serverconfig.json:

  {
      "name": "your-model-name",
      "model": "models/your-model.gguf",
      "n_ctx": 4096,
      "n_gpu_layers": 40
  }

- Update the active_model field to use the new model
- Restart the server
License
This project is part of the larger chatbot ecosystem. See individual license files for details.
Quick Start
# Clone the repository
git clone https://github.com/grahzny/chatbot-llm-server
cd chatbot-llm-server
# Then follow the Installation, Configuration, and Usage sections above to set up
# the virtual environment, install the dependencies, and download a model.