
chatbot app
This is the source repository for chatbot-app.
Repository Info
About This Server
Model Context Protocol (MCP) - This server can be integrated with AI applications to provide additional context and capabilities for enhanced AI interactions.
Documentation
RAG Reference Architecture using LLaMA Stack, OpenShift AI, and PGVector
Description
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant external knowledge to improve accuracy, reduce hallucinations, and support domain-specific conversations. This architecture uses:
- OpenShift AI for orchestration
- LLaMA Stack for standardizing the core building blocks and simplifying AI application development
- PGVector for semantic search
- Kubeflow Pipelines for data ingestion
- Streamlit UI for a user-friendly chatbot interface
Architecture Diagram
(Architecture diagram: RAG System Architecture)
The architecture illustrates both the ingestion pipeline for document processing and the RAG pipeline for query handling. See the architecture documentation for more details.
Features
- Multi-modal data ingestion for unstructured data
- Preprocessing pipelines for cleaning, chunking, and embedding generation using language models
- Vector Store Integration to store dense embeddings
- Integrates with LLMs to generate responses based on retrieved documents
- Streamlit based web application
- Runs on OpenShift AI for container orchestration and GPU acceleration
- Llama Stack to standardize the core building blocks and simplify AI application development
- Safety guardrails to block harmful requests and responses
- Integration with MCP servers
Ingestion Use Cases
1. BYOD (Bring Your Own Document)
End users can upload files through a UI and receive contextual answers based on uploaded content.
2. Pre-Ingestion
Enterprise documents are pre-processed and ingested into the system for later querying via OpenShift AI/Kubeflow Pipelines.
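In both cases the ingestion step reduces to the same flow: split the document into chunks, embed each chunk with the embedding model, and store the vectors in PGVector. Below is a minimal, illustrative sketch of that flow; the connection string, table name, and chunking parameters are assumptions for illustration, not the actual Kubeflow pipeline code.

```python
# Illustrative only - the real ingestion runs as a Kubeflow Pipeline.
# Connection string, table name, and chunk sizes are assumptions.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with a small overlap between chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


conn = psycopg2.connect("dbname=rag user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id serial PRIMARY KEY, content text, embedding vector(384))"
    )
    for piece in chunk(open("my_document.txt").read()):
        vec = embedder.encode(piece).tolist()
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
            (piece, str(vec)),
        )
```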
Key Components
| Layer | Component | Description |
|---|---|---|
| UI Layer | Streamlit / React | Chat-based user interaction |
| Retrieval | Retriever | Vector search |
| Embedding | all-MiniLM-L6-v2 | Converts text to vectors |
| Vector DB | PostgreSQL + PGVector | Stores embeddings |
| LLM | Llama-3.2-3B-Instruct | Generates responses |
| Ingestor | Kubeflow Pipeline | Embeds documents and stores vectors |
| Storage | S3 Bucket | Document source |
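To make the table concrete, here is a minimal, illustrative sketch of the query path: the question is embedded with all-MiniLM-L6-v2, PGVector returns the nearest chunks, and the augmented prompt is handed to the generation model. The table schema and connection details are assumptions carried over from the ingestion sketch above, not the actual retriever code.

```python
# Illustrative only - the real retriever runs inside Llama Stack.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

question = "How do I configure the safety model?"
query_vec = str(embedder.encode(question).tolist())

conn = psycopg2.connect("dbname=rag user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # <=> is the cosine distance operator provided by the pgvector extension.
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 3",
        (query_vec,),
    )
    context = "\n\n".join(row[0] for row in cur.fetchall())

# The augmented prompt is then sent to Llama-3.2-3B-Instruct for generation.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```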
Scalability & Performance
- KServe for auto-scaling the model and embedding pods
- GPU-based inference optimized using node selectors
- Horizontal scaling of ingestion and retrieval components
The kickstart supports two modes of deployment:
- Local
- OpenShift
OpenShift Installation
Minimum Requirements
- OpenShift Cluster 4.16+ with OpenShift AI
- OpenShift Client CLI - oc
- Helm CLI - helm
- huggingface-cli (Optional)
- 2 GPUs, each with 24 GB of VRAM: one for the LLM and another for the safety model (refer to the chart below)
- Hugging Face Token
- Access to Meta Llama model.
- Access to Meta Llama Guard model.
- Some of the example scripts use `jq`, a JSON parsing utility, which you can acquire via `brew install jq`
Supported Models
| Function | Model Name | GPU | AWS Instance |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 | CPU or GPU | |
| Safety | meta-llama/Llama-Guard-3-8B | L4 | g6.2xlarge |
| Generation | meta-llama/Llama-3.2-3B-Instruct | L4 | g6.2xlarge |
| Generation | meta-llama/Meta-Llama-3-70B-Instruct | A100 x2 | p4d.24xlarge |
Note: The 70B model is NOT required for initial testing of this example.
Installation steps
- Clone the repo so you have a working copy
git clone https://github.com/RHEcosystemAppEng/RAG-Blueprint
- Login to your OpenShift Cluster
oc login --server="<cluster-api-endpoint>" --token="sha256~XYZ"
- Find your GPU worker node label and taint
List worker nodes with the worker-gpu label:
oc get nodes -l node-role.kubernetes.io/worker-gpu
Or list nodes whose labels start with the nvidia.com/gpu prefix:
oc get nodes -o json | jq -r '
.items[] |
select(
.metadata.labels | keys[] | startswith("nvidia.com/gpu")
) |
.metadata.name
'
Find one of those GPU-enabled worker nodes, assuming they are configured the same
NODE=$(oc get nodes -l node-role.kubernetes.io/worker-gpu --no-headers | head -n 1 | awk '{print $1}')
Extract the taints section of the worker node
oc get node $NODE -o json | jq '.spec.taints'
[
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "True"
}
]
You can work with your OpenShift cluster admin team to determine which labels and taints identify GPU-enabled worker nodes. It is also possible that all your worker nodes have GPUs and therefore have no distinguishing taint.
- Navigate to Helm deploy directory
cd deploy/helm
- List available models
make list-models
The above command will list the models to use in the next command
(Output)
model: llama-3-2-3b-instruct
model: llama-guard-3-8b (shield)
- Install via make
Use the taint key from above as the LLM_TOLERATION and SAFETY_TOLERATION
The namespace will be auto-created
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="nvidia.com/gpu"
If you have no tainted nodes (perhaps every worker node has a GPU), you can use a simplified version of the make command:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b
When prompted, enter your Hugging Face Token.
Note: This process often takes 11 to 30 minutes
- Watch/Monitor
oc get pods -n llama-stack-rag
(Output)
NAME READY STATUS RESTARTS AGE
llama-3-2-3b-instruct-predictor-00001-deployment-5499d774cwtqbc 3/3 Running 0 11m
llama-guard-3-8b-predictor-00001-deployment-7768b4d4ff-kqrnf 3/3 Running 0 11m
llamastack-7f59d798f8-8zkzw 1/1 Running 0 2m27s
mcp-servers-weather-65cff98c8b-n6qr7 1/1 Running 0 11m
minio-0 1/1 Running 0 11m
pgvector-0 1/1 Running 0 11m
rag-pipeline-notebook-0 2/2 Running 0 2m22s
rag-rag-ui-7f5dcb5cf4-8mfv6 1/1 Running 0 11m
- Verify:
oc get pods -n llama-stack-rag
oc get svc -n llama-stack-rag
oc get routes -n llama-stack-rag
Using the RAG UI
- Get the route URL for the application
URL=http://$(oc get routes -l app.kubernetes.io/name=rag-ui -o jsonpath="{range .items[*]}{.status.ingress[0].host}{end}")
open $URL
(Screenshot: RAG UI Main)
- Click on RAG
- Upload your document
- Create a Vector Database
(Screenshot: RAG UI Main 2)
- Once you've received "Vector database created successfully!", select the Vector Database you created
- Ask a question pertaining to your document!
Refer to the post installation document for document ingestion.
- Uninstalling the RAG application
make uninstall NAMESPACE=llama-stack-rag
or
oc delete project llama-stack-rag
Defining a new model
To deploy a new model using the llm-service Helm chart or connect to an existing vLLM server, follow these steps:
- Deploying a Model via `llm-service`
If you're deploying the model with `llm-service`, edit the file `deploy/helm/llm-service/values-gpu.yaml` and add a new model definition under the `.models` section to specify the model you want deployed with the `llm-service` chart and its args:
models:
  llama-3-2-3b-instruct:
    id: meta-llama/Llama-3.2-3B-Instruct
    enabled: false
    inferenceService:
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --max-model-len
        - "30544"
- Update `llama-stack` Configuration
Edit the file `deploy/helm/rag-ui/charts/llama-stack/values.yaml` and add a corresponding entry under `.models` for the Llama Stack configuration:
llama-3-2-3b-instruct:
  id: meta-llama/Llama-3.2-3B-Instruct
  enabled: false
  url: local-ns
Notes:
- If the model is not deployed with `llm-service` in the same namespace as `llama-stack`, you do not need to modify the `llm-service` values. Instead, just configure the external model in `llama-stack` and replace `local-ns` with a url and an optional `apiToken`.
- To use the new model, set the `enabled` flags to true.
Local Development Setup
- From the root of the project, switch to the ui directory
cd ui
- Create a virtual environment (Python-based development often works better with a virtual environment)
python3.11 -m venv venv
source venv/bin/activate
- Download the dependencies
pip install -r requirements.txt
- Port-forward the service inside OpenShift to the local machine on port 8321
oc port-forward svc/llamastack 8321:8321
- Launch the application, which opens a browser tab, with the `streamlit` command
streamlit run app.py
- Test the weather MCP-based tool with a US-based city by toggling on "mcp::weather" for real-time weather information
(Screenshot: RAG UI MCP weather)
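With the llamastack service port-forwarded to localhost:8321, a quick way to confirm the endpoint is reachable from Python is to query the /v1/models endpoint (the same one queried with curl in the Shields section below):

```python
# Quick connectivity check against the port-forwarded Llama Stack service.
import os

import requests

endpoint = os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:8321")
resp = requests.get(f"{endpoint}/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["identifier"])
```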
Redeploy Changes
Make changes to app.py
Deployment after making changes requires a rebuild of the container image using either docker or podman. Replace docker.io with your target container registry such as quay.io.
docker buildx build --platform linux/amd64,linux/arm64 -t docker.io/burrsutter/rag-ui:v1 -f Containerfile .
docker push docker.io/burrsutter/rag-ui:v1
Add the modification to deploy/helm/rag-ui/values.yaml:
image:
repository: docker.io/burrsutter/rag-ui
pullPolicy: IfNotPresent
tag: v1
To redeploy to the cluster, run the same make command as you did before.
Shields
export LLAMA_STACK_ENDPOINT=http://localhost:8321
First, see which models are available:
curl -sS $LLAMA_STACK_ENDPOINT/v1/models -H "Content-Type: application/json" | jq -r '.data[].identifier'
(Output)
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-Guard-3-8B
all-MiniLM-L6-v2
The "Guard" model is the one appropriate for adding as a Llama Stack Shield.
From within the ui directory or whichever one has the venv with the dependencies:
- Register the shield
python ../shields/register-shield.py
- List shields
python ../shields/list-shields.py
- Test the shield
python ../shields/test-shield.py
(Output)
LLAMA_STACK_ENDPOINT: http://localhost:8321
LLAMA_STACK_MODEL: meta-llama/Llama-3.2-3B-Instruct
Safety violation detected: I can't answer that. Can I help with something else?
'response: <generator object Agent._create_turn_streaming at 0x1052ecd60>'
shield_call> No Violation
inference> The friendly stranger smiled and said hello as she approached the table where I was sitting alone.
'response: <generator object Agent._create_turn_streaming at 0x1052ed000>'
shield_call> {'violation_type': 'S1'} I can't answer that. Can I help with something else?
Quick Start
Clone the repository
git clone https://github.com/rocrisp/chatbot-app
Install dependencies
cd chatbot-app
npm install
Follow the documentation
Check the repository's README.md file for specific installation and usage instructions.