
chatbot app
This is the source repository for chatbot-app.
Repository Info
About This Server
Model Context Protocol (MCP) - This server can be integrated with AI applications to provide additional context and capabilities for enhanced AI interactions.
Documentation
RAG Reference Architecture using LLaMA Stack, OpenShift AI, and PGVector
Description
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant external knowledge to improve accuracy, reduce hallucinations, and support domain-specific conversations. This architecture uses:
- OpenShift AI for orchestration
- LLaMA Stack for standardizing the core building blocks and simplifying AI application development
- PGVector for semantic search
- Kubeflow Pipelines for data ingestion
- Streamlit UI for a user-friendly chatbot interface
Architecture Diagram
(Architecture diagram: RAG System Architecture)
The architecture illustrates both the ingestion pipeline for document processing and the RAG pipeline for query handling. See the architecture documentation for more details.
Features
- Multi-modal data ingestion for unstructured data
- Preprocessing pipelines for cleaning, chunking, and embedding generation using language models
- Vector Store Integration to store dense embeddings
- Integrates with LLMs to generate responses based on retrieved documents
- Streamlit based web application
- Runs on OpenShift AI for container orchestration and GPU acceleration
- Llama Stack to standardize the core building blocks and simplify AI application development
- Safety guardrails to block harmful requests and responses
- Integration with MCP servers
Ingestion Use Cases
1. BYOD (Bring Your Own Document)
End users can upload files through a UI and receive contextual answers based on uploaded content.
2. Pre-Ingestion
Enterprise documents are pre-processed and ingested into the system for later querying via OpenShift AI/Kubeflow Pipelines.
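In both cases the ingestion step reduces to the same flow: split the document into chunks, embed each chunk with the embedding model, and store the vectors in PGVector. Below is a minimal, illustrative sketch of that flow; the connection string, table name, and chunking parameters are assumptions for illustration, not the actual Kubeflow pipeline code.

```python
# Illustrative only - the real ingestion runs as a Kubeflow Pipeline.
# Connection string, table name, and chunk sizes are assumptions.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with a small overlap between chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


conn = psycopg2.connect("dbname=rag user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id serial PRIMARY KEY, content text, embedding vector(384))"
    )
    for piece in chunk(open("my_document.txt").read()):
        vec = embedder.encode(piece).tolist()
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
            (piece, str(vec)),
        )
```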
Key Components
| Layer | Component | Description |
|---|---|---|
| UI Layer | Streamlit / React | Chat-based user interaction |
| Retrieval | Retriever | Vector search |
| Embedding | all-MiniLM-L6-v2 | Converts text to vectors |
| Vector DB | PostgreSQL + PGVector | Stores embeddings |
| LLM | Llama-3.2-3B-Instruct | Generates responses |
| Ingestor | Kubeflow Pipeline | Embeds documents and stores vectors |
| Storage | S3 Bucket | Document source |
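To make the table concrete, here is a minimal, illustrative sketch of the query path: the question is embedded with all-MiniLM-L6-v2, PGVector returns the nearest chunks, and the augmented prompt is handed to the generation model. The table schema and connection details are assumptions carried over from the ingestion sketch above, not the actual retriever code.

```python
# Illustrative only - the real retriever runs inside Llama Stack.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

question = "How do I configure the safety model?"
query_vec = str(embedder.encode(question).tolist())

conn = psycopg2.connect("dbname=rag user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # <=> is the cosine distance operator provided by the pgvector extension.
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 3",
        (query_vec,),
    )
    context = "\n\n".join(row[0] for row in cur.fetchall())

# The augmented prompt is then sent to Llama-3.2-3B-Instruct for generation.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```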
Scalability & Performance
- KServe for auto-scaling the model and embedding pods
- GPU-based inference optimized using node selectors
- Horizontal scaling of ingestion and retrieval components
The kickstart supports two modes of deployment:
- Local
- OpenShift
OpenShift Installation
Minimum Requirements
- OpenShift Cluster 4.16+ with OpenShift AI
- OpenShift Client CLI - oc
- Helm CLI - helm
- huggingface-cli (Optional)
- 2 GPUs, each with 24 GB of VRAM: one for the LLM and another for the safety model (refer to the chart below)
- Hugging Face Token
- Access to Meta Llama model.
- Access to Meta Llama Guard model.
- Some of the example scripts use `jq`, a JSON parsing utility, which you can acquire via `brew install jq`
Supported Models
| Function | Model Name | GPU | AWS Instance |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 | CPU or GPU | |
| Safety | meta-llama/Llama-Guard-3-8B | L4 | g6.2xlarge |
| Generation | meta-llama/Llama-3.2-3B-Instruct | L4 | g6.2xlarge |
| Generation | meta-llama/Meta-Llama-3-70B-Instruct | A100 x2 | p4d.24xlarge |
Note: The 70B model is NOT required for initial testing of this example.
Installation steps
- Clone the repo so you have a working copy
git clone https://github.com/RHEcosystemAppEng/RAG-Blueprint
- Login to your OpenShift Cluster
oc login --server="<cluster-api-endpoint>" --token="sha256~XYZ"
- Find your GPU worker node label and taint
List worker nodes with the worker-gpu label:
oc get nodes -l node-role.kubernetes.io/worker-gpu
Or list nodes whose labels start with the nvidia.com/gpu prefix:
oc get nodes -o json | jq -r '
.items[] |
select(
.metadata.labels | keys[] | startswith("nvidia.com/gpu")
) |
.metadata.name
'
Find one of those GPU-enabled worker nodes, assuming they are configured the same
NODE=$(oc get nodes -l node-role.kubernetes.io/worker-gpu --no-headers | head -n 1 | awk '{print $1}')
Extract the taints section of the worker node
oc get node $NODE -o json | jq '.spec.taints'
[
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "True"
}
]
You can work with your OpenShift cluster admin team to determine which labels and taints identify GPU-enabled worker nodes. It is also possible that all your worker nodes have GPUs and therefore have no distinguishing taint.
- Navigate to Helm deploy directory
cd deploy/helm
- List available models
make list-models
The above command will list the models to use in the next command
(Output)
model: llama-3-2-3b-instruct
model: llama-guard-3-8b (shield)
- Install via make
Use the taint key from above as the LLM_TOLERATION and SAFETY_TOLERATION
The namespace will be auto-created
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="nvidia.com/gpu"
If you have no tainted nodes (perhaps every worker node has a GPU), you can use a simplified version of the make command:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b
When prompted, enter your Hugging Face Token.
Note: This process often takes 11 to 30 minutes
- Watch/Monitor
oc get pods -n llama-stack-rag
(Output)
NAME READY STATUS RESTARTS AGE
llama-3-2-3b-instruct-predictor-00001-deployment-5499d774cwtqbc 3/3 Running 0 11m
llama-guard-3-8b-predictor-00001-deployment-7768b4d4ff-kqrnf 3/3 Running 0 11m
llamastack-7f59d798f8-8zkzw 1/1 Running 0 2m27s
mcp-servers-weather-65cff98c8b-n6qr7 1/1 Running 0 11m
minio-0 1/1 Running 0 11m
pgvector-0 1/1 Running 0 11m
rag-pipeline-notebook-0 2/2 Running 0 2m22s
rag-rag-ui-7f5dcb5cf4-8mfv6 1/1 Running 0 11m
- Verify:
oc get pods -n llama-stack-rag
oc get svc -n llama-stack-rag
oc get routes -n llama-stack-rag
Using the RAG UI
- Get the route URL for the application
URL=http://$(oc get routes -l app.kubernetes.io/name=rag-ui -o jsonpath="{range .items[*]}{.status.ingress[0].host}{end}")
open $URL
(Screenshot: RAG UI Main)
- Click on RAG
- Upload your document
- Create a Vector Database
(Screenshot: RAG UI Main 2)
- Once you've received "Vector database created successfully!", select the Vector Database you created
- Ask a question pertaining to your document!
Refer to the post installation document for document ingestion.
- Uninstalling the RAG application
make uninstall NAMESPACE=llama-stack-rag
or
oc delete project llama-stack-rag
Defining a new model
To deploy a new model using the llm-service Helm chart or connect to an existing vLLM server, follow these steps:
- Deploying a Model via `llm-service`
If you're deploying the model with `llm-service`, edit the file `deploy/helm/llm-service/values-gpu.yaml` and add a new model definition under the `.models` section to specify the model you want deployed with the `llm-service` chart and its args:
models:
  llama-3-2-3b-instruct:
    id: meta-llama/Llama-3.2-3B-Instruct
    enabled: false
    inferenceService:
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --max-model-len
        - "30544"
- Update `llama-stack` Configuration
Edit the file `deploy/helm/rag-ui/charts/llama-stack/values.yaml` and add a corresponding entry under `.models` for the Llama Stack configuration:
llama-3-2-3b-instruct:
  id: meta-llama/Llama-3.2-3B-Instruct
  enabled: false
  url: local-ns
Notes:
- If the model is not deployed with `llm-service` in the same namespace as `llama-stack`, you do not need to modify the `llm-service` values. Instead, just configure the external model in `llama-stack` and replace `local-ns` with a url and an optional `apiToken`.
- To use the new model, set the `enabled` flags to true.
Local Development Setup
- From the root of the project, switch to the ui directory
cd ui
- Create a virtual environment (Python-based development often works better with a virtual environment)
python3.11 -m venv venv
source venv/bin/activate
- Download the dependencies
pip install -r requirements.txt
- Port-forward the service inside OpenShift to the local machine on port 8321
oc port-forward svc/llamastack 8321:8321
- Launch the application, which opens a browser tab, with the `streamlit` command
streamlit run app.py
- Test the weather MCP-based tool with a US-based city by toggling on "mcp::weather" for real-time weather information
(Screenshot: RAG UI MCP weather)
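With the llamastack service port-forwarded to localhost:8321, a quick way to confirm the endpoint is reachable from Python is to query the /v1/models endpoint (the same one queried with curl in the Shields section below):

```python
# Quick connectivity check against the port-forwarded Llama Stack service.
import os

import requests

endpoint = os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:8321")
resp = requests.get(f"{endpoint}/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["identifier"])
```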
Redeploy Changes
Make changes to app.py
Deployment after making changes requires a rebuild of the container image using either docker or podman. Replace docker.io with your target container registry such as quay.io.
docker buildx build --platform linux/amd64,linux/arm64 -t docker.io/burrsutter/rag-ui:v1 -f Containerfile .
docker push docker.io/burrsutter/rag-ui:v1
Add the modification to deploy/helm/rag-ui/values.yaml:
image:
repository: docker.io/burrsutter/rag-ui
pullPolicy: IfNotPresent
tag: v1
To redeploy to the cluster, run the same make command as you did before.
Shields
export LLAMA_STACK_ENDPOINT=http://localhost:8321
First, see which models are available:
curl -sS $LLAMA_STACK_ENDPOINT/v1/models -H "Content-Type: application/json" | jq -r '.data[].identifier'
(Output)
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-Guard-3-8B
all-MiniLM-L6-v2
The "Guard" model is the one appropriate for adding as a Llama Stack Shield.
From within the ui directory or whichever one has the venv with the dependencies:
- Register the shield
python ../shields/register-shield.py
- List shields
python ../shields/list-shields.py
- Test the shield
python ../shields/test-shield.py
(Output)
LLAMA_STACK_ENDPOINT: http://localhost:8321
LLAMA_STACK_MODEL: meta-llama/Llama-3.2-3B-Instruct
Safety violation detected: I can't answer that. Can I help with something else?
'response: <generator object Agent._create_turn_streaming at 0x1052ecd60>'
shield_call> No Violation
inference> The friendly stranger smiled and said hello as she approached the table where I was sitting alone.
'response: <generator object Agent._create_turn_streaming at 0x1052ed000>'
shield_call> {'violation_type': 'S1'} I can't answer that. Can I help with something else?
Quick Start
Clone the repository
git clone https://github.com/rocrisp/chatbot-app
Install dependencies
cd chatbot-app
npm install
Follow the documentation
Check the repository's README.md file for specific installation and usage instructions.