
# Spark ETL ML Pipeline
A comprehensive data engineering and machine learning pipeline integrating Hive, Spark Streaming, Spark ETL, and TensorFlow with FastAPI serving.
## Features

- **Data Engineering**
  - Hive integration for data warehousing
  - Spark Streaming for real-time data processing
  - Batch ETL with PySpark (see the sketch after this list)
  - Automatic schema detection and validation
- **Machine Learning**
  - Jupyter Notebook training environment
  - TensorFlow and PySpark ML integration
  - Model versioning and registry
  - Inference optimization
- **API & Serving**
  - FastAPI for model serving
  - Pydantic schemas for validation
  - MCP (Model Context Protocol) servers
  - External API integration
- **Core Architecture**
  - Dependency injection
  - Modular service architecture
  - Comprehensive logging
  - Configuration management
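
As a quick illustration of the batch-ETL path, here is a minimal PySpark sketch that reads from Hive, aggregates, and writes back to the warehouse. The table and column names (`raw_db.events`, `event_ts`, and so on) are hypothetical, not taken from this repository:

```python
# Minimal batch-ETL sketch: Hive in, aggregate, Hive out.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("etl-sketch")
    .enableHiveSupport()  # requires a Hive-configured Spark deployment
    .getOrCreate()
)

# Extract: read raw events from the warehouse.
raw = spark.table("raw_db.events")

# Transform: drop incomplete rows and build a daily aggregate.
daily = (
    raw.filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Load: persist the aggregate back to Hive.
daily.write.mode("overwrite").saveAsTable("analytics_db.daily_event_counts")
```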
## Getting Started

### Prerequisites
- Python 3.11 (strict requirement)
- Java 11+
- Spark 3.3+
- Docker (optional)
- Poetry (for dependency management)
### Package Management
We use Poetry exclusively for dependency management to ensure reproducible builds and isolated environments:
```bash
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create new environment and install dependencies
poetry install

# Add new dependency
poetry add package-name

# Add development dependency
poetry add --group dev package-name

# Add documentation dependency
poetry add --group docs package-name

# Update dependencies
poetry update

# Run any command in the virtual environment
poetry run python script.py

# Activate the virtual environment
poetry shell
```
### Environment Management
The project uses Poetry for environment isolation:
- `pyproject.toml` - Project metadata and dependencies
- `poetry.lock` - Locked dependencies for reproducible builds
- `.env` - Environment variables (from `.env.example`)
- `config/` - Component-specific configurations
- `secrets/` - Secrets management (using HashiCorp Vault)
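
As one possible pattern for wiring `.env` into the configuration layer, here is a sketch assuming `pydantic-settings` (the project already uses Pydantic for schemas); the setting names and defaults are illustrative, not this repository's actual configuration:

```python
# Hypothetical typed settings loaded from .env via pydantic-settings.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # Illustrative defaults; the real keys live in .env.example.
    hive_metastore_uri: str = "thrift://localhost:9083"
    spark_master: str = "local[*]"
    api_port: int = 8000

settings = Settings()  # values in .env override the defaults above
```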
### Secrets Management
Secrets are managed using HashiCorp Vault:
```bash
# Initialize Vault (first time only)
vault operator init

# Store a secret
vault kv put secret/database password=mypassword

# Retrieve a secret
vault kv get secret/database
```
See the [Secrets Management Guide](docs/security/secrets.md) for details.
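
From Python, the same secret can be read with the `hvac` client library; this is a sketch assuming a KV v2 secrets engine mounted at the default `secret/` path, not necessarily how this repository's `secrets/` component is wired:

```python
# Read the secret stored above, assuming KV v2 at the default mount.
import hvac

client = hvac.Client(url="http://127.0.0.1:8200")  # token picked up from VAULT_TOKEN
secret = client.secrets.kv.v2.read_secret_version(path="database")
password = secret["data"]["data"]["password"]  # KV v2 nests the payload under data.data
```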
## Documentation Structure

```
docs/
├── architecture/       # System architecture documentation
│   ├── overview.md     # High-level system design
│   └── decisions.md    # Architecture decisions
├── api/                # API documentation
│   ├── endpoints.md    # API endpoints
│   └── schemas.md      # Data schemas
├── data/               # Data pipeline documentation
│   ├── etl.md          # ETL processes
│   └── streaming.md    # Streaming pipeline
├── ml/                 # Machine learning documentation
│   ├── models.md       # Model architecture
│   └── training.md     # Training pipeline
├── workflow/           # Workflow documentation
│   ├── progress.md     # Project progress tracking
│   └── development.md  # Development guidelines
└── security/           # Security documentation
    └── secrets.md      # Secrets management
```
## Development Workflow

### Code Generation

Use Cursor tools for code generation:

```bash
# Generate new service
cursor generate service user_management

# Generate new model
cursor generate model recommendation
```
### Progress Tracking

Track development progress in `workflow/progress.yaml`:

```bash
# View current progress
poetry run python -m workflow.progress status

# Update task status
poetry run python -m workflow.progress update "Complete schema detection" --status done
```
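
The progress file can also be inspected directly. This sketch assumes a hypothetical `tasks` list inside `workflow/progress.yaml`; the file's actual schema may differ:

```python
# Hypothetical direct read of workflow/progress.yaml with PyYAML.
import yaml

with open("workflow/progress.yaml") as f:
    progress = yaml.safe_load(f)

# Print one status/name line per task (keys are illustrative).
for task in progress.get("tasks", []):
    print(f"{task.get('status', 'todo'):>12}  {task.get('name', '?')}")
```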
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/your-org/spark-etl-ml-pipeline.git
   cd spark-etl-ml-pipeline
   ```

2. Install Poetry and dependencies

   ```bash
   curl -sSL https://install.python-poetry.org | python3 -
   poetry install
   ```

3. Set up environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```
## Project Structure

```
├── docs/                # Documentation
├── notebooks/           # Jupyter notebooks for exploration and training
├── src/                 # Source code
│   ├── api/             # API components
│   │   ├── fastapi/     # FastAPI implementation
│   │   └── mcp/         # MCP servers
│   ├── core/            # Core components
│   │   ├── config/      # Configuration management
│   │   ├── di/          # Dependency injection
│   │   └── logging/     # Logging infrastructure
│   ├── data/            # Data components
│   │   ├── etl/         # Batch ETL
│   │   ├── hive/        # Hive integration
│   │   ├── schema/      # Schema management
│   │   └── streams/     # Streaming components
│   ├── ml/              # Machine learning components
│   │   ├── inference/   # Model inference
│   │   ├── models/      # Model definitions
│   │   └── training/    # Model training
│   └── common/          # Shared utilities
├── tests/               # Test suite
├── workflow/            # Workflow definitions
│   ├── etl/             # ETL workflows
│   ├── inference/       # Inference workflows
│   ├── streaming/       # Streaming workflows
│   └── training/        # Training workflows
├── .env.example         # Example environment variables
├── .gitignore           # Git ignore file
├── pyproject.toml       # Python project configuration
├── poetry.lock          # Locked dependencies
└── README.md            # This file
```
## Development

### Adding New Components

- Follow the modular architecture pattern
- Update `workflow/codebasemap.json` with your new component
- Add appropriate tests
- Document your component
### Testing

```bash
# Run all tests
pytest

# Run specific test category
pytest tests/data/
```
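
Tests can also be run through the Poetry environment with `poetry run pytest`. As a sketch of what a Spark-backed unit test can look like (the transform under test and the file path are illustrative, not this repository's API):

```python
# tests/data/test_transform_sketch.py -- illustrative Spark unit test.
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="module")
def spark():
    # Local single-threaded session keeps the test lightweight.
    session = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    yield session
    session.stop()

def add_event_date(df):
    """Hypothetical transform: derive a date column from a timestamp string."""
    return df.withColumn("event_date", F.to_date("event_ts"))

def test_add_event_date(spark):
    df = spark.createDataFrame([("2024-01-01 10:00:00",)], ["event_ts"])
    row = add_event_date(df).first()
    assert row["event_date"].isoformat() == "2024-01-01"
```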
## Documentation

For full documentation, see the [docs](docs/) directory or visit our documentation site.
## Contributing

- Check `workflow/progress.yaml` for current status and next steps
- Follow the [Development Guidelines](docs/workflow/development.md)
- Use Poetry for dependency management
- Ensure documentation is updated
- Add tests for new features
## License
This project is licensed under the MIT License - see the LICENSE file for details.