bprzybysz/spark MCP Server (public)

spark

A comprehensive data engineering and machine learning pipeline integrating Hive, Spark Streaming, TensorFlow, and FastAPI.


About This Server

Model Context Protocol (MCP) - This server can be integrated with AI applications to provide additional context and capabilities, enabling enhanced AI interactions and functionality.

Documentation

Spark ETL ML Pipeline

A comprehensive data engineering and machine learning pipeline integrating Hive, Spark Streaming, Spark ETL, and TensorFlow with FastAPI serving.

Features

  • Data Engineering

    • Hive integration for data warehousing
    • Spark Streaming for real-time data processing
    • Batch ETL with PySpark
    • Automatic schema detection and validation
  • Machine Learning

    • Jupyter Notebook training environment
    • TensorFlow and PySpark ML integration
    • Model versioning and registry
    • Inference optimization
  • API & Serving

    • FastAPI for model serving (see the serving sketch after this feature list)
    • Pydantic schemas for validation
    • MCP (Model Context Protocol) servers
    • External API integration
  • Core Architecture

    • Dependency Injection
    • Modular service architecture
    • Comprehensive logging
    • Configuration management
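
To make the API & Serving bullets above concrete, here is a minimal, self-contained sketch of a FastAPI endpoint with a Pydantic request schema. The route, field names, and the placeholder scoring logic are illustrative assumptions, not the project's actual code (which lives under src/api/fastapi/):

# minimal_serving.py -- hypothetical example, not part of the repository
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="spark-etl-ml-pipeline serving (sketch)")

class PredictionRequest(BaseModel):
    # Illustrative feature vector; real schemas would live under src/api (assumption)
    features: list[float]

class PredictionResponse(BaseModel):
    score: float
    version: str

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # Placeholder inference: a real implementation would call the TensorFlow /
    # PySpark ML model loaded from the model registry.
    score = sum(request.features) / max(len(request.features), 1)
    return PredictionResponse(score=score, version="0.0.0-sketch")

# Run locally (assuming uvicorn is installed):
#   poetry run uvicorn minimal_serving:app --reload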

Getting Started

Prerequisites

  • Python 3.11 (strict requirement)
  • Java 11+
  • Spark 3.3+
  • Docker (optional)
  • Poetry (for dependency management)

Package Management

We use Poetry exclusively for dependency management to ensure reproducible builds and isolated environments:

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create new environment and install dependencies
poetry install

# Add new dependency
poetry add package-name

# Add development dependency
poetry add --group dev package-name

# Add documentation dependency
poetry add --group docs package-name

# Update dependencies
poetry update

# Run any command in the virtual environment
poetry run python script.py

# Activate the virtual environment
poetry shell

Environment Management

The project uses Poetry for environment isolation:

  • pyproject.toml - Project metadata and dependencies
  • poetry.lock - Locked dependencies for reproducible builds
  • .env - Environment variables (from .env.example); see the loading sketch after this list
  • config/ - Component-specific configurations
  • secrets/ - Secrets management (using HashiCorp Vault)
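
As an illustration of how the .env file might be consumed, here is a small sketch using python-dotenv. The variable names and the python-dotenv dependency are assumptions; the project's actual configuration loading lives under src/core/config/:

# config_sketch.py -- hypothetical example of reading .env values
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

SPARK_MASTER = os.getenv("SPARK_MASTER", "local[*]")   # illustrative variable
HIVE_METASTORE_URI = os.getenv("HIVE_METASTORE_URI")   # illustrative variable

if HIVE_METASTORE_URI is None:
    raise RuntimeError("HIVE_METASTORE_URI is not set; copy .env.example to .env first")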

Secrets Management

Secrets are managed using HashiCorp Vault:

# Initialize Vault (first time only)
vault operator init

# Store a secret
vault kv put secret/database password=mypassword

# Retrieve a secret
vault kv get secret/database

See Secrets Management Guide for details.
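
For application code, the same secret can be read programmatically. A minimal sketch using the hvac client library; the hvac dependency, Vault address, and secret layout mirror the CLI example above but are assumptions about this project:

# vault_sketch.py -- hypothetical example of reading the secret stored above
import os
import hvac

client = hvac.Client(
    url=os.getenv("VAULT_ADDR", "http://127.0.0.1:8200"),
    token=os.environ["VAULT_TOKEN"],  # token must be exported in the shell
)

# KV v2 read of the secret written by `vault kv put secret/database ...`
secret = client.secrets.kv.v2.read_secret_version(path="database")
db_password = secret["data"]["data"]["password"]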

Documentation Structure

docs/
├── architecture/          # System architecture documentation
│   ├── overview.md       # High-level system design
│   └── decisions.md      # Architecture decisions
├── api/                  # API documentation
│   ├── endpoints.md      # API endpoints
│   └── schemas.md        # Data schemas
├── data/                 # Data pipeline documentation
│   ├── etl.md           # ETL processes
│   └── streaming.md      # Streaming pipeline
├── ml/                   # Machine learning documentation
│   ├── models.md         # Model architecture
│   └── training.md       # Training pipeline
├── workflow/             # Workflow documentation
│   ├── progress.md       # Project progress tracking
│   └── development.md    # Development guidelines
└── security/             # Security documentation
    └── secrets.md        # Secrets management

Development Workflow

Code Generation

Use Cursor tools for code generation:

# Generate new service
cursor generate service user_management

# Generate new model
cursor generate model recommendation

Progress Tracking

Track development progress in workflow/progress.yaml:

# View current progress
poetry run python -m workflow.progress status

# Update task status
poetry run python -m workflow.progress update "Complete schema detection" --status done
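
The workflow.progress module is project-specific; as a rough illustration only, reading a task list out of workflow/progress.yaml might look like the following. The YAML layout and field names here are invented for the example:

# progress_sketch.py -- hypothetical reader for workflow/progress.yaml
import yaml  # PyYAML

with open("workflow/progress.yaml") as f:
    progress = yaml.safe_load(f)

# Assumed layout: a top-level "tasks" list of {name, status} mappings
for task in progress.get("tasks", []):
    print(f'{task["status"]:>12}  {task["name"]}')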

Installation

  1. Clone the repository

    git clone https://github.com/your-org/spark-etl-ml-pipeline.git
    cd spark-etl-ml-pipeline
    
  2. Install Poetry and dependencies

    curl -sSL https://install.python-poetry.org | python3 -
    poetry install
    
  3. Set up environment variables

    cp .env.example .env
    # Edit .env with your configuration
    

Project Structure

├── docs/                  # Documentation
├── notebooks/             # Jupyter notebooks for exploration and training
├── src/                   # Source code
│   ├── api/               # API components
│   │   ├── fastapi/       # FastAPI implementation
│   │   └── mcp/          # MCP servers
│   ├── core/              # Core components
│   │   ├── config/        # Configuration management
│   │   ├── di/           # Dependency injection
│   │   └── logging/      # Logging infrastructure
│   ├── data/              # Data components
│   │   ├── etl/          # Batch ETL
│   │   ├── hive/         # Hive integration
│   │   ├── schema/       # Schema management
│   │   └── streams/      # Streaming components
│   ├── ml/                # Machine learning components
│   │   ├── inference/     # Model inference
│   │   ├── models/       # Model definitions
│   │   └── training/     # Model training
│   └── common/            # Shared utilities
├── tests/                 # Test suite
├── workflow/              # Workflow definitions
│   ├── etl/               # ETL workflows
│   ├── inference/         # Inference workflows
│   ├── streaming/         # Streaming workflows
│   └── training/         # Training workflows
├── .env.example           # Example environment variables
├── .gitignore            # Git ignore file
├── pyproject.toml        # Python project configuration
├── poetry.lock           # Locked dependencies
└── README.md             # This file

Development

Adding New Components

  1. Follow the modular architecture pattern (see the sketch after this list)
  2. Update the workflow/codebasemap.json with your new component
  3. Add appropriate tests
  4. Document your component
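
As an illustration of the modular pattern referenced in step 1, a new service might be wired into a simple dependency-injection container like this. The container API below is a hand-rolled sketch, not the project's actual src/core/di/ implementation:

# di_sketch.py -- hypothetical service registration, not the real container
from typing import Any, Callable, Dict

class Container:
    """Tiny DI container: register factories, resolve instances lazily."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def resolve(self, name: str) -> Any:
        return self._factories[name]()

class UserManagementService:
    def list_users(self) -> list[str]:
        return []

container = Container()
container.register("user_management", UserManagementService)
service = container.resolve("user_management")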

Testing

# Run all tests
pytest

# Run specific test category
pytest tests/data/
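
A minimal example of what a test under tests/data/ could look like, using a Pydantic schema to stand in for the project's schema validation; the module path, schema, and field names are placeholders for illustration:

# tests/data/test_schema_sketch.py -- hypothetical test, not from the repository
import pytest
from pydantic import BaseModel, ValidationError

class RecordSchema(BaseModel):
    # Illustrative record layout; real schemas would live under src/data/schema/ (assumption)
    user_id: int
    amount: float

def test_valid_record_parses():
    record = RecordSchema(user_id=1, amount=9.99)
    assert record.user_id == 1

def test_invalid_record_is_rejected():
    with pytest.raises(ValidationError):
        RecordSchema(user_id="not-an-int", amount=9.99)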

Documentation

For full documentation, see the docs directory or visit our documentation site.

Contributing

  1. Check workflow/progress.yaml for current status and next steps
  2. Follow the Development Guidelines
  3. Use Poetry for dependency management
  4. Ensure documentation is updated
  5. Add tests for new features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Quick Start

  1. Clone the repository

    git clone https://github.com/bprzybysz/spark
    
  2. Install dependencies (this is a Python project managed with Poetry, not npm)

    cd spark
    poetry install
    
  3. Follow the documentation

    Check the repository's README.md file for specific installation and usage instructions.

Repository Details

  • Owner: bprzybysz
  • Repo: spark
  • Language: Python
  • License: -
  • Last fetched: 8/10/2025
