
# Spark ETL ML Pipeline
A comprehensive data engineering and machine learning pipeline integrating Hive, Spark Streaming, Spark ETL, and TensorFlow with FastAPI serving.
## Features

- **Data Engineering**
  - Hive integration for data warehousing
  - Spark Streaming for real-time data processing
  - Batch ETL with PySpark (see the sketch after this list)
  - Automatic schema detection and validation
- **Machine Learning**
  - Jupyter Notebook training environment
  - TensorFlow and PySpark ML integration
  - Model versioning and registry
  - Inference optimization
- **API & Serving**
  - FastAPI for model serving
  - Pydantic schemas for validation
  - MCP (Model Context Protocol) servers
  - External API integration
- **Core Architecture**
  - Dependency injection
  - Modular service architecture
  - Comprehensive logging
  - Configuration management
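
As a quick illustration of the batch-ETL path, here is a minimal PySpark sketch that reads from Hive, aggregates, and writes back to the warehouse. The table and column names (`raw_db.events`, `event_ts`, and so on) are hypothetical, not taken from this repository:

```python
# Minimal batch-ETL sketch: Hive in, aggregate, Hive out.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("etl-sketch")
    .enableHiveSupport()  # requires a Hive-configured Spark deployment
    .getOrCreate()
)

# Extract: read raw events from the warehouse.
raw = spark.table("raw_db.events")

# Transform: drop incomplete rows and build a daily aggregate.
daily = (
    raw.filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Load: persist the aggregate back to Hive.
daily.write.mode("overwrite").saveAsTable("analytics_db.daily_event_counts")
```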
## Getting Started

### Prerequisites
- Python 3.11 (strict requirement)
- Java 11+
- Spark 3.3+
- Docker (optional)
- Poetry (for dependency management)
### Package Management
We use Poetry exclusively for dependency management to ensure reproducible builds and isolated environments:
```bash
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create new environment and install dependencies
poetry install

# Add new dependency
poetry add package-name

# Add development dependency
poetry add --group dev package-name

# Add documentation dependency
poetry add --group docs package-name

# Update dependencies
poetry update

# Run any command in the virtual environment
poetry run python script.py

# Activate the virtual environment
poetry shell
```
### Environment Management
The project uses Poetry for environment isolation:
- `pyproject.toml` - Project metadata and dependencies
- `poetry.lock` - Locked dependencies for reproducible builds
- `.env` - Environment variables (from `.env.example`)
- `config/` - Component-specific configurations
- `secrets/` - Secrets management (using HashiCorp Vault)
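
As one possible pattern for wiring `.env` into the configuration layer, here is a sketch assuming `pydantic-settings` (the project already uses Pydantic for schemas); the setting names and defaults are illustrative, not this repository's actual configuration:

```python
# Hypothetical typed settings loaded from .env via pydantic-settings.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # Illustrative defaults; the real keys live in .env.example.
    hive_metastore_uri: str = "thrift://localhost:9083"
    spark_master: str = "local[*]"
    api_port: int = 8000

settings = Settings()  # values in .env override the defaults above
```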
### Secrets Management
Secrets are managed using HashiCorp Vault:
```bash
# Initialize Vault (first time only)
vault operator init

# Store a secret
vault kv put secret/database password=mypassword

# Retrieve a secret
vault kv get secret/database
```
See the [Secrets Management Guide](docs/security/secrets.md) for details.
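
From Python, the same secret can be read with the `hvac` client library; this is a sketch assuming a KV v2 secrets engine mounted at the default `secret/` path, not necessarily how this repository's `secrets/` component is wired:

```python
# Read the secret stored above, assuming KV v2 at the default mount.
import hvac

client = hvac.Client(url="http://127.0.0.1:8200")  # token picked up from VAULT_TOKEN
secret = client.secrets.kv.v2.read_secret_version(path="database")
password = secret["data"]["data"]["password"]  # KV v2 nests the payload under data.data
```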
## Documentation Structure

```
docs/
├── architecture/       # System architecture documentation
│   ├── overview.md     # High-level system design
│   └── decisions.md    # Architecture decisions
├── api/                # API documentation
│   ├── endpoints.md    # API endpoints
│   └── schemas.md      # Data schemas
├── data/               # Data pipeline documentation
│   ├── etl.md          # ETL processes
│   └── streaming.md    # Streaming pipeline
├── ml/                 # Machine learning documentation
│   ├── models.md       # Model architecture
│   └── training.md     # Training pipeline
├── workflow/           # Workflow documentation
│   ├── progress.md     # Project progress tracking
│   └── development.md  # Development guidelines
└── security/           # Security documentation
    └── secrets.md      # Secrets management
```
## Development Workflow

### Code Generation

Use Cursor tools for code generation:

```bash
# Generate new service
cursor generate service user_management

# Generate new model
cursor generate model recommendation
```
### Progress Tracking

Track development progress in `workflow/progress.yaml`:

```bash
# View current progress
poetry run python -m workflow.progress status

# Update task status
poetry run python -m workflow.progress update "Complete schema detection" --status done
```
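
The progress file can also be inspected directly. This sketch assumes a hypothetical `tasks` list inside `workflow/progress.yaml`; the file's actual schema may differ:

```python
# Hypothetical direct read of workflow/progress.yaml with PyYAML.
import yaml

with open("workflow/progress.yaml") as f:
    progress = yaml.safe_load(f)

# Print one status/name line per task (keys are illustrative).
for task in progress.get("tasks", []):
    print(f"{task.get('status', 'todo'):>12}  {task.get('name', '?')}")
```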
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/your-org/spark-etl-ml-pipeline.git
   cd spark-etl-ml-pipeline
   ```

2. Install Poetry and dependencies

   ```bash
   curl -sSL https://install.python-poetry.org | python3 -
   poetry install
   ```

3. Set up environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```
## Project Structure

```
├── docs/                # Documentation
├── notebooks/           # Jupyter notebooks for exploration and training
├── src/                 # Source code
│   ├── api/             # API components
│   │   ├── fastapi/     # FastAPI implementation
│   │   └── mcp/         # MCP servers
│   ├── core/            # Core components
│   │   ├── config/      # Configuration management
│   │   ├── di/          # Dependency injection
│   │   └── logging/     # Logging infrastructure
│   ├── data/            # Data components
│   │   ├── etl/         # Batch ETL
│   │   ├── hive/        # Hive integration
│   │   ├── schema/      # Schema management
│   │   └── streams/     # Streaming components
│   ├── ml/              # Machine learning components
│   │   ├── inference/   # Model inference
│   │   ├── models/      # Model definitions
│   │   └── training/    # Model training
│   └── common/          # Shared utilities
├── tests/               # Test suite
├── workflow/            # Workflow definitions
│   ├── etl/             # ETL workflows
│   ├── inference/       # Inference workflows
│   ├── streaming/       # Streaming workflows
│   └── training/        # Training workflows
├── .env.example         # Example environment variables
├── .gitignore           # Git ignore file
├── pyproject.toml       # Python project configuration
├── poetry.lock          # Locked dependencies
└── README.md            # This file
```
## Development

### Adding New Components

- Follow the modular architecture pattern
- Update `workflow/codebasemap.json` with your new component
- Add appropriate tests
- Document your component
### Testing

```bash
# Run all tests
pytest

# Run specific test category
pytest tests/data/
```
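
Tests can also be run through the Poetry environment with `poetry run pytest`. As a sketch of what a Spark-backed unit test can look like (the transform under test and the file path are illustrative, not this repository's API):

```python
# tests/data/test_transform_sketch.py -- illustrative Spark unit test.
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="module")
def spark():
    # Local single-threaded session keeps the test lightweight.
    session = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    yield session
    session.stop()

def add_event_date(df):
    """Hypothetical transform: derive a date column from a timestamp string."""
    return df.withColumn("event_date", F.to_date("event_ts"))

def test_add_event_date(spark):
    df = spark.createDataFrame([("2024-01-01 10:00:00",)], ["event_ts"])
    row = add_event_date(df).first()
    assert row["event_date"].isoformat() == "2024-01-01"
```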
## Documentation

For full documentation, see the [docs](docs/) directory or visit our documentation site.
## Contributing

- Check `workflow/progress.yaml` for current status and next steps
- Follow the [Development Guidelines](docs/workflow/development.md)
- Use Poetry for dependency management
- Ensure documentation is updated
- Add tests for new features
## License
This project is licensed under the MIT License - see the LICENSE file for details.