
Agentic ETL
This repository contains a collection of Cursor rules, helper scripts, and documentation to power a generalized agentic ETL process with minimal human intervention. The idea is to enable an automated AI agent to:
- Locate the most accessible, highest-quality online data sources for the information you want to retrieve
- Determine the best way to extract the data using a combination of fetch requests, data APIs, sophisticated scraping systems, or LLM browser use
- Design a JSON schema for chunking, extracting, or annotating the scraped information to turn it into usable structured data (a sketch follows below)
- Use LLM tools to perform the transformation according to the schema and save the results to JSON
- Optionally upload the raw and/or structured JSON data to a database
This workflow is designed to power retrieval-augmented generation (RAG) pipelines, but it can be used for any ETL process.
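As a concrete illustration of the schema-design step, here is a minimal sketch of what an agent-designed extraction schema might look like. The field names and the schema.json filename are illustrative assumptions, not conventions defined by this repository:

```python
import json

# A hypothetical extraction schema for chunking scraped articles.
# Every field name here is an illustrative assumption.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "url": {"type": "string"},
        "chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "topic": {"type": "string"},
                },
                "required": ["text"],
            },
        },
    },
    "required": ["title", "chunks"],
}

# Persist the schema so the transformation step can load it later.
with open("schema.json", "w") as f:
    json.dump(article_schema, f, indent=2)
```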
Workflow Documentation
Most ETL repositories on GitHub are mechanical software tools designed for a specific use case. This one is more like a generalized cookbook: a recipe an LLM-powered agent can follow, using existing tools to support a wider range of use cases. The workflow is designed to be executed by a Claude 3.7 Sonnet-powered Cursor Agent, with Cursor rules in the .cursor/rules directory and helper scripts and documentation in the project root.
After presenting your use case to the agent, prompt it to follow these steps in order:
- Data Collection - The agent can reference the Bash or Python scraping rules for help with locating and scraping data (see the collection sketch after this list).
- Data Processing - The agent can reference the Bash or Python cleaning rules for help with transforming scraped data into clean JSON (see the processing sketch below).
- Data Upload - The agent can reference the DigitalOcean PostgreSQL setup and upload rules for help with setting up a database and uploading the data (see the upload sketch below).
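To make these steps concrete, here is a minimal Python data-collection sketch of the kind the scraping rules might steer the agent toward. The URL and CSS selector are placeholders for whatever source the agent locates:

```python
import json

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# Placeholder source: substitute the URL the agent located.
URL = "https://example.com/articles"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# The selector is an assumption about the page's markup.
articles = [
    {"title": a.get_text(strip=True), "href": a.get("href")}
    for a in soup.select("article h2 a")
]

# Save the raw scrape so the transformation step has a stable input.
with open("raw_articles.json", "w") as f:
    json.dump(articles, f, indent=2)
```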
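The data-processing step then normalizes the raw scrape into clean JSON. A minimal sketch, assuming the raw_articles.json file produced by the collection sketch above:

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # placeholder, matching the sketch above

with open("raw_articles.json") as f:
    raw = json.load(f)

# Normalize the records: drop empty rows, resolve relative links,
# and deduplicate on the resolved URL.
seen = set()
clean = []
for record in raw:
    title = (record.get("title") or "").strip()
    href = record.get("href")
    if not title or not href:
        continue
    url = urljoin(BASE_URL, href)
    if url in seen:
        continue
    seen.add(url)
    clean.append({"title": title, "url": url})

with open("clean_articles.json", "w") as f:
    json.dump(clean, f, indent=2)
```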
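Finally, the data-upload step pushes the structured JSON into PostgreSQL. This sketch uses psycopg2 with a connection string from the environment; the DATABASE_URL variable and the articles table are assumptions, and the repository's DigitalOcean rules cover the actual provisioning:

```python
import json
import os

import psycopg2  # pip install psycopg2-binary

# Assumption: DATABASE_URL points at your DigitalOcean (or other)
# PostgreSQL instance.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

with open("clean_articles.json") as f:
    articles = json.load(f)

with conn, conn.cursor() as cur:
    # Table name "articles" is illustrative, not defined by the repo.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url   TEXT PRIMARY KEY,
            title TEXT NOT NULL
        )
        """
    )
    for a in articles:
        cur.execute(
            "INSERT INTO articles (url, title) VALUES (%s, %s) "
            "ON CONFLICT (url) DO NOTHING",
            (a["url"], a["title"]),
        )
conn.close()
```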
More languages, tools, and deployment options can be supported by adding additional Cursor rules to the .cursor/rules folder.
Prerequisites
Before getting started, I highly recommend installing at least the following:
- Cursor
- Node.js
- uv
Your Cursor Agent will walk you through setting up any additional tools and resources your use case requires, or may even obtain and install them itself.
Contributing
If you'd like to contribute to this project, please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
License
MIT
Quick Start
Clone the repository:

```bash
git clone https://github.com/chriscarrollsmith/agentic-etl
```

Install dependencies:

```bash
cd agentic-etl
npm install
```

Follow the documentation: check the repository's README.md file for specific installation and usage instructions.