
🔥 FIRE-1 Agent

Next-Generation Web Scraping Agent powered by Firecrawl v1 API

A comprehensive web scraping solution optimized for the latest Firecrawl capabilities, featuring mobile device emulation, advanced actions, and seamless integration with agentic frameworks.

🤖 FIRE-1 Agent vs. Firecrawl MCP: Better Together

FIRE-1 Agent and Firecrawl's MCP Server are complementary tools that solve different problems:

🛠️ Firecrawl MCP (AI Integration Layer)

  • Purpose: Direct AI assistant access to web scraping via Model Context Protocol
  • Best for: Real-time AI interactions, conversational scraping, IDE integrations
  • Use cases: "Hey Claude, scrape this website for me" or automated AI workflows

🚀 FIRE-1 Agent (Advanced Framework)

  • Purpose: Sophisticated scraping automation with custom logic and workflows
  • Best for: Complex scraping projects, specialized retry logic, batch operations
  • Use cases: Large-scale data collection, difficult sites, custom data processing

🔥 FIRE-1 MCP Server: NOW AVAILABLE!

  • Purpose: Best of both worlds - Advanced FIRE-1 Agent capabilities accessible via MCP
  • Best for: AI assistants needing sophisticated scraping with retry logic, mobile emulation, actions
  • Use cases: "Use FIRE-1 Agent to scrape this mobile site with screenshots and data extraction"

💡 Perfect Integration

# Option 1: Use FIRE-1 Agent directly for complex automation
from fire_agent import FireAgent

fire_agent = FireAgent()
data = await fire_agent.scrape_with_retries(difficult_url)

# Option 2: Use FIRE-1 MCP Server for AI assistant integration
# "Hey Claude, use FIRE-1 Agent to crawl this docs site and extract all API endpoints"

# Option 3: Use both in your workflow
# Complex scraping + AI analysis + MCP integration

🎉 Ready Now: FIRE-1 MCP Server provides 6 advanced tools accessible from any MCP-compatible AI assistant!

📁 Project Structure

FIRE-1-Agent/
├── fire_agent.py           # Core FIRE-1 Agent module
├── fire_mcp_server.py      # 🆕 MCP Server - AI Assistant Integration
├── setup.py               # Package setup and installation
├── requirements.txt       # Python dependencies (includes MCP)
├── env.example           # Environment configuration template
├── mcp_config.json       # MCP server configuration template
├── pytest.ini           # Test configuration
├── README.md             # This file
├── SETUP_GITHUB.md       # GitHub setup guide
├── MCP_SETUP.md          # 🆕 MCP Server setup and integration guide
│
├── examples/             # Example scrapers and use cases
│   ├── scrape_surveillance.py      # Surveillance Watch scraper
│   ├── surveillance_design_scraper.py  # Design inspiration scraper
│   └── surveillance_urls.txt       # Target URLs
│
├── scripts/              # Utility scripts
│   ├── gemini_docs_scraper.py      # Gemini API docs scraper
│   ├── run_tests.py               # Test runner
│   └── activate.sh                # Virtual environment helper
│
├── tests/                # Test suite
└── scraped_data/         # Output directory (auto-created)

✨ Features

🆕 Latest Firecrawl v1 Features

  • 📱 Mobile Device Emulation - Scrape mobile-specific content and responsive designs
  • 🎬 Advanced Actions - Click, scroll, input, wait, screenshot, and scrape in sequence
  • 🖼️ Enhanced Screenshots - Full-page and element-specific screenshots
  • 🔗 Advanced Iframe Scraping - Recursive iframe traversal and cross-origin handling
  • 4x Faster Markdown Parsing - Rebuilt parser for enhanced speed and reliability

Core Capabilities

  • 🔍 Individual URL Scraping - Extract content from specific URLs with retry logic
  • 🕷️ Website Crawling - Recursively crawl entire websites with smart limits
  • 🗺️ Website Mapping - Fast discovery of all URLs on a website
  • 🔬 Structured Data Extraction - LLM-powered data extraction with custom schemas
  • Actions Support - Execute complex interactions before scraping

Advanced Features

  • 🤖 Agentic Framework Ready - Optimized async interfaces for AI agent integration
  • 🚀 Batch Processing - High-performance concurrent processing with semaphores
  • 📊 Progress Tracking - Real-time progress bars and comprehensive statistics
  • 🛡️ Enterprise Error Handling - Exponential backoff retry with smart error detection
  • 💾 Multiple Output Formats - Markdown, HTML, screenshots, links, structured data
  • 🎯 Flexible Configuration - Environment variables and command-line options
  • 📈 Analytics & Reporting - Detailed statistics and success rate tracking

Output Formats

  • Markdown - Clean, LLM-ready content
  • HTML - Raw HTML content with proper formatting
  • Screenshots - Full-page and mobile screenshots
  • Links - Extracted links and navigation structure
  • Structured Data - LLM-extracted JSON with custom schemas
  • JSON/CSV - Complete metadata and analytics exports

🚀 Quick Start

1. Installation

# Clone the repository
git clone https://github.com/bubroz/fire-1-agent
cd fire-1-agent

# Install dependencies
pip install -r requirements.txt

2. Configuration

Copy the example configuration:

cp env.example .env

Edit .env and add your Firecrawl API key:

# Get your API key from https://firecrawl.dev
FIRECRAWL_API_KEY=fc-YOUR_API_KEY_HERE

3. Prepare URLs

Create a URLs file with your target websites, one per line:

# Create your URLs file
echo "https://example.com/page1" > my_urls.txt
echo "https://example.com/page2" >> my_urls.txt
# Comments start with #
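
For reference, a minimal sketch of how such a file can be parsed; the agent's actual loader lives in fire_agent.py and may differ:

from pathlib import Path

def load_urls(path: str) -> list[str]:
    """Return one URL per non-empty line, skipping # comments (illustrative sketch)."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

print(load_urls("my_urls.txt"))  # ['https://example.com/page1', 'https://example.com/page2']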

4. Run the Agent

For specialized scraping (surveillance example):

python examples/scrape_surveillance.py

Basic Scraping (General purpose):

python fire_agent.py --urls-file my_urls.txt

Interactive Mode:

python fire_agent.py --operation interactive

📖 Usage Guide

Basic Agent (fire_agent.py)

🆕 Unified agent with ALL Firecrawl v1 capabilities - optimized for agentic frameworks:

  • Default mode: Simple batch scraping with v1 API optimizations
  • Advanced modes: --operation crawl|map|extract|interactive
  • New features: Mobile scraping, action sequences, enhanced screenshots
  • Agentic ready: Async interfaces, capabilities discovery, batch processing
  • Enterprise grade: Exponential backoff retry, comprehensive error handling

Default usage (simple scraping):

python fire_agent.py --urls-file your_urls.txt    # Scrapes your URLs with v1 API
python fire_agent.py --urls-file your_urls.txt --concurrent 3

Advanced usage:

python fire_agent.py --operation crawl --crawl-limit 100
python fire_agent.py --operation map
python fire_agent.py --operation interactive

Agentic framework usage:

# Direct integration in your agent code
from fire_agent import FireAgent

agent = FireAgent()
result = await agent.scrape_single_url(url, {'mobile': True, 'formats': ['markdown', 'screenshot']})

Specialized Scrapers (Examples)

Surveillance Watch Design Scraper:

  • Pre-configured for design inspiration scraping
  • Enhanced retry logic with exponential backoff
  • Optimized output formatting for design analysis

Usage:

python examples/surveillance_design_scraper.py              # Use default settings
python examples/scrape_surveillance.py --concurrent 3       # Adjust concurrency

Additional Utility Scripts (scripts/)

Gemini Documentation Scraper (scripts/gemini_docs_scraper.py):

  • Specialized agent optimized for API documentation scraping
  • Enhanced retry logic with exponential backoff
  • Optimized output formatting for documentation
  • Available in the scripts directory

Usage:

python scripts/gemini_docs_scraper.py              # Use default settings
python scripts/gemini_docs_scraper.py --concurrent 3    # Adjust concurrency

Configuration Options

Environment Variables (.env file):

# API Configuration
FIRECRAWL_API_KEY=fc-your-api-key

# Scraping Configuration
OUTPUT_FORMAT=markdown,html,metadata,mobile,screenshot
MAX_CONCURRENT_REQUESTS=5
DELAY_BETWEEN_REQUESTS=1.0

# Retry Configuration (NEW!)
MAX_RETRIES=3
RETRY_BASE_DELAY=2.0

# Output Configuration
OUTPUT_DIR=scraped_data
SAVE_INDIVIDUAL_FILES=true
SAVE_COMBINED_FILE=true

# Advanced Configuration
CRAWL_LIMIT=50
CRAWL_DEPTH=3
ENABLE_ACTIONS=true
ENABLE_STRUCTURED_EXTRACTION=true
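
These variables can be loaded with python-dotenv (listed among the dependencies). A rough sketch, assuming the defaults shown above; the agent's actual configuration handling may differ:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

API_KEY = os.environ["FIRECRAWL_API_KEY"]  # required
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT_REQUESTS", "5"))
DELAY = float(os.getenv("DELAY_BETWEEN_REQUESTS", "1.0"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "scraped_data")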

🎯 Use Cases

1. Design Inspiration Scraping

Use the specialized surveillance scraper example:

python examples/surveillance_design_scraper.py

✅ Pre-configured for design analysis
✅ Retry logic to maximize success rate
✅ Optimized output formats for design inspiration

2. General Documentation Scraping

python fire_agent.py --urls-file your_urls.txt

3. Website Content Analysis

python fire_agent.py --operation crawl --crawl-limit 200

4. URL Discovery

python fire_agent.py --operation map

5. Structured Data Extraction

Extract specific data using custom schemas:

python fire_agent.py --operation interactive
# Then choose option 4 for structured extraction

6. Agentic Framework Integration

Perfect for AI agents and automation:

from fire_agent import FireAgent

# Initialize with capabilities discovery
agent = FireAgent()
capabilities = agent.get_agent_capabilities()

# Advanced action-based scraping
actions = [
    {"type": "wait", "selector": "#content"},
    {"type": "click", "selector": ".load-more"},
    {"type": "screenshot"},
    {"type": "scrape"}
]
result = await agent.scrape_with_actions(url, actions, ['markdown', 'screenshot'])

# High-performance batch processing
results = await agent.batch_scrape_async(urls, max_concurrent=10)

📁 Output Structure

The agent creates organized output in the scraped_data/ directory:

scraped_data/
├── individual_files/
│   ├── ai.google.dev_gemini-api_docs_api-key.md
│   ├── ai.google.dev_gemini-api_docs_api-key.html
│   └── ai.google.dev_gemini-api_docs_api-key_metadata.json
├── combined_scrape_20241201_143022.md
├── combined_scrape_20241201_143022.json
└── scrape_summary_20241201_143022.csv

File Types

  • .md files - Clean markdown content for each URL
  • .html files - Raw HTML content
  • _metadata.json - Page metadata (title, description, etc.)
  • combined_*.md - All content in one markdown file
  • combined_*.json - Complete data with metadata
  • scrape_summary_*.csv - Spreadsheet-friendly summary
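
Per-URL filenames are derived from the URL itself. A hypothetical mapping consistent with the names above (the real scheme in fire_agent.py may differ):

from urllib.parse import urlparse

def url_to_stem(url: str) -> str:
    """Turn a URL into a filesystem-safe stem (illustrative sketch)."""
    parts = urlparse(url)
    stem = (parts.netloc + parts.path).strip("/").replace("/", "_")
    return stem or "index"

url_to_stem("https://ai.google.dev/gemini-api/docs/api-key")
# -> 'ai.google.dev_gemini-api_docs_api-key'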

🔧 Advanced Configuration

Custom Actions

Enable JavaScript interactions:

ENABLE_ACTIONS=true

This adds actions like:

  • Wait for page load
  • Scroll to load dynamic content
  • Click buttons
  • Fill forms

Structured Data Extraction

Enable LLM-powered data extraction:

ENABLE_STRUCTURED_EXTRACTION=true

Define custom schemas (an example follows the list) for:

  • Articles (title, author, content, tags)
  • Products (name, price, features, rating)
  • Contacts (name, email, phone, address)
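
For illustration, a hypothetical article schema in JSON Schema form; the field names are assumptions, not the project's shipped schemas:

# Hypothetical article extraction schema (illustrative only).
article_schema = {
    "type": "object",
    "properties": {
        "title":   {"type": "string"},
        "author":  {"type": "string"},
        "content": {"type": "string"},
        "tags":    {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "content"],
}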

Rate Limiting

Adjust request frequency:

MAX_CONCURRENT_REQUESTS=3
DELAY_BETWEEN_REQUESTS=2.0

🛡️ Error Handling & Retry Logic

The agent includes comprehensive error handling with an automatic retry mechanism:

Retry Configuration

MAX_RETRIES=3           # Number of retry attempts (default: 3)
RETRY_BASE_DELAY=2.0    # Base delay in seconds (default: 2.0s)

How Retries Work

  • Exponential Backoff: 2s → 4s → 8s delays between retries (see the sketch after this list)
  • Smart Retry Logic: Only retries on network/timeout errors, not content issues
  • Attempt Tracking: Logs show which attempt succeeded
  • Final Failure: After max retries, records detailed error information
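
The sketch below captures this pattern, assuming a hypothetical async scrape callable; the agent's real implementation may differ in details:

import asyncio

async def scrape_with_backoff(scrape, url, max_retries=3, base_delay=2.0):
    """Retry transient failures with exponential backoff: 2s -> 4s -> 8s.
    Illustrative sketch of the behavior described above."""
    for attempt in range(max_retries + 1):
        try:
            return await scrape(url)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # final failure: the caller records error details
            await asyncio.sleep(base_delay * (2 ** attempt))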

Error Types Handled

  • Network timeouts - Automatic retries with backoff
  • Rate limiting - Respects API limits
  • Invalid URLs - Graceful skipping
  • Content errors - Detailed logging
  • File system errors - Safe file operations

Example Retry Flow

🔍 Scraping: https://example.com (attempt 1/4)
❌ Error: Request timeout (attempt 1)
🔄 Retry attempt 1/3 for https://example.com (waiting 2s)
🔍 Scraping: https://example.com (attempt 2/4)
❌ Error: Request timeout (attempt 2)  
🔄 Retry attempt 2/3 for https://example.com (waiting 4s)
🔍 Scraping: https://example.com (attempt 3/4)
✅ Successfully scraped: https://example.com (attempt 3)

View errors in the statistics report or check individual error logs.

📊 Monitoring & Analytics

Both agents provide detailed statistics:

  • Success rates - Track scraping efficiency
  • Processing time - Monitor performance
  • Error analysis - Identify problem URLs
  • Output metrics - Count pages processed

Example statistics output:

📊 FIRE-1 Agent Scraping Statistics
┌─────────────────────┬─────────────┐
│ Metric              │ Value       │
├─────────────────────┼─────────────┤
│ Total URLs          │ 70          │
│ Successful Scrapes  │ 68          │
│ Failed Scrapes      │ 2           │
│ Duration            │ 0:02:34     │
│ Success Rate        │ 97.1%       │
└─────────────────────┴─────────────┘
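
A report like this can be rendered with the rich library (one of the listed dependencies). A minimal sketch, not the project's actual reporting code:

from rich.console import Console
from rich.table import Table

def print_stats(total: int, ok: int, duration: str) -> None:
    """Render a statistics table like the one shown above (illustrative sketch)."""
    table = Table(title="📊 FIRE-1 Agent Scraping Statistics")
    table.add_column("Metric")
    table.add_column("Value")
    table.add_row("Total URLs", str(total))
    table.add_row("Successful Scrapes", str(ok))
    table.add_row("Failed Scrapes", str(total - ok))
    table.add_row("Duration", duration)
    table.add_row("Success Rate", f"{ok / total:.1%}")
    Console().print(table)

print_stats(70, 68, "0:02:34")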

🤖 Interactive Mode

The unified agent includes an interactive mode for exploratory scraping:

python fire_agent.py --operation interactive

Features:

  1. Choose operation type - Scrape, crawl, map, or extract
  2. Input URLs manually - Or load from file
  3. Configure parameters - Set limits and options
  4. Real-time feedback - See results immediately
  5. Chain operations - Map → Scrape discovered URLs

💡 Tips & Best Practices

1. Start Small

Begin with a few URLs to test configuration:

# Test with first 5 URLs
head -5 your_urls.txt > test_urls.txt
python fire_agent.py --urls-file test_urls.txt

2. Optimize Concurrency

Balance speed vs. politeness (a semaphore sketch follows the list):

  • Conservative: 2-3 concurrent requests
  • Moderate: 5-8 concurrent requests
  • Aggressive: 10+ concurrent requests (use carefully)
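
A semaphore caps in-flight requests while a delay keeps the crawl polite. A minimal sketch, assuming a hypothetical async scrape callable; the agent's batch code may differ:

import asyncio

async def bounded_scrape_all(urls, scrape, max_concurrent=5, delay=1.0):
    """Scrape all URLs with capped concurrency (illustrative sketch)."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:
            result = await scrape(url)
            await asyncio.sleep(delay)  # politeness delay between requests
            return result

    return await asyncio.gather(*(one(u) for u in urls))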

3. Content Quality

Choose appropriate formats:

  • Markdown: Best for LLM processing
  • HTML: Preserve original formatting
  • Metadata: Extract page information

4. Large Datasets

For large URL lists:

  • Use batch processing
  • Monitor memory usage
  • Enable individual file saving
  • Set appropriate delays

5. API Limits

Firecrawl has usage limits:

  • Monitor your usage
  • Implement proper delays
  • Handle rate limiting gracefully

🔗 Firecrawl Features Used

This agent leverages Firecrawl's full capability set (a usage sketch follows the list):

  • Scraping - scrape_url() for individual pages
  • Crawling - crawl_url() for recursive site scraping
  • Mapping - map_url() for URL discovery
  • Actions - JavaScript interactions
  • Extraction - LLM-powered structured data
  • Formats - Multiple output formats
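
A minimal sketch of these calls with the firecrawl-py SDK; exact parameter names vary between SDK releases, so treat this as an approximation and check your installed version:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY_HERE")

# Individual page scrape with multiple output formats
page = app.scrape_url("https://example.com", params={"formats": ["markdown", "html"]})

# Recursive site crawl with a page limit
crawl = app.crawl_url("https://example.com", params={"limit": 50})

# Fast URL discovery
site_map = app.map_url("https://example.com")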

🚨 Troubleshooting

Common Issues

1. API Key Error

❌ FIRECRAWL_API_KEY not found

Solution: Set your API key in .env file

2. Import Errors

❌ firecrawl-py not installed

Solution: Run pip install -r requirements.txt

3. File Not Found

❌ File your_urls.txt not found

Solution: Ensure your URLs file exists and has correct name

4. Empty Results

⚠️ No content returned for URL

Solution: Check URL accessibility and Firecrawl limits

Debug Mode

Enable verbose logging:

python fire_agent.py --verbose

Rate Limiting

If hitting rate limits:

DELAY_BETWEEN_REQUESTS=3.0
MAX_CONCURRENT_REQUESTS=2

🔄 Updates & Maintenance

Keep your agent updated with the latest Firecrawl v1 features:

  1. Update dependencies:

    pip install --upgrade firecrawl-py rich python-dotenv pandas
    
  2. Check Firecrawl status: Visit Firecrawl Status

  3. Monitor API usage: Check your Firecrawl Dashboard

  4. Latest v1 features: Mobile scraping, advanced actions, enhanced iframes

🧪 Testing & Quality Assurance

The FIRE-1 Agent includes a comprehensive test suite of 53 tests across five categories:

Test Categories

  • Unit Tests (12 tests) - Core functionality, configuration, file operations
  • Integration Tests (15 tests) - Complete workflows, error recovery, feature integration
  • Scraping Operations (12 tests) - Scrape, crawl, map, extract operations
  • Retry Logic (8 tests) - Exponential backoff, error handling, concurrency
  • Gemini Scraper (6 tests) - Specialized documentation scraping workflows

Running Tests

Quick test (unit tests only):

python scripts/run_tests.py --mode quick

Full test suite:

python scripts/run_tests.py --mode all

Specific test categories:

python scripts/run_tests.py --mode unit        # Unit tests only
python scripts/run_tests.py --mode integration # Integration tests
python scripts/run_tests.py --mode retry       # Retry logic tests
python scripts/run_tests.py --mode gemini      # Gemini-specific tests

Real API tests (use sparingly):

pytest tests/test_real_integration.py -m real_api -v

Test Coverage

  • 53 passing tests with 98% success rate
  • Professional mocking for external APIs
  • Real file operations testing
  • Async/await support with proper fixtures
  • Coverage reporting in HTML and XML formats

View coverage report:

# After running tests
open htmlcov/index.html  # View detailed coverage report

Test Infrastructure

  • pytest with async support
  • Rich output with colored progress
  • HTML reports for CI/CD integration
  • Parallel execution with pytest-xdist
  • Factory patterns for test data generation
  • Comprehensive mocking framework
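
As a flavor of that style, a hedged sketch of an async test with mocking; the test and helper names are assumptions, and scrape_with_backoff refers to the retry sketch earlier in this README:

import asyncio
import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio  # requires pytest-asyncio
async def test_retry_then_success():
    # First call times out, the second returns content.
    scrape = AsyncMock(side_effect=[asyncio.TimeoutError(), {"markdown": "# ok"}])
    result = await scrape_with_backoff(scrape, "https://example.com",
                                       max_retries=3, base_delay=0.0)
    assert result == {"markdown": "# ok"}
    assert scrape.await_count == 2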

The test suite ensures reliability and catches regressions, making FIRE-1 Agent production-ready.

📝 License

This project is open source. Use it freely for your scraping and agentic framework needs.

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Additional agentic framework integrations
  • Enhanced mobile device profiles
  • Advanced action sequences
  • Performance optimizations
  • More structured extraction schemas

Happy Scraping with FIRE-1 Agent v2.0! 🔥

Optimized for Firecrawl v1 API • Ready for Agentic Frameworks • Enterprise-Grade Performance

For questions or issues, please check the Firecrawl documentation or create an issue in this repository.

