
job_scraper

Scraping new career pages dynamically

Repository Info

Stars: 0
Forks: 0
Watchers: 0
Issues: 0
Language: Python
License: -

About This Server

Scraping new career pages dynamically

This server implements the Model Context Protocol (MCP), so it can be integrated with AI applications to provide additional context and capabilities, enabling enhanced AI interactions and functionality.

Documentation

🧠 LastStartup Job Scraper

A robust, intelligent job scraping microservice that dynamically learns and caches job listing structures from startup career pages — even those styled with Tailwind CSS. Built for scalability, automation, and precision.

🚀 Features

  • 🔎 Company Lookup: Matches company names to their careers URL from a base index.
  • 🧱 Schema Inference:
    • Converts deep Tailwind-style CSS selectors into XPath.
    • Uses ask_llm_for_content() + extract_consistent_selectors() to infer job card structure dynamically.
  • 🧠 LLM-Fallback: If no schema exists, uses a language model to understand the HTML and derive a reusable structure.
  • 💾 Schema Cache: Saves learned schemas per domain to a persistent job_structure_cache.json.
  • ⚙️ MCP Tool Integration: Exposed as an MCP tool via @mcp.tool() for plug-and-play use in autonomous agents or pipelines.
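
As a rough illustration of how the schema cache, the LLM fallback, and the @mcp.tool() exposure could fit together (a sketch only: apart from the names ask_llm_for_content() and get_jobs, which appear in this README, the FastMCP wiring, cache helpers, and tool signature below are assumptions):

```python
# Minimal sketch of the cache-then-LLM flow, not the repository's actual code.
import json
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # assumes the official MCP Python SDK

mcp = FastMCP("jobs_scraper")
CACHE_FILE = Path("job_structure_cache.json")


def load_schema(domain: str) -> dict | None:
    """Return the cached job-card schema for a domain, if one was learned before."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    return cache.get(domain)


def save_schema(domain: str, schema: dict) -> None:
    """Persist a newly inferred schema so the LLM is not consulted again for this domain."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    cache[domain] = schema
    CACHE_FILE.write_text(json.dumps(cache, indent=2))


def ask_llm_for_content(html: str) -> dict:
    """Placeholder for the README's LLM fallback that derives a reusable structure."""
    raise NotImplementedError("see laststartupScraping.py in the repository")


@mcp.tool()
def get_jobs(domain: str, html: str) -> dict:
    """Reuse a cached schema for the domain; otherwise infer one with the LLM and cache it."""
    schema = load_schema(domain)
    if schema is None:
        schema = ask_llm_for_content(html)
        save_schema(domain, schema)
    # Applying the schema (XPath extraction of job cards) is omitted in this sketch.
    return {"domain": domain, "schema": schema}
```

The real get_jobs presumably takes a company name and performs the lookup and page fetch itself; the point here is only the cache-first, LLM-fallback ordering.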

🧩 Project Structure

├── jobs_scraper/
│   ├── laststartupScraping.py      # Core logic: scraping, schema inference, XPath logic
│   ├── job_structure_cache.json    # Persistent schema storage per domain
│   └── ...
├── jobs_scraper.py                 # Entrypoint with @mcp.tool(get_jobs)
└── README.md
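
Since laststartupScraping.py handles the XPath logic, here is one way the Tailwind-style CSS-selector-to-XPath conversion could be done with the cssselect package (a sketch under that assumption; the repository's actual conversion may be hand-rolled and differ):

```python
# Illustrative CSS-to-XPath translation; not necessarily how laststartupScraping.py does it.
from cssselect import GenericTranslator


def css_to_xpath(css_selector: str) -> str:
    """Translate a CSS selector (e.g. one built from Tailwind classes) into XPath."""
    return GenericTranslator().css_to_xpath(css_selector)


# Tailwind variants such as "md:flex" or "w-1/2" contain characters that are not valid
# in plain CSS class selectors, so they would need escaping before translation.
print(css_to_xpath("div.flex.flex-col > a.job-card"))
# -> descendant-or-self::div[...][...]/a[...]  (usable with lxml's xpath())
```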

Claude Desktop Integration

To run this job scraper as a tool inside Claude for Desktop, follow these steps:

Download and install the Claude Desktop app for your system.

Inside your Claude app configuration folder, add or update the following entry under the "mcpServers" section of your claude_desktop_config.json file:

"jobs_scraper": {
    "command": "/Users/noamstopler/.local/bin/uv",
    "args": [
        "--directory",
        "/Users/noamstopler/Desktop/myProjects/RelevantJobsMCPServer/jobs_scraper",
        "run",
        "jobs_scraper.py"
    ],
    "env": {
        "GROQ_API_KEY": "your_api_key"
    }
}

Once configured, restart Claude Desktop. Your jobs_scraper tool should now be available as an integrated tool that Claude can call; on the server side it is exposed via @mcp.tool().

⚠️ Disclaimer: Early Version

This is a very early-stage prototype of the job scraper.

🧪 It has been tested on only a small number of career pages so far.

📄 The structure inference logic works best on sites with well-defined HTML and consistent Tailwind-style class patterns.

💥 Some sites may still fail due to dynamic content, non-standard layouts, or aggressive bot protection.

Expect bugs, edge cases, and limitations.

🧱 Token Limitations with Free API Keys

Some career pages are very large or contain deeply nested HTML, which makes them difficult to process using free-tier LLM API keys.

🔐 Free Groq/OpenAI API keys typically have token limits (e.g. 4K–8K tokens).

🧠 When using ask_llm_for_content() to analyze such pages, the model may truncate input or fail to respond fully.
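
One common mitigation is to strip and truncate the HTML before handing it to ask_llm_for_content(); the helper below is a hedged sketch of that idea (the name, the character budget, and the use of BeautifulSoup are assumptions, not the project's behavior):

```python
# Hypothetical pre-processing to keep career-page HTML within free-tier token limits.
from bs4 import BeautifulSoup

MAX_CHARS = 16_000  # rough character budget for a ~4K-token context window


def shrink_html(raw_html: str, max_chars: int = MAX_CHARS) -> str:
    """Drop bulky, non-informative nodes and hard-truncate what remains."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "svg", "noscript", "head"]):
        tag.decompose()  # scripts, styles, icons, etc. add tokens but no job data
    compact = str(soup.body or soup)
    return compact[:max_chars]  # last-resort hard cut to stay under the limit
```

The 16,000-character budget is only a rough stand-in for a ~4K-token window; the right value depends on the model and its tokenizer.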

🗂 Data Source

This scraper currently pulls its list of companies from:

🔗 https://www.lastartup.co.il/funding

The site lists Israeli startups and companies that recently raised funding.

Each company entry typically includes a name and a link to their careers page, which this tool uses as the starting point for scraping job listings.

If the company has no accessible careers page, it is skipped.

📌 Note: Support is currently limited to companies listed on this source. Future versions may expand to support additional directories or manual entry.
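
To make the lookup step concrete, a fetch of that index could look roughly like this (a sketch only: the page's actual markup is not documented here, so the selector and the filtering heuristic below are placeholders rather than the project's real logic):

```python
# Placeholder company-lookup sketch; the real index parsing lives in the repository.
import requests
from bs4 import BeautifulSoup

FUNDING_INDEX = "https://www.lastartup.co.il/funding"


def fetch_company_careers_links() -> dict[str, str]:
    """Return a {company_name: careers_url} map, skipping entries without a usable link."""
    html = requests.get(FUNDING_INDEX, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    companies: dict[str, str] = {}
    for anchor in soup.find_all("a", href=True):  # placeholder: the real selector is unknown
        name = anchor.get_text(strip=True)
        href = anchor["href"]
        if name and href.startswith("http"):
            companies.setdefault(name, href)
    return companies
```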

🛠 Limitations & Roadmap

🔍 Careers Page Discovery (Current vs. Planned)

In this prototype version, the scraper naively assumes that the careers page is located at: https://

❌ Fails when companies use non-standard paths (e.g. /jobs, /join-us, /work-with-us, etc.)

🧠 Planned Improvement

In future versions, we plan to implement:

  • 📄 Sitemap Parsing with LLMs:
    • Automatically detect and analyze the site's sitemap.xml
    • Use a language model to infer the most likely path to the careers or jobs page
    • Support fallback to subdomain-based careers pages (e.g. jobs.company.com)

This will significantly improve accuracy and robustness when locating hiring information, especially for larger or more modern websites.
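
A minimal sketch of what that sitemap-based discovery might look like (nothing below exists in the repository yet; the keyword heuristic stands in for the planned LLM step):

```python
# Sketch of the *planned* careers-page discovery via sitemap.xml; illustrative only.
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(domain: str) -> list[str]:
    """Fetch the domain's sitemap.xml and return every <loc> URL it lists."""
    resp = requests.get(f"https://{domain}/sitemap.xml", timeout=15)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def guess_careers_url(domain: str) -> str | None:
    """Pick the most careers-looking URL; an LLM could replace this keyword heuristic."""
    keywords = ("career", "jobs", "join-us", "work-with-us", "hiring")
    for url in urls_from_sitemap(domain):
        if any(keyword in url.lower() for keyword in keywords):
            return url
    return None  # a later version could fall back to e.g. jobs.<domain>
```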

Quick Start

1. Clone the repository

   git clone https://github.com/noamst/job_scraper

2. Install dependencies

   cd job_scraper

   This is a Python project run with uv (see the Claude Desktop configuration above), so npm install does not apply; install the Python dependencies as described in the README.

3. Follow the documentation

   Check the repository's README.md file for specific installation and usage instructions.

Repository Details

Owner: noamst
Repo: job_scraper
Language: Python
License: -
Last fetched: 8/10/2025

Recommended MCP Servers

💬 Discord MCP
Enable AI assistants to seamlessly interact with Discord servers, channels, and messages.
Tags: integrations, discord, chat

🔗 Knit MCP
Connect AI agents to 200+ SaaS applications and automate workflows.
Tags: integrations, automation, saas

🕷️ Apify MCP Server
Deploy and interact with Apify actors for web scraping and data extraction.
Tags: apify, crawler, data

🌐 BrowserStack MCP
BrowserStack MCP Server for automated testing across multiple browsers.
Tags: testing, qa, browsers

Zapier MCP
A Zapier server that provides automation capabilities for various apps.
Tags: zapier, automation