
job_scraper

Scraping new career pages dynamically

Repository Info

Stars: 0
Forks: 0
Watchers: 0
Issues: 0
Language: Python
License: -

About This Server

Scraping new career pages dynamically

This server implements the Model Context Protocol (MCP), so it can be integrated with AI applications to provide additional context and capabilities, enabling enhanced AI interactions and functionality.

Documentation

🧠 LastStartup Job Scraper

A robust, intelligent job scraping microservice that dynamically learns and caches job listing structures from startup career pages — even those styled with Tailwind CSS. Built for scalability, automation, and precision.

🚀 Features

  • 🔎 Company Lookup: Matches company names to their careers URL from a base index.
  • 🧱 Schema Inference:
    • Converts deep Tailwind-style CSS selectors into XPath.
    • Uses ask_llm_for_content() + extract_consistent_selectors() to infer job card structure dynamically.
  • 🧠 LLM-Fallback: If no schema exists, uses a language model to understand the HTML and derive a reusable structure.
  • 💾 Schema Cache: Saves learned schemas per domain to a persistent job_structure_cache.json.
  • ⚙️ MCP Tool Integration: Exposed as an MCP tool via @mcp.tool() for plug-and-play use in autonomous agents or pipelines.
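
As a rough illustration of how the schema cache, the LLM fallback, and the @mcp.tool() exposure could fit together (a sketch only: apart from the names ask_llm_for_content() and get_jobs, which appear in this README, the FastMCP wiring, cache helpers, and tool signature below are assumptions):

```python
# Minimal sketch of the cache-then-LLM flow, not the repository's actual code.
import json
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # assumes the official MCP Python SDK

mcp = FastMCP("jobs_scraper")
CACHE_FILE = Path("job_structure_cache.json")


def load_schema(domain: str) -> dict | None:
    """Return the cached job-card schema for a domain, if one was learned before."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    return cache.get(domain)


def save_schema(domain: str, schema: dict) -> None:
    """Persist a newly inferred schema so the LLM is not consulted again for this domain."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    cache[domain] = schema
    CACHE_FILE.write_text(json.dumps(cache, indent=2))


def ask_llm_for_content(html: str) -> dict:
    """Placeholder for the README's LLM fallback that derives a reusable structure."""
    raise NotImplementedError("see laststartupScraping.py in the repository")


@mcp.tool()
def get_jobs(domain: str, html: str) -> dict:
    """Reuse a cached schema for the domain; otherwise infer one with the LLM and cache it."""
    schema = load_schema(domain)
    if schema is None:
        schema = ask_llm_for_content(html)
        save_schema(domain, schema)
    # Applying the schema (XPath extraction of job cards) is omitted in this sketch.
    return {"domain": domain, "schema": schema}
```

The real get_jobs presumably takes a company name and performs the lookup and page fetch itself; the point here is only the cache-first, LLM-fallback ordering.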

🧩 Project Structure

├── jobs_scraper/
│   ├── laststartupScraping.py      # Core logic: scraping, schema inference, XPath logic
│   ├── job_structure_cache.json    # Persistent schema storage per domain
│   └── ...
├── jobs_scraper.py                 # Entrypoint with @mcp.tool(get_jobs)
└── README.md
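
Since laststartupScraping.py handles the XPath logic, here is one way the Tailwind-style CSS-selector-to-XPath conversion could be done with the cssselect package (a sketch under that assumption; the repository's actual conversion may be hand-rolled and differ):

```python
# Illustrative CSS-to-XPath translation; not necessarily how laststartupScraping.py does it.
from cssselect import GenericTranslator


def css_to_xpath(css_selector: str) -> str:
    """Translate a CSS selector (e.g. one built from Tailwind classes) into XPath."""
    return GenericTranslator().css_to_xpath(css_selector)


# Tailwind variants such as "md:flex" or "w-1/2" contain characters that are not valid
# in plain CSS class selectors, so they would need escaping before translation.
print(css_to_xpath("div.flex.flex-col > a.job-card"))
# -> descendant-or-self::div[...][...]/a[...]  (usable with lxml's xpath())
```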

Claude Desktop Integration

To run this job scraper as a tool inside Claude for Desktop, follow these steps:

Download and install the Claude Desktop app for your system.

Inside your Claude app configuration folder, add or update the following entry under the "mcpServers" section of your claude_desktop_config.json file:

"jobs_scraper": {
    "command": "/Users/noamstopler/.local/bin/uv",
    "args": [
        "--directory",
        "/Users/noamstopler/Desktop/myProjects/RelevantJobsMCPServer/jobs_scraper",
        "run",
        "jobs_scraper.py"
    ],
    "env": {
        "GROQ_API_KEY": "your_api_key"
    }
}

Once configured, restart Claude Desktop. Your jobs_scraper tool should now be available as an integrated tool that Claude can call; on the server side it is exposed via @mcp.tool().

⚠️ Disclaimer: Early Version

This is a very early-stage prototype of the job scraper.

🧪 It has been tested on only a small number of career pages so far.

📄 The structure inference logic works best on sites with well-defined HTML and consistent Tailwind-style class patterns.

💥 Some sites may still fail due to dynamic content, non-standard layouts, or aggressive bot protection.

Expect bugs, edge cases, and limitations.

🧱 Token Limitations with Free API Keys

Some career pages are very large or contain deeply nested HTML, which makes them difficult to process using free-tier LLM API keys.

🔐 Free Groq/OpenAI API keys typically have token limits (e.g. 4K–8K tokens).

🧠 When using ask_llm_for_content() to analyze such pages, the model may truncate input or fail to respond fully.
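
One common mitigation is to strip and truncate the HTML before handing it to ask_llm_for_content(); the helper below is a hedged sketch of that idea (the name, the character budget, and the use of BeautifulSoup are assumptions, not the project's behavior):

```python
# Hypothetical pre-processing to keep career-page HTML within free-tier token limits.
from bs4 import BeautifulSoup

MAX_CHARS = 16_000  # rough character budget for a ~4K-token context window


def shrink_html(raw_html: str, max_chars: int = MAX_CHARS) -> str:
    """Drop bulky, non-informative nodes and hard-truncate what remains."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "svg", "noscript", "head"]):
        tag.decompose()  # scripts, styles, icons, etc. add tokens but no job data
    compact = str(soup.body or soup)
    return compact[:max_chars]  # last-resort hard cut to stay under the limit
```

The 16,000-character budget is only a rough stand-in for a ~4K-token window; the right value depends on the model and its tokenizer.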

🗂 Data Source

This scraper currently pulls its list of companies from:

🔗 https://www.lastartup.co.il/funding

The site lists Israeli startups and companies that recently raised funding.

Each company entry typically includes a name and a link to their careers page, which this tool uses as the starting point for scraping job listings.

If the company has no accessible careers page, it is skipped.

📌 Note: Support is currently limited to companies listed on this source. Future versions may expand to support additional directories or manual entry.
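
To make the lookup step concrete, a fetch of that index could look roughly like this (a sketch only: the page's actual markup is not documented here, so the selector and the filtering heuristic below are placeholders rather than the project's real logic):

```python
# Placeholder company-lookup sketch; the real index parsing lives in the repository.
import requests
from bs4 import BeautifulSoup

FUNDING_INDEX = "https://www.lastartup.co.il/funding"


def fetch_company_careers_links() -> dict[str, str]:
    """Return a {company_name: careers_url} map, skipping entries without a usable link."""
    html = requests.get(FUNDING_INDEX, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    companies: dict[str, str] = {}
    for anchor in soup.find_all("a", href=True):  # placeholder: the real selector is unknown
        name = anchor.get_text(strip=True)
        href = anchor["href"]
        if name and href.startswith("http"):
            companies.setdefault(name, href)
    return companies
```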

🛠 Limitations & Roadmap

🔍 Careers Page Discovery (Current vs. Planned)

In this prototype version, the scraper naively assumes that the careers page is located at: https://

❌ Fails when companies use non-standard paths (e.g. /jobs, /join-us, /work-with-us, etc.)

🧠 Planned Improvement

In future versions, we plan to implement:

  • 📄 Sitemap Parsing with LLMs:
    • Automatically detect and analyze the site's sitemap.xml
    • Use a language model to infer the most likely path to the careers or jobs page
    • Support fallback to subdomain-based careers pages (e.g. jobs.company.com)

This will significantly improve accuracy and robustness when locating hiring information, especially for larger or more modern websites.
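
A minimal sketch of what that sitemap-based discovery might look like (nothing below exists in the repository yet; the keyword heuristic stands in for the planned LLM step):

```python
# Sketch of the *planned* careers-page discovery via sitemap.xml; illustrative only.
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(domain: str) -> list[str]:
    """Fetch the domain's sitemap.xml and return every <loc> URL it lists."""
    resp = requests.get(f"https://{domain}/sitemap.xml", timeout=15)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def guess_careers_url(domain: str) -> str | None:
    """Pick the most careers-looking URL; an LLM could replace this keyword heuristic."""
    keywords = ("career", "jobs", "join-us", "work-with-us", "hiring")
    for url in urls_from_sitemap(domain):
        if any(keyword in url.lower() for keyword in keywords):
            return url
    return None  # a later version could fall back to e.g. jobs.<domain>
```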

Quick Start

1. Clone the repository

   git clone https://github.com/noamst/job_scraper

2. Install dependencies

   cd job_scraper

   This is a Python project run with uv (see the Claude Desktop configuration above), so npm install does not apply; install the Python dependencies as described in the README.

3. Follow the documentation

   Check the repository's README.md file for specific installation and usage instructions.

Repository Details

Owner: noamst
Repo: job_scraper
Language: Python
License: -
Last fetched: 8/10/2025

Recommended MCP Servers

💬 Discord MCP
Enable AI assistants to seamlessly interact with Discord servers, channels, and messages.
Tags: integrations, discord, chat

🔗 Knit MCP
Connect AI agents to 200+ SaaS applications and automate workflows.
Tags: integrations, automation, saas

🕷️ Apify MCP Server
Deploy and interact with Apify actors for web scraping and data extraction.
Tags: apify, crawler, data

🌐 BrowserStack MCP
BrowserStack MCP Server for automated testing across multiple browsers.
Tags: testing, qa, browsers

Zapier MCP
A Zapier server that provides automation capabilities for various apps.
Tags: zapier, automation