This repository contains a simple Python script: a web crawler for scraping documentation websites, or specific sections of them, based on a URL pattern. The script downloads each matching page, converts it to markdown, and saves it locally, preserving the site's directory structure and converting internal links into relative markdown links.
This tool is particularly useful in the context of AI-driven development. By downloading and converting online documentation into a local, structured markdown format, you can:
- Provide Context to LLMs: When using AI code editors like Roo or Cline, you can point them to the local markdown files, allowing them to reference the indexed documentation directly. This can greatly improve the AI's accuracy when working with specific APIs or libraries.
- Offline Development: Access documentation locally without needing constant internet access, useful for offline work or environments with restricted connectivity.
- Consistency: Ensure that the AI is referencing a specific, consistent version of the documentation.
Example Use Case:
Imagine you are building an application using a complex API (e.g., the Stripe API). You can use this scraper with the pattern `https://docs.stripe.com/api*` to download the entire API reference (the sketch below shows how the `*` wildcard matches URLs).
Then, when interacting with an AI coding assistant like Roo or Cline, simply point it to the local markdown files by adding the following to the system prompt:
The Stripe documentation can be found in the /docs folder in a series of indexed markdown files.
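The `*` wildcard behaves like a shell-style glob. Whether `main.py` matches patterns exactly this way is an assumption, but Python's standard `fnmatch` module illustrates the semantics:

```python
from fnmatch import fnmatchcase

# Hypothetical pattern from the Stripe example above.
PATTERN = "https://docs.stripe.com/api*"

# In glob matching, * matches any run of characters, including slashes.
# fnmatchcase is used so matching stays case-sensitive on every OS.
assert fnmatchcase("https://docs.stripe.com/api", PATTERN)
assert fnmatchcase("https://docs.stripe.com/api/charges/create", PATTERN)
assert not fnmatchcase("https://docs.stripe.com/payments", PATTERN)
```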
- HTML to Markdown Conversion: Extracts the main content from pages and converts it to clean markdown.
- Relative Link Preservation: Converts links between scraped pages into relative markdown links, allowing local navigation of the downloaded documentation.
- Directory Structure Preservation: Saves the markdown files within nested directories, mirroring the path structure of the original website.
- Pattern-Based Crawling: Scrapes only the pages matching a user-defined URL pattern (supports the `*` wildcard).
- Asynchronous Operation: Uses `asyncio` and `aiohttp` for efficient, concurrent downloading of pages (a minimal sketch of this approach appears after this list).
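The concurrent download loop is the heart of the crawler. The actual code in `main.py` may be structured differently, but a minimal sketch of the `asyncio`/`aiohttp` approach (with illustrative function names) looks like this:

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    """Download one page and return its HTML."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()


async def fetch_all(urls: list[str]) -> list[str]:
    """Download many pages concurrently over one shared connection pool."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all([
        "https://code.visualstudio.com/api",
        "https://code.visualstudio.com/api/get-started/your-first-extension",
    ]))
```

Sharing one `ClientSession` across all requests lets `aiohttp` reuse connections, which is what makes the concurrent crawl efficient.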
Install dependencies using:
`pip install -r requirements.txt`
Run the script:
`python main.py`
The script will prompt you to enter:
- The root URL: the starting page to begin crawling from (e.g., `https://code.visualstudio.com/api`).
- The URL pattern: the pattern to match pages you want to scrape (e.g., `https://code.visualstudio.com/api*`).
The scraped markdown files will be saved in the `docs/` directory.
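The exact file layout under `docs/` isn't specified beyond "mirrors the original site", so the mapping below is one plausible scheme, not necessarily the one `main.py` uses; both helper names are hypothetical:

```python
import os
from urllib.parse import urlparse


def local_path(url: str, out_dir: str = "docs") -> str:
    """Map a page URL to a markdown path mirroring the site's structure.

    Hypothetical layout: the script's actual naming scheme may differ,
    e.g. it might include the hostname or write index files.
    """
    path = urlparse(url).path.strip("/") or "index"
    return os.path.join(out_dir, path + ".md")


def relative_link(from_url: str, to_url: str) -> str:
    """Compute a relative markdown link from one saved page to another."""
    target = local_path(to_url)
    source_dir = os.path.dirname(local_path(from_url))
    return os.path.relpath(target, source_dir)


# e.g. a link from /api to /api/extension-guides is rewritten as
# "api/extension-guides.md", relative to the source file.
print(relative_link("https://code.visualstudio.com/api",
                    "https://code.visualstudio.com/api/extension-guides"))
```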
This project is licensed under the terms of the MIT License.