This repository contains a simple Python script: a web crawler for scraping documentation websites, or specific sections of them, based on a URL pattern. The script downloads each matching page, converts it to markdown, and saves it locally, preserving the site's directory structure and converting internal links into relative markdown links.
This tool is particularly useful in the context of AI-driven development. By downloading and converting online documentation into a local, structured markdown format, you can:
- Provide Context to LLMs: When using AI code editors like Roo or Cline, you can point them to the local markdown files, allowing them to reference the indexed documentation directly. This can greatly improve the AI's accuracy when working with specific APIs or libraries.
- Offline Development: Access documentation locally without needing constant internet access, useful for offline work or environments with restricted connectivity.
- Consistency: Ensure that the AI is referencing a specific, consistent version of the documentation.
Example Use Case:
Imagine you are building an application using a complex API (e.g., the Stripe API). You can use this scraper with the pattern `https://docs.stripe.com/api*` to download the entire API reference (the sketch below shows how the `*` wildcard matches URLs).
Then, when interacting with an AI coding assistant like Roo or Cline, simply point it to the local markdown files by adding the following to the system prompt:
The Stripe documentation can be found in the /docs folder in a series of indexed markdown files.
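The `*` wildcard behaves like a shell-style glob. Whether `main.py` matches patterns exactly this way is an assumption, but Python's standard `fnmatch` module illustrates the semantics:

```python
from fnmatch import fnmatchcase

# Hypothetical pattern from the Stripe example above.
PATTERN = "https://docs.stripe.com/api*"

# In glob matching, * matches any run of characters, including slashes.
# fnmatchcase is used so matching stays case-sensitive on every OS.
assert fnmatchcase("https://docs.stripe.com/api", PATTERN)
assert fnmatchcase("https://docs.stripe.com/api/charges/create", PATTERN)
assert not fnmatchcase("https://docs.stripe.com/payments", PATTERN)
```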
- HTML to Markdown Conversion: Extracts the main content from pages and converts it to clean markdown.
- Relative Link Preservation: Converts links between scraped pages into relative markdown links, allowing local navigation of the downloaded documentation.
- Directory Structure Preservation: Saves the markdown files within nested directories, mirroring the path structure of the original website.
- Pattern-Based Crawling: Scrapes only the pages matching a user-defined URL pattern (supports the `*` wildcard).
- Asynchronous Operation: Uses `asyncio` and `aiohttp` for efficient, concurrent downloading of pages (a minimal sketch of this approach appears after this list).
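The concurrent download loop is the heart of the crawler. The actual code in `main.py` may be structured differently, but a minimal sketch of the `asyncio`/`aiohttp` approach (with illustrative function names) looks like this:

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    """Download one page and return its HTML."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()


async def fetch_all(urls: list[str]) -> list[str]:
    """Download many pages concurrently over one shared connection pool."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all([
        "https://code.visualstudio.com/api",
        "https://code.visualstudio.com/api/get-started/your-first-extension",
    ]))
```

Sharing one `ClientSession` across all requests lets `aiohttp` reuse connections, which is what makes the concurrent crawl efficient.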
Install dependencies using:
`pip install -r requirements.txt`
Run the script:
`python main.py`
The script will prompt you to enter:
- The root URL: the starting page to begin crawling from (e.g., `https://code.visualstudio.com/api`).
- The URL pattern: the pattern to match pages you want to scrape (e.g., `https://code.visualstudio.com/api*`).
The scraped markdown files will be saved in the `docs/` directory.
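The exact file layout under `docs/` isn't specified beyond "mirrors the original site", so the mapping below is one plausible scheme, not necessarily the one `main.py` uses; both helper names are hypothetical:

```python
import os
from urllib.parse import urlparse


def local_path(url: str, out_dir: str = "docs") -> str:
    """Map a page URL to a markdown path mirroring the site's structure.

    Hypothetical layout: the script's actual naming scheme may differ,
    e.g. it might include the hostname or write index files.
    """
    path = urlparse(url).path.strip("/") or "index"
    return os.path.join(out_dir, path + ".md")


def relative_link(from_url: str, to_url: str) -> str:
    """Compute a relative markdown link from one saved page to another."""
    target = local_path(to_url)
    source_dir = os.path.dirname(local_path(from_url))
    return os.path.relpath(target, source_dir)


# e.g. a link from /api to /api/extension-guides is rewritten as
# "api/extension-guides.md", relative to the source file.
print(relative_link("https://code.visualstudio.com/api",
                    "https://code.visualstudio.com/api/extension-guides"))
```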
This project is licensed under the terms of the MIT License.