🚀 Introducing Automated Web Crawling
- This action uses Firecrawl to crawl and scrape a specified URL and export its contents as Markdown.
- Install and configure the GitHub Action to commit the output directly to your GitHub repo automatically.
- Saves the generated content to the `/knowledge_bases` folder for seamless version control.
name: Scheduled Crawl Action
# This workflow automatically crawls the specified URL on a schedule and commits the results to your repository.
on:
  schedule:
    - cron: '0 0 * * *' # Replace with the cron expression for the schedule you want to use (e.g., '0 0 * * *' for daily at midnight UTC)
  workflow_dispatch: # Allow manual triggering
jobs:
  crawl:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      id-token: write
      actions: read
    steps:
      - uses: actions/checkout@v4
      - name: Run Firecrawl Action
        uses: cameronking4/nextjs-firecrawl-starter@v1.0.0
        with:
          url: 'https://news.ycombinator.com' # Replace with the URL you want to crawl regularly
          output_folder: 'knowledge_bases' # Replace with the folder name where the output commits will be saved
          api_url: 'https://nextjs-firecrawl-starter.vercel.app' # Replace with the URL of your Firecrawl API endpoint; this is the default URL for the starter app
📅 Scheduled Crawling with GitHub Actions
- Integrated a scheduled workflow to automatically run the crawler at defined intervals.
- Default schedule: Daily at midnight UTC (`0 0 * * *`).
- Keeps documentation continuously updated without manual intervention.
🛠 How It Works
Trigger:
- The workflow runs automatically on schedule or manually via GitHub Actions.
Process:
- Checks out the repository.
- Runs the Firecrawl crawler.
- Commits and pushes updated markdown files to `/knowledge_bases`.
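To illustrate the final step of the process, here is a minimal sketch of an explicit commit-and-push step. This is an assumption for illustration only: the release notes describe the action as committing on its own, so a step like this should not be needed alongside it. The commit message is hypothetical, and the folder name comes from the `output_folder` value in the workflow.

```yaml
# Hypothetical sketch of a commit step; not taken from the action's source.
# Requires `contents: write` permission on the job.
- name: Commit crawled markdown
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add knowledge_bases
    # Commit only when the crawl actually changed something
    git diff --cached --quiet || git commit -m "Update crawled content"
    git push
```

Guarding the commit with `git diff --cached --quiet` keeps a scheduled run from failing on days when the crawled content has not changed.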
📂 New Files Introduced
- `.github/workflows/crawl-docs.yml`: Defines the scheduled crawling process.
- `action.yml`: Encapsulates the custom crawling logic for easy reuse and integration.
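For readers unfamiliar with action metadata files, the following is a rough sketch of how an `action.yml` could declare the three inputs used in the workflow. It is not the repository's actual file: the composite `runs` section, the step name, the environment variable names, and the `output_folder` default are all assumptions, and the placeholder step only echoes its inputs instead of crawling.

```yaml
# Hypothetical action.yml sketch; the real file in the repository may differ.
name: 'Firecrawl Crawl (sketch)'
description: 'Crawl a URL and save the result as markdown'
inputs:
  url:
    description: 'URL to crawl'
    required: true
  output_folder:
    description: 'Folder that receives the generated markdown'
    required: false
    default: 'knowledge_bases' # assumed default
  api_url:
    description: 'Base URL of the Firecrawl API endpoint'
    required: false
    default: 'https://nextjs-firecrawl-starter.vercel.app'
runs:
  using: 'composite'
  steps:
    - name: Show resolved inputs (placeholder for the real crawl logic)
      shell: bash
      # Pass inputs through env vars rather than interpolating them into the script
      env:
        CRAWL_URL: ${{ inputs.url }}
        API_URL: ${{ inputs.api_url }}
        OUTPUT_FOLDER: ${{ inputs.output_folder }}
      run: |
        echo "Would crawl $CRAWL_URL via $API_URL"
        echo "Would write markdown to $OUTPUT_FOLDER"
```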
⚙️ Customization
- Adjust the crawl frequency by editing the cron expression in `crawl-docs.yml`.
- Run the action manually for instant documentation updates.
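As a sketch of what adjusting the frequency might look like, the `on:` block below swaps the daily schedule for a six-hourly one. The specific schedules are illustrative choices, not recommendations from the project. GitHub Actions evaluates the five-field cron expression in UTC.

```yaml
on:
  schedule:
    # Every six hours, on the hour (00:00, 06:00, 12:00, 18:00 UTC)
    - cron: '0 */6 * * *'
    # Alternative: Mondays at 05:30 UTC
    # - cron: '30 5 * * 1'
  workflow_dispatch: # Keeps manual runs available from the Actions tab
```

Scheduled runs can be delayed during periods of high load, so the times above are best treated as approximate.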
📈 Benefits
- Fully Automated: No need to manually trigger the crawl.
- Always Updated: Ensures the knowledge base stays current.
- Seamless Integration: Easily extend or modify the workflow to suit project needs.
Full Changelog: https://github.com/cameronking4/nextjs-firecrawl-starter/commits/v1.0.0