Scheduled Firecrawl Action

cameronking4 released this 14 Jan 00:25
· 12 commits to main since this release

🚀 Introducing Automated Web Crawling

  • Uses Firecrawl to crawl and scrape a specified URL and export its contents as markdown.
  • Install and configure the GitHub Action to commit the output directly into your GitHub repo automatically.
  • Saves the generated content into the /knowledge_bases folder for seamless version control.

name: Scheduled Crawl Action

# This workflow will automatically crawl the specified URL on a schedule and commit the results to your repository.

on:
  schedule:
    - cron: '0 0 * * *'  # Replace with the cron expression for the schedule you want to use (e.g., '0 0 * * *' for daily at midnight UTC)
  workflow_dispatch:  # Allow manual triggering

jobs:
  crawl:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      id-token: write
      actions: read

    steps:
      - uses: actions/checkout@v4
      
      - name: Run Firecrawl Action
        uses: cameronking4/nextjs-firecrawl-starter@v1.0.0
        with:
          url: 'https://news.ycombinator.com' # Replace with the URL you want to crawl regularly
          output_folder: 'knowledge_bases' # Replace with the folder name where the output commits will be saved
          api_url: 'https://nextjs-firecrawl-starter.vercel.app' # Replace with the URL of your Firecrawl API endpoint; this is the default URL for the starter app.

📅 Scheduled Crawling with GitHub Actions

  • Integrated a scheduled workflow to automatically run the crawler at defined intervals.
  • Default schedule: daily at midnight UTC (0 0 * * *).
  • Keeps documentation continuously updated without manual intervention.

🛠 How It Works

Trigger:

  • The workflow runs automatically on schedule or manually via GitHub Actions.

Process:

  1. Checks out the repository.
  2. Runs the Firecrawl crawler.
  3. Commits and pushes updated markdown files to /knowledge_bases.
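
The release notes do not show how the action performs step 3 internally. As a rough illustration only, a hand-written step that commits and pushes the crawl output could look like the sketch below. It assumes the crawl has already written files into knowledge_bases, that the job has contents: write permission (as in the workflow above), and the commit message is a hypothetical placeholder.

```yaml
steps:
  - uses: actions/checkout@v4

  # ... the crawl step would run here and write markdown into knowledge_bases ...

  # Illustrative sketch, not the action's actual implementation.
  - name: Commit and push crawl output
    run: |
      git config user.name "github-actions[bot]"
      git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
      git add knowledge_bases
      # Commit only when the crawl actually changed something
      if git diff --cached --quiet; then
        echo "No changes to commit"
      else
        git commit -m "Update crawled content"
        git push
      fi
```

Checking git diff --cached --quiet first prevents the job from failing on days when the crawled page has not changed.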

📂 New Files Introduced

  • .github/workflows/crawl-docs.yml: Defines the scheduled crawling process.
  • action.yml: Encapsulates the custom crawling logic for easy reuse and integration.
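
For orientation, the three inputs used in the workflow above (url, output_folder, api_url) would be declared in action.yml. The following is a minimal sketch of such a metadata file, not the file shipped in this release: the descriptions, the defaults, the composite runner, and the single echo step are all assumptions made for illustration.

```yaml
# Hypothetical action.yml sketch; the real file in the repository may differ.
name: 'Firecrawl Crawl (illustrative sketch)'
description: 'Crawl a URL and write the result as markdown'
inputs:
  url:
    description: 'URL to crawl'
    required: true
  output_folder:
    description: 'Folder that receives the generated markdown'
    required: false
    default: 'knowledge_bases'
  api_url:
    description: 'Base URL of the Firecrawl API endpoint'
    required: false
    default: 'https://nextjs-firecrawl-starter.vercel.app'
runs:
  using: 'composite'
  steps:
    # Placeholder step: only prints the resolved inputs.
    - name: Show resolved inputs
      shell: bash
      env:
        CRAWL_URL: ${{ inputs.url }}
        OUTPUT_FOLDER: ${{ inputs.output_folder }}
      run: echo "Crawling $CRAWL_URL into $OUTPUT_FOLDER"
```

Passing the inputs through env rather than interpolating them directly into the script keeps an unusual URL from being interpreted as shell syntax.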

⚙️ Customization

  • Adjust the crawl frequency by editing the cron expression in crawl-docs.yml.
  • Run the action manually for instant documentation updates.
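
As an example of adjusting the frequency, the schedule block in crawl-docs.yml could be changed along these lines. The specific schedules below are illustrative choices, not recommendations from the release. GitHub Actions uses five-field cron syntax evaluated in UTC, and the shortest supported interval is every 5 minutes.

```yaml
on:
  schedule:
    # Pick one; all times are UTC.
    - cron: '0 */6 * * *'     # every 6 hours, on the hour
    # - cron: '30 2 * * 1-5'  # 02:30 on weekdays (Monday to Friday)
    # - cron: '0 0 1 * *'     # midnight on the first day of each month
  workflow_dispatch:  # keep manual triggering available
```

Scheduled runs can start later than the stated time when GitHub is under heavy load, so avoid relying on exact timing.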

📈 Benefits

  • Fully Automated: No need to manually trigger the crawl.
  • Always Updated: Ensures the knowledge base stays current.
  • Seamless Integration: Easily extend or modify the workflow to suit project needs.

Full Changelog: https://github.com/cameronking4/nextjs-firecrawl-starter/commits/v1.0.0