Scrapix V2 #101

Open: wants to merge 36 commits into base: main
36 commits (changes from all commits):
e49c556 - change to cheerio for scraping, keep puppeteer for crawling. (qdequele, Sep 22, 2024)
5f461e7 - big update (qdequele, Oct 12, 2024)
aca8cb1 - merge the maximum code of the crawlers together (qdequele, Oct 12, 2024)
896e6c1 - another big commit. (qdequele, Oct 13, 2024)
45a203f - add markdown scraper (qdequele, Oct 13, 2024)
6195809 - add custom scraper (qdequele, Oct 13, 2024)
60f2416 - remove startCrawl; comment playwright (qdequele, Oct 19, 2024)
b3db1bb - fix #99 (qdequele, Nov 8, 2024)
550a4ab - update packages (qdequele, Nov 8, 2024)
6e17d28 - fix #56: Throw error when redis server is not answering (qdequele, Nov 8, 2024)
a79061b - fix #48: add the automatic detection of 404 pages to skip with the po… (qdequele, Nov 8, 2024)
536326a - By default use cheerio instead of Puppeteer #113 (qdequele, Nov 9, 2024)
ad970e6 - fix #112: Remove the useless headless option (qdequele, Nov 9, 2024)
49324fc - remove launcher_option and launcher (qdequele, Nov 9, 2024)
c2ba9bf - Update Documentation (qdequele, Nov 9, 2024)
a43ee93 - fix #103: Keep the previous settings (qdequele, Nov 10, 2024)
f5e9944 - fix #102: Load the sitemap as starter point for crawling. (qdequele, Nov 10, 2024)
8e1adef - add a new playground (qdequele, Nov 15, 2024)
62b8dce - extract sitemap (qdequele, Nov 15, 2024)
2105cc7 - add pdf scraper (qdequele, Nov 15, 2024)
81be6b4 - Update testing (qdequele, Nov 15, 2024)
27db8e4 - add the pdfs on the playground (qdequele, Nov 22, 2024)
abe215f - add a lot of pages (qdequele, Nov 22, 2024)
ec4dd44 - make docker works for playground, scrapix and meilisearch (qdequele, Nov 30, 2024)
a836dd8 - full working base with zod #35 (qdequele, Nov 30, 2024)
7837b39 - fix tests (qdequele, Nov 30, 2024)
1234229 - update github CI (qdequele, Nov 30, 2024)
713a6a6 - update test CI (qdequele, Nov 30, 2024)
55a3d0e - Update Node.js version in GitHub Actions workflow from 18 to 20 (qdequele, Nov 30, 2024)
8d73bbb - Make wait-for-it.sh executable in GitHub Actions workflow (qdequele, Nov 30, 2024)
36bde9e - Remove deprecated configuration files for previous tests (qdequele, Nov 30, 2024)
6ec776c - Refactor BaseTest and ScraperTestHelper to streamline index UID handling (qdequele, Nov 30, 2024)
f1a394b - use start_urls as crawling pages (qdequele, Jan 20, 2025)
9c1192f - remove unecessary pagination detection leading to avoid scraping inte… (qdequele, Jan 20, 2025)
d88def5 - update and simplify scraper (qdequele, Jan 20, 2025)
a90004f - Improve Meilisearch index settings handling and remove debug logging (qdequele, Feb 4, 2025)
1 change: 1 addition & 0 deletions .eslintrc.cjs
@@ -39,6 +39,7 @@ module.exports = {
    '@typescript-eslint/return-await': 'off',
    '@typescript-eslint/no-explicit-any': 'off',
    '@typescript-eslint/explicit-function-return-type': 'off',
    "@typescript-eslint/no-unsafe-assignment": "off",
    '@typescript-eslint/member-delimiter-style': [
      'error',
      {
40 changes: 0 additions & 40 deletions .github/scripts/scrapix_server_call_check.sh

This file was deleted.

49 changes: 49 additions & 0 deletions .github/scripts/wait-for-it.sh
@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Use: ./wait-for-it.sh host:port [-t timeout] [-- command args]
# From: https://github.com/vishnubob/wait-for-it

WAITFORIT_cmdname=${0##*/}

echoerr() { if [[ $WAITFORIT_QUIET -ne 1 ]]; then echo "$@" 1>&2; fi }

usage()
{
    cat << USAGE >&2
Usage:
    $WAITFORIT_cmdname host:port [-t timeout] [-- command args]
    -h HOST | --host=HOST            Host or IP under test
    -p PORT | --port=PORT            TCP port under test
    -t TIMEOUT | --timeout=TIMEOUT   Timeout in seconds, zero for no timeout
    -- COMMAND ARGS                  Execute command with args after the test finishes
USAGE
    exit 1
}

wait_for()
{
    if [[ $WAITFORIT_TIMEOUT -gt 0 ]]; then
        echoerr "$WAITFORIT_cmdname: waiting $WAITFORIT_TIMEOUT seconds for $WAITFORIT_HOST:$WAITFORIT_PORT"
    else
        echoerr "$WAITFORIT_cmdname: waiting for $WAITFORIT_HOST:$WAITFORIT_PORT without a timeout"
    fi
    WAITFORIT_start_ts=$(date +%s)
    while :
    do
        if [[ $WAITFORIT_ISBUSY -eq 1 ]]; then
            nc -z $WAITFORIT_HOST $WAITFORIT_PORT
            WAITFORIT_result=$?
        else
            (echo -n > /dev/tcp/$WAITFORIT_HOST/$WAITFORIT_PORT) >/dev/null 2>&1
            WAITFORIT_result=$?
        fi
        if [[ $WAITFORIT_result -eq 0 ]]; then
            WAITFORIT_end_ts=$(date +%s)
            echoerr "$WAITFORIT_cmdname: $WAITFORIT_HOST:$WAITFORIT_PORT is available after $((WAITFORIT_end_ts - WAITFORIT_start_ts)) seconds"
            break
        fi
        sleep 1
    done
    return $WAITFORIT_result
}

# Rest of the script...
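
A minimal usage sketch of the script above, mirroring how the CI workflow below invokes it; the port and the trailing command are illustrative only.

# Wait up to 60 seconds for Meilisearch to accept TCP connections on localhost:7700,
# then run a follow-up command (a placeholder echo here).
./.github/scripts/wait-for-it.sh localhost:7700 -t 60 -- echo "Meilisearch is reachable"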
73 changes: 73 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,73 @@
name: Test

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: "20"
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build

      - name: Install Docker Compose
        run: |
          sudo apt-get update
          sudo apt-get install -y docker-compose

      - name: Start test environment
        run: |
          docker-compose up -d
          docker ps -a

      - name: Make wait-for-it.sh executable
        run: chmod +x .github/scripts/wait-for-it.sh

      - name: Wait for services
        run: |
          .github/scripts/wait-for-it.sh localhost:7700 -t 60
          .github/scripts/wait-for-it.sh localhost:3000 -t 60
          .github/scripts/wait-for-it.sh localhost:8080 -t 60
          sleep 10 # Give services extra time to fully initialize

      - name: Debug service logs
        if: always()
        run: |
          echo "=== Meilisearch Logs ==="
          docker-compose logs meilisearch
          echo "=== Playground Logs ==="
          docker-compose logs playground
          echo "=== Scraper Logs ==="
          docker-compose logs scraper
          echo "=== Redis Logs ==="
          docker-compose logs redis

      - name: Run tests
        run: npm run test

      - name: Show test logs on failure
        if: failure()
        run: |
          echo "=== Service Status ==="
          docker-compose ps
          echo "=== Recent Logs ==="
          docker-compose logs --tail=100

      - name: Cleanup
        if: always()
        run: docker-compose down -v
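
A rough sketch of the same sequence for local debugging; it assumes the repository's docker-compose file defines the meilisearch, playground, scraper, and redis services referenced in the log steps, and the port-to-service mapping is an assumption.

# Approximate the CI flow on a developer machine (assumes docker-compose.yml at the repo root).
npm ci && npm run build
docker-compose up -d
chmod +x .github/scripts/wait-for-it.sh
.github/scripts/wait-for-it.sh localhost:7700 -t 60   # Meilisearch
.github/scripts/wait-for-it.sh localhost:3000 -t 60   # playground (assumed mapping)
.github/scripts/wait-for-it.sh localhost:8080 -t 60   # scraper server (assumed mapping)
npm run test
docker-compose down -v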
77 changes: 0 additions & 77 deletions .github/workflows/tests.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -71,6 +71,7 @@ typings/
# dotenv environment variables file
.env
.env.test
.env.local

# parcel-bundler cache (https://parceljs.org/)
.cache
12 changes: 6 additions & 6 deletions Dockerfile
@@ -1,25 +1,25 @@
# Specify the base Docker image. You can read more about
# the available images at https://crawlee.dev/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node-puppeteer-chrome:18 AS builder
FROM apify/actor-node-puppeteer-chrome:20 AS builder

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install all dependencies. Don't audit to speed up the installation.
RUN yarn install --production=false
RUN npm install --include=dev

# Next, copy the source files using the user set
# in the base image.
COPY --chown=myuser . ./

# Install all dependencies and build the project.
# Don't audit to speed up the installation.
RUN yarn run build
RUN npm run build

# Create final image
FROM apify/actor-node-puppeteer-chrome:18
FROM apify/actor-node-puppeteer-chrome:20

# Copy only built JS files from builder image
COPY --from=builder --chown=myuser /home/myuser/dist ./dist
@@ -31,7 +31,7 @@ COPY --chown=myuser package*.json ./
# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN yarn install --production=false
RUN npm install

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
@@ -40,4 +40,4 @@ COPY --chown=myuser . ./

# Run the image. If you know you won't need headful browsers,
# you can remove the XVFB start script for a micro perf gain.
CMD ./start_xvfb_and_run_cmd.sh && yarn start:prod -- -c $CRAWLER_CONFIG -b /usr/bin/google-chrome --silent
CMD ./start_xvfb_and_run_cmd.sh && npm run start:server -- -c $CRAWLER_CONFIG -b /usr/bin/google-chrome --silent
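
As an illustration of the updated npm-based CMD, a hedged build-and-run sketch follows; the image tag, the Meilisearch URL, and the config contents are placeholders, and CRAWLER_CONFIG is passed as the JSON string the CMD expects.

# Build the image and launch a crawl with a placeholder configuration.
docker build -t scrapix .
docker run --rm \
  -e CRAWLER_CONFIG='{"start_urls":["https://example.com"],"meilisearch_url":"http://host.docker.internal:7700","meilisearch_api_key":"masterKey","meilisearch_index_uid":"example"}' \
  scrapix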
13 changes: 7 additions & 6 deletions README.md
@@ -33,8 +33,7 @@ data:
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "google",
  "strategy": "default", // docssearch, schema*, custom or default
  "headless": true, // Use headless browser for rendering javascript websites
  "strategy": "default", // docssearch, schema*, custom, markdown or default
  "batch_size": 1000, // pass null to send documents 1 at a time or specify a batch size
  "primary_key": null,
  "meilisearch_settings": {
@@ -52,6 +51,12 @@
    "filterableAttributes": ["urls_tags"],
    "distinctAttribute": "url"
  },
  "selectors": { // Only for custom
    "main_content": "main",
    "headings": "h1, h2, h3",
    "paragraphs": "p",
    "custom_field": ".custom-class",
  },
  "schema_settings": {
    "only_type": "Product", // Product, Article, etc...
    "convert_dates": true // default false
@@ -159,10 +164,6 @@ Name of the index on which the content is indexed.
default: `default`
Scraping strategy: - `default` Scrapes the content of webpages, it is suitable for most use cases. It indexes the content in this format (show example) - `docssearch` Scrapes the content of webpages, it suits most use cases. The difference with the default strategy is that it indexes the content in a format compatible with docs-search bar - `schema` Scraps the [`schema`](https://schema.org/) information of your web app.

`headless`
default: `true`
Wether or not the javascript should be loaded before scraping starts.

`primary_key`
The key name in your documents containing their unique identifier.

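To illustrate the custom strategy and the new selectors block documented in this README change, here is a hedged request sketch; the /crawl endpoint and port 8080 are assumptions (8080 is the port the CI workflow waits on), while the field names and selector values come from the README above.

# Hypothetical crawl request against a locally running scraper server.
curl -X POST 'http://localhost:8080/crawl' \
  -H 'Content-Type: application/json' \
  --data '{
    "start_urls": ["https://example.com"],
    "meilisearch_url": "http://localhost:7700",
    "meilisearch_api_key": "masterKey",
    "meilisearch_index_uid": "example",
    "strategy": "custom",
    "selectors": {
      "main_content": "main",
      "headings": "h1, h2, h3",
      "paragraphs": "p",
      "custom_field": ".custom-class"
    }
  }'
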
5 changes: 0 additions & 5 deletions config/nodemon:build.json

This file was deleted.

5 changes: 0 additions & 5 deletions config/nodemon:default-scrap.json

This file was deleted.

5 changes: 0 additions & 5 deletions config/nodemon:docsearch-scrap.json

This file was deleted.

27 changes: 0 additions & 27 deletions docker-compose.dev.yml

This file was deleted.
