Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H-3528, H-3529: Set up Pdfium & preprocess PDFs as images #5512

Merged
merged 32 commits into from
Nov 16, 2024

Conversation

JesusFileto
Copy link
Member

@JesusFileto JesusFileto commented Oct 29, 2024

🌟 What is the purpose of this PR?

This intial PR serves as the first small step in the segmenting, chunking, and embedding package Chonky. Currently, we are setting up the environment to receive a file path to a pdf, and preprocess the pdf into images using Pdfium-render that will be used for structural embeddings later on.

This PR also sets up the Pdfium-render to be used for text extraction.

🔗 Related links

(internal)
Implementation Doc

🚫 Blocked by

N/A

🔍 What does this change?

  • Loading Pdfs given a local file path
  • Convert Pdfs to page by page images

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • modifies a Cargo-publishable library, but it is not yet ready to publish

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

⚠️ Known issues

  • Need for proper implementation and use of error-stack for error handling

🐾 Next steps

  • Extracting textual information from pdfs using pdfium-render and enriching schema with document layout information
  • Implement clap for parsing CLI arguments

🛡 What tests cover this?

  • New tests were created

❓ How to test this?

  • running the command cargo run tests/docs/test-doc.pdf will save the pdf images to the out direction in the Chonky package

📹 Demo

@CLAassistant
Copy link

CLAassistant commented Oct 29, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added area/libs Relates to first-party libraries/crates/packages (area) area/tests New or updated tests area/libs > chonky Affects the `chonky` crate (library) labels Oct 29, 2024
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
@TimDiekmann TimDiekmann self-requested a review October 29, 2024 07:43
@vilkinsons vilkinsons changed the title Implement Preprocessing of Pdfs Into Images + Setup Pdfium H-3528, H-3529: Set up Pdfium & preprocess PDFs as images Oct 29, 2024
@vilkinsons vilkinsons requested a review from CiaranMn October 29, 2024 09:14
@github-actions github-actions bot added the area/deps Relates to third-party dependencies (area) label Oct 29, 2024
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/main.rs Fixed Show fixed Hide fixed
libs/chonky/src/main.rs Fixed Show fixed Hide fixed
libs/chonky/src/main.rs Fixed Show fixed Hide fixed
libs/chonky/src/main.rs Fixed Show fixed Hide fixed
libs/chonky/src/main.rs Fixed Show fixed Hide fixed
Copy link

codecov bot commented Oct 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 19.83%. Comparing base (24f2260) to head (3e5647c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5512   +/-   ##
=======================================
  Coverage   19.83%   19.83%           
=======================================
  Files         515      515           
  Lines       17327    17327           
  Branches     2548     2548           
=======================================
  Hits         3437     3437           
  Misses      13852    13852           
  Partials       38       38           
Flag Coverage Δ
apps.hash-ai-worker-ts 1.38% <ø> (ø)
apps.hash-api 1.17% <ø> (ø)
blockprotocol.type-system 47.40% <ø> (ø)
local.hash-backend-utils 8.80% <ø> (ø)
local.hash-graph-sdk 100.00% <ø> (ø)
local.hash-isomorphic-utils 1.05% <ø> (ø)
local.hash-subgraph 24.54% <ø> (ø)
rust.deer 6.66% <ø> (ø)
rust.error-stack 72.51% <ø> (ø)
rust.sarif 87.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@TimDiekmann TimDiekmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JesusFileto!

I had a look through the PR and so far it really looks good! I have a few minor suggestions, but really nothing critical.

libs/chonky/Cargo.toml Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/src/main.rs Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Fixed Show fixed Hide fixed
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/libs/libpdfium.dylib Outdated Show resolved Hide resolved
Copy link
Member

@indietyp indietyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! 👋 I am a contributor to HASH working on a couple of things and got curious when I saw this PR, so I thought I'd leave some comments. Don't feel any pressure to implement any of these, just hoping you'll find these helpful in some way! 😊

Cargo.toml Outdated Show resolved Hide resolved
libs/chonky/libs/libpdfium.dylib Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/tests/docs/test-doc.pdf Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
libs/chonky/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Member

@TimDiekmann TimDiekmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor suggestion on where to put snapshots and added comments to the .github/ files I changed. Having arbitrary folders in src/ might be misleading as typically, every directory inside of src is an actual module.

I think we can safe bigger refactoring such as moving Pdfium to a struct as we discussed offline for a later PR to get this PR over the line.

.github/actions/build-docker-images/action.yml Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a step for the test-workflow to download the .so file to link dynamically.

libs/chonky/src/snapshots/chonky__tests__page_1.snap Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/deps Relates to third-party dependencies (area) area/infra Relates to version control, CI, CD or IaC (area) area/libs > chonky Affects the `chonky` crate (library) area/libs Relates to first-party libraries/crates/packages (area) area/tests New or updated tests
Development

Successfully merging this pull request may close these issues.

6 participants