H-3528, H-3529: Set up Pdfium & preprocess PDFs as images #5512
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #5512 +/- ##
=======================================
Coverage 19.83% 19.83%
=======================================
Files 515 515
Lines 17327 17327
Branches 2548 2548
=======================================
Hits 3437 3437
Misses 13852 13852
Partials 38 38
Thanks @JesusFileto!
I had a look through the PR and so far it really looks good! I have a few minor suggestions, but really nothing critical.
Hi! 👋 I am a contributor to HASH working on a couple of things and got curious when I saw this PR, so I thought I'd leave some comments. Don't feel any pressure to implement any of these, just hoping you'll find these helpful in some way! 😊
I have a minor suggestion on where to put snapshots and added comments to the .github/ files I changed. Having arbitrary folders in src/ might be misleading, as typically every directory inside of src is an actual module.
I think we can save bigger refactoring, such as moving Pdfium to a struct as we discussed offline, for a later PR to get this PR over the line.
.github/workflows/test.yml
I added a step for the test-workflow to download the .so file to link dynamically.
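For context on "the .so file": Pdfium is distributed as a shared library whose file name differs per platform. The following is a minimal sketch, using only the Rust standard library, of how the expected file name could be derived. The `libs` directory and the `pdfium_library_path` helper are hypothetical and are not taken from this PR.

```rust
use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds the platform-specific path of the Pdfium
/// shared library inside `dir` (for example `libpdfium.so` on Linux).
fn pdfium_library_path(dir: &Path) -> PathBuf {
    dir.join(format!("{}pdfium{}", DLL_PREFIX, DLL_SUFFIX))
}

fn main() {
    // `libs` is an assumed download location, not one named in the PR.
    let path = pdfium_library_path(Path::new("libs"));
    println!("expecting Pdfium at {}", path.display());
}
```

A CI step that downloads the library would need to place it wherever the binding code looks for it; the sketch only shows the naming side of that contract.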
🌟 What is the purpose of this PR?
This initial PR serves as the first small step in the segmenting, chunking, and embedding package Chonky. Currently, we are setting up the environment to receive a file path to a PDF and preprocess the PDF into images using Pdfium-render; these images will be used for structural embeddings later on. This PR also sets up Pdfium-render to be used for text extraction.
🔗 Related links
(internal) Implementation Doc
🚫 Blocked by
N/A
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
error-stack for error handling
🐾 Next steps
pdfium-render and enriching schema with document layout information
clap for parsing CLI arguments
🛡 What tests cover this?
❓ How to test this?
cargo run tests/docs/test-doc.pdf will save the PDF images to the out directory in the Chonky package
📹 Demo
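As a rough illustration of the flow described under "How to test this?", the sketch below reads the PDF path from the first CLI argument and derives where a page image might be written. It uses only the standard library and does not render anything. The `page_image_path` helper and the `<stem>-page-<n>.png` naming scheme are assumptions made for illustration; the PR only states that images end up in the out directory.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical naming scheme: one PNG per page, named after the input file.
fn page_image_path(out_dir: &Path, pdf_path: &Path, page_index: usize) -> PathBuf {
    // Fall back to a generic stem if the file name is missing or not UTF-8.
    let stem = pdf_path
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("document");
    out_dir.join(format!("{}-page-{}.png", stem, page_index))
}

fn main() {
    // The first CLI argument is the PDF path, as in
    // `cargo run tests/docs/test-doc.pdf`.
    let pdf_path = match env::args().nth(1) {
        Some(arg) => PathBuf::from(arg),
        None => {
            eprintln!("usage: cargo run <path-to-pdf>");
            return;
        }
    };

    // Rendering is out of scope here; only the first target path is printed.
    let target = page_image_path(Path::new("out"), &pdf_path, 0);
    println!("page 0 would be written to {}", target.display());
}
```

Keeping the path logic in a small pure function makes it easy to cover with a unit test that does not need the Pdfium library to be present.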