Does the DoclingNodeParser process use the OCR library for parsing? #458

Closed Answered by vagenas
tahitimoon asked this question in Q&A
Hi @tahitimoon, for context, I assume you are referring to the DoclingReader & DoclingNodeParser from our LlamaIndex integration.

TLDR

While OCR may be invoked in certain cases, there is a broader workflow in place here.

Details

Here is what is happening under the hood in your snippet:

  1. DoclingReader uses Docling to parse the passed file into a DoclingDocument (see Architecture)
    • for PDF input, this step may use OCR, depending on the file and your pipeline options
  2. DoclingReader then serializes the DoclingDocument as JSON into a LlamaIndex Document
  3. DoclingNodeParser reads the DoclingDocument from the LlamaIndex Document and, using the configured chunker (by default HierarchicalChunker), p…
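
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the function name `parse_to_nodes` and its `do_ocr` parameter are hypothetical helpers introduced for illustration, and it assumes the `docling`, `llama-index-readers-docling`, and `llama-index-node-parser-docling` packages are installed. It also assumes `DoclingReader` accepts a custom `doc_converter`, so please check that against your installed version.

```python
def parse_to_nodes(source: str, do_ocr: bool = True):
    """Parse `source` with Docling and chunk it into LlamaIndex nodes.

    Hypothetical helper for illustration; not part of either library.
    """
    # Imported lazily so this module still loads where the optional
    # packages (assumed installed, see above) are absent.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from llama_index.node_parser.docling import DoclingNodeParser
    from llama_index.readers.docling import DoclingReader

    # Pipeline option for step 1: whether the PDF pipeline may run OCR.
    pipeline_options = PdfPipelineOptions(do_ocr=do_ocr)
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Steps 1-2: parse the file into a DoclingDocument and wrap its JSON
    # serialization in LlamaIndex Document objects.
    reader = DoclingReader(
        export_type=DoclingReader.ExportType.JSON,
        doc_converter=converter,  # assumption: custom converter is accepted
    )
    documents = reader.load_data(source)

    # Step 3: read the DoclingDocument back and chunk it into nodes
    # using the node parser's default chunker.
    node_parser = DoclingNodeParser()
    return node_parser.get_nodes_from_documents(documents)
```

For example, `parse_to_nodes("example.pdf", do_ocr=False)` (with a placeholder path) would ask the PDF pipeline to skip OCR, while the rest of the workflow stays the same.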

Answer selected by tahitimoon