Does the DoclingNodeParser process use the OCR library for parsing? #458

tahitimoon · 2024-11-28T09:30:36Z

Does the DoclingNodeParser process use the OCR library for parsing?

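from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_url)
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(docs)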

Answered by vagenas

vagenas · 2024-11-28T10:12:46Z

vagenas
Nov 28, 2024
Maintainer

Hi @tahitimoon, for context, I assume you are referring to the DoclingReader & DoclingNodeParser from our LlamaIndex integration.

TLDR

OCR may be invoked in certain cases, but that happens in DoclingReader while the file is parsed, not in DoclingNodeParser, which only chunks the already-parsed document.

Details

Here is what is happening under the hood in your snippet:

1. DoclingReader uses Docling to parse the passed file into a DoclingDocument (see Architecture)
   - for PDFs, this step may use OCR, depending on the file and your pipeline options
2. DoclingReader then serializes the DoclingDocument as JSON into a LlamaIndex Document
3. DoclingNodeParser reads the DoclingDocument from the LlamaIndex Document and, using the configured chunker (by default HierarchicalChunker), splits it into multiple chunks, which it yields as LlamaIndex BaseNodes
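
The three steps above can be sketched roughly as follows, including one way the pipeline options might be passed in. This is a minimal sketch rather than the integration's reference usage: it assumes `docling`, `llama-index-readers-docling` and `llama-index-node-parser-docling` are installed, and that the installed `DoclingReader` accepts a `doc_converter` argument (worth checking against your version). The helper `load_nodes` and its `do_ocr` flag are hypothetical names introduced only for illustration.

```python
def load_nodes(source: str, do_ocr: bool = True):
    """Parse `source` with Docling and return LlamaIndex nodes (hypothetical helper)."""
    # Imports live inside the function so the heavy dependencies
    # are only loaded when the helper is actually called.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from llama_index.node_parser.docling import DoclingNodeParser
    from llama_index.readers.docling import DoclingReader

    # Step 1: configure the PDF pipeline. Whether OCR runs is decided here,
    # at conversion time, not in the node parser.
    pipeline_options = PdfPipelineOptions(do_ocr=do_ocr)
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Step 2: the reader converts the file and stores the DoclingDocument as JSON.
    # Assumption: this DoclingReader version exposes a `doc_converter` field.
    reader = DoclingReader(
        export_type=DoclingReader.ExportType.JSON,
        doc_converter=converter,
    )
    docs = reader.load_data(source)

    # Step 3: the node parser only chunks the already-parsed document.
    return DoclingNodeParser().get_nodes_from_documents(docs)
```

Calling `load_nodes(file_url, do_ocr=False)` would then skip OCR for PDFs, provided the assumptions above hold for the installed versions.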


tahitimoon · 2024-11-28T14:06:09Z

tahitimoon
Nov 28, 2024
Author

Thank you for your response. I am currently parsing a PDF file. To customize the metadata in LlamaIndex, I am using the code below. With this approach, I do not see a way to customize the pipeline options either.

from llama_index.core import Document, VectorStoreIndex
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

# file_url, storage_context and text_embed_model are defined elsewhere.
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_url)
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(docs)

# Re-wrap each chunk as a Document that carries only the metadata of interest.
documents = []
for node in nodes:
    # Provenance of the first doc item in the chunk.
    prov = node.metadata["doc_items"][0]["prov"][0]
    metadata = {
        "page": str(prov["page_no"]),
        "bbox": prov["bbox"],
        "headings": node.metadata.get("headings", []),
    }
    llama_document = Document(
        text=node.text,
        metadata=metadata,
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )
    documents.append(llama_document)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=text_embed_model,
)

Do the approach above and the approach below produce the same chunking? I am currently developing on an Apple M3 chip.

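from llama_index.core import VectorStoreIndex
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

# SOURCE and EMBED_MODEL are defined elsewhere.
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()

# The node parser runs as a transformation while the index is built.
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    embed_model=EMBED_MODEL,
)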
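
The metadata lookup in the first snippet of this post indexes `doc_items[0]["prov"][0]` directly, which would raise an error for any chunk whose first item has no provenance. A more defensive variant might look like the sketch below. It assumes only the metadata layout visible in that snippet (`doc_items`, `prov`, `page_no`, `bbox`, `headings`). The function name `extract_chunk_metadata` is hypothetical, and unlike the original it uses the first doc item that actually has provenance rather than always the first item.

```python
from typing import Any, Dict


def extract_chunk_metadata(node_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return page, bbox and headings from a node's metadata, tolerating gaps."""
    page, bbox = None, None
    # Use the first doc item that actually carries provenance information.
    for item in node_metadata.get("doc_items") or []:
        prov = item.get("prov") or []
        if prov:
            page_no = prov[0].get("page_no")
            page = str(page_no) if page_no is not None else None
            bbox = prov[0].get("bbox")
            break
    return {
        "page": page,
        "bbox": bbox,
        "headings": node_metadata.get("headings") or [],
    }


# Hand-written sample shaped like the metadata accessed in the snippet above.
sample = {
    "doc_items": [
        {"prov": []},
        {"prov": [{"page_no": 3, "bbox": {"l": 1.0, "t": 2.0, "r": 3.0, "b": 4.0}}]},
    ],
    "headings": ["Intro"],
}
chunk_metadata = extract_chunk_metadata(sample)
```

In the loop from the first snippet, `metadata = extract_chunk_metadata(node.metadata)` could replace the inline dictionary.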
