Does the DoclingNodeParser process use the OCR library for parsing? #458
Does the DoclingNodeParser process use the OCR library for parsing?

    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    docs = reader.load_data(file_url)
    node_parser = DoclingNodeParser()
    nodes = node_parser.get_nodes_from_documents(docs)
Replies: 2 comments
Thank you for your response. I am currently parsing a PDF file. To better customize the metadata in LlamaIndex, this is how I am using it now. It seems there is no way to customize the pipeline options here either.

    from llama_index.core import Document, VectorStoreIndex
    from llama_index.node_parser.docling import DoclingNodeParser
    from llama_index.readers.docling import DoclingReader

    # file_url, storage_context and text_embed_model are defined elsewhere.
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    docs = reader.load_data(file_url)
    node_parser = DoclingNodeParser()
    nodes = node_parser.get_nodes_from_documents(docs)

    # Rebuild each node as a Document that carries only the custom metadata.
    documents = []
    for node in nodes:
        metadata = {
            "page": str(node.metadata["doc_items"][0]["prov"][0]["page_no"]),
            "bbox": node.metadata["doc_items"][0]["prov"][0]["bbox"],
            "headings": node.metadata.get("headings", []),
        }
        llama_document = Document(
            text=node.text,
            metadata=metadata,
            text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
        )
        documents.append(llama_document)

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        embed_model=text_embed_model,
    )

Is the segmentation result of the method above the same as that of the method below? I am currently developing on an M3 chip.

    from llama_index.core import VectorStoreIndex
    from llama_index.node_parser.docling import DoclingNodeParser
    from llama_index.readers.docling import DoclingReader

    # SOURCE and EMBED_MODEL are defined elsewhere.
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    node_parser = DoclingNodeParser()
    index = VectorStoreIndex.from_documents(
        documents=reader.load_data(SOURCE),
        transformations=[node_parser],
        embed_model=EMBED_MODEL,
    )
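A side note on the metadata lookup in the first snippet: indexing with [0]["prov"][0] would raise an IndexError for any chunk whose first item has no provenance entry. Below is a minimal sketch of a more defensive lookup, using only plain dicts. The helper name extract_chunk_metadata and the sample values are hypothetical. The key names (doc_items, prov, page_no, bbox, headings) are taken from the snippet, and the bbox layout in the sample is an assumption.

```python
from typing import Any


def extract_chunk_metadata(node_metadata: dict[str, Any]) -> dict[str, Any]:
    """Pull page, bbox and headings out of a chunk's metadata dict.

    Assumes the layout accessed in the snippet above:
    metadata["doc_items"][i]["prov"][j] holds "page_no" and "bbox".
    """
    page, bbox = None, None
    # Use the first item that carries provenance instead of indexing [0][0].
    for item in node_metadata.get("doc_items", []):
        prov = item.get("prov") or []
        if prov:
            page = prov[0].get("page_no")
            bbox = prov[0].get("bbox")
            break
    return {
        "page": str(page) if page is not None else "",
        "bbox": bbox,
        "headings": node_metadata.get("headings", []),
    }


# Hypothetical metadata, shaped like the fields accessed in the snippet above.
sample = {
    "doc_items": [
        {"prov": []},
        {"prov": [{"page_no": 3, "bbox": {"l": 10.0, "t": 20.0, "r": 30.0, "b": 5.0}}]},
    ],
    "headings": ["Introduction"],
}
result = extract_chunk_metadata(sample)
print(result)
```

In the loop over nodes, the call would be extract_chunk_metadata(node.metadata), with the result passed to Document as before.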
Hi @tahitimoon, for context, I assume you are referring to the DoclingReader & DoclingNodeParser from our LlamaIndex integration.

TLDR

While OCR may be invoked in certain cases, there is a broader workflow in place here.

Details

Here is what is happening under the hood in your snippet:

1. DoclingReader uses Docling to parse the passed file into a DoclingDocument (see Architecture)
2. DoclingReader then serializes the DoclingDocument as JSON into a LlamaIndex Document
3. DoclingNodeParser reads the DoclingDocument from the LlamaIndex Document and, using the configured chunker (by default HierarchicalChunker), p…
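
The handoff between the reader and the node parser described in the steps above can be illustrated with plain JSON. This is only a toy sketch using the standard library. The dict stands in for a DoclingDocument, and the labels, field names and the chunk_by_heading function are hypothetical, not part of Docling or LlamaIndex. It is meant to show that the chunking step works on the already parsed, serialized document and does not touch the original file again.

```python
import json

# Toy stand-in for a parsed document; a real DoclingDocument is much richer.
parsed = {
    "items": [
        {"label": "section_header", "text": "Setup"},
        {"label": "text", "text": "Install the package."},
        {"label": "text", "text": "Run the tests."},
        {"label": "section_header", "text": "Usage"},
        {"label": "text", "text": "Call the reader."},
    ]
}

# Reader side: the parsed document is stored as a JSON string.
serialized = json.dumps(parsed)


def chunk_by_heading(doc_json: str) -> list[dict]:
    """Node parser side: load the JSON back and emit one chunk per text
    item, tagged with the heading it appears under."""
    doc = json.loads(doc_json)
    chunks, heading = [], None
    for item in doc["items"]:
        if item["label"] == "section_header":
            heading = item["text"]  # remember the current section
        else:
            chunks.append(
                {"text": item["text"], "headings": [heading] if heading else []}
            )
    return chunks


chunks = chunk_by_heading(serialized)
for chunk in chunks:
    print(chunk)
```

Under this simplification, any OCR would belong to the first step, where the file is parsed, while the chunking step only restructures text that has already been extracted.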