[BUG] Information extracted from table/image using Azure Document Intelligence API is not reflected in GraphRAG input #594

Open
hide212131 opened this issue Jan 3, 2025 · 0 comments
Labels
bug Something isn't working

Description

When a PDF document with the structure below is read with Azure Document Intelligence, files are created in the GraphRAG input folder for Paragraph 1 and Paragraph 2, but no file is created for the table or for the image description.

Paragraph 1
Table
Paragraph 2
Image
...

Reproduction steps

1. In Retrieval settings > GraphRAG Collection > File loader, select `Azure AI Document Intelligence (figure+table extraction)`.
2. Upload a PDF file containing a table to the GraphRAG collection.
3. Execute a query related to the table.

Screenshots

No response

Logs

No response

Browsers

No response

OS

No response

Additional information

`AzureAIDocumentIntelligenceLoader` stores text, table, and image content separately in the Document, without duplication, while `GraphRAGIndexingPipeline` outputs only the text.

I think it would be more appropriate for the text to be indexed to use a format like the one in `ktem_app_data/markdown_cache_dir`, where tables and other elements are expanded inline.
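To illustrate the idea, here is a minimal sketch of the difference between writing only the text and expanding tables and image descriptions inline. It does not use the project's real classes: `Doc`, the `"type"` metadata key, and `merge_inline` are hypothetical stand-ins, and the sketch assumes the loader output is already in reading order.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for one element returned by the loader."""
    text: str
    metadata: dict = field(default_factory=dict)


def text_only(docs):
    """Mimics the reported behavior: table and image elements are dropped."""
    kept = [d.text for d in docs if d.metadata.get("type", "text") == "text"]
    return "\n\n".join(kept)


def merge_inline(docs):
    """Expand every element inline, assuming docs are in reading order."""
    parts = []
    for doc in docs:
        kind = doc.metadata.get("type", "text")  # "type" key is an assumption
        if kind == "image":
            # Mark image descriptions so they stay distinguishable in the text.
            parts.append(f"[Image] {doc.text}")
        else:
            # Plain text and tables (already rendered as markdown) go in as-is.
            parts.append(doc.text)
    return "\n\n".join(p.strip() for p in parts if p.strip())


docs = [
    Doc("Paragraph 1"),
    Doc("| a | b |\n|---|---|\n| 1 | 2 |", {"type": "table"}),
    Doc("Paragraph 2"),
    Doc("A bar chart of monthly sales", {"type": "image"}),
]

print(text_only(docs))     # only the two paragraphs
print(merge_inline(docs))  # paragraphs, table, and image description
```

With input built this way, the file written to the GraphRAG input folder would contain the table and the image description next to the surrounding paragraphs, so entity extraction could see them.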

@hide212131 hide212131 added the bug Something isn't working label Jan 3, 2025