Enhancement: Recognize Split words with - at the end of a line #120

Open
GetGit-M-Link opened this issue Jan 3, 2025 · 0 comments

Docling does not recognize that a word ending in a - at the end of a line is often the first half of a word that continues on the next line, or is at least strongly connected to it.
My professor did this a lot in the script I am currently working through, so I wrote a script for myself that checks all text ending in -, but I think this is worth looking into in Docling itself.
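
A minimal sketch of what such a check could look like. This is not the script mentioned above, and it does not use Docling's API: it works on a plain list of text lines. The function name `merge_hyphenated` and the lowercase heuristic are assumptions made for illustration only.

```python
import re

# Hyphen-like characters that may end a line:
# ASCII hyphen, soft hyphen (U+00AD), Unicode hyphen (U+2010).
_TRAILING_HYPHEN = re.compile(r"[-\u00ad\u2010]$")


def merge_hyphenated(lines):
    """Join each line that ends in a hyphen with the line that follows it.

    Heuristic (an assumption, not Docling behaviour): if the next line
    starts with a lowercase letter, the hyphen is treated as a line-break
    hyphen and dropped; otherwise it is kept, as in a real compound.
    """
    merged = []
    carry = ""  # pending line that ended in a hyphen
    for raw in lines:
        line = raw.strip()
        if carry:
            if line[:1].islower():
                # Same word split across lines: drop the hyphen.
                line = carry[:-1] + line
            else:
                # Likely a genuine hyphenated compound: keep the hyphen.
                line = carry + line
            carry = ""
        if _TRAILING_HYPHEN.search(line):
            carry = line  # decide once the next line is seen
        else:
            merged.append(line)
    if carry:
        # Input ended on a hyphenated line; emit it unchanged.
        merged.append(carry)
    return merged


print(merge_hyphenated(["Die Bildschirm-", "aufnahme zeigt", "ein Beispiel."]))
```

The lowercase check is deliberately simple and will get some cases wrong, for example German suspended hyphens such as "Bild-" followed by "und Tonaufnahme", so a real fix would probably need a dictionary or language-aware rule.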

Here is an example:
Bildschirmfoto_20250103_104437

Docling correctly recognizes the first part as a caption (I exported to JSON to get an idea of what is going on):
Bildschirmfoto_20250103_104900

Bildschirmfoto_20250103_104957
But the second part is read as normal text rather than as part of the caption, and it is not connected to the first part.
