pixFindBaselines() returns |baselines| != |baseline_endpoint_pairs| #766
Comments
I agree that this could/should be more robust. Can you post an image you use (with text lines essentially horizontal), that reproduces the problem? |
Sorry, I should have attached one to begin with. This is a cleaned, redacted image, but it demonstrates the issue as reported. The three baselines are correctly identified, but the returned endpoints don't include the endpoints for the middle line. test1.png Here is some JSON test output I hacked into my code. {
"ys": [
1750.0,
1792.0,
1831.0
],
"endpoints": [
[
124.0,
1750.0,
1164.0,
1750.0
],
[
124.0,
1831.0,
1164.0,
1831.0
]
]
} |
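The mismatch in the JSON output above can be checked mechanically. The following is a minimal sketch, assuming the output has already been parsed into plain Python lists; the function name `baselines_without_endpoints` and the `tol` parameter are hypothetical and are not part of Leptonica.

```python
def baselines_without_endpoints(ys, endpoints, tol=0.0):
    """Return the baseline y-values that have no endpoint pair at that y.

    ys        -- baseline y-locations, as in the "ys" array above
    endpoints -- [x1, y1, x2, y2] entries, as in the "endpoints" array above
    tol       -- hypothetical tolerance on the y match (0 means exact)
    """
    missing = []
    for y in ys:
        # A pair belongs to a baseline when both of its y-values match it.
        has_pair = any(
            abs(y1 - y) <= tol and abs(y2 - y) <= tol
            for _x1, y1, _x2, y2 in endpoints
        )
        if not has_pair:
            missing.append(y)
    return missing


# Values copied from the JSON output reported in this comment.
report = {
    "ys": [1750.0, 1792.0, 1831.0],
    "endpoints": [
        [124.0, 1750.0, 1164.0, 1750.0],
        [124.0, 1831.0, 1164.0, 1831.0],
    ],
}
missing = baselines_without_endpoints(report["ys"], report["endpoints"])
print(missing)  # [1792.0]
```

For the reported data, the only baseline without endpoints is the middle one at y = 1792.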
Thanks for finding this problem. |
The changes still keep the conditional that tries (IIUC) to match up the detected endpoints to the detected baseline, and this match up might still fail, so in theory this could still happen for some inputs, right? I'll be able to run the new code over my corpus within the next few days and I'll let you know if this fix generalizes properly. Thank you. |
You are correct that it may be still theoretically possible to join the lines after they've initially been located in |
I confirm that the issue with the previous test image is fixed. But, for this test2.png the function returns more than one endpoint pair for the middle line, so again the number of endpoint pairs does not match the number of baselines. |
That's actually the design, because sometimes you can have a textline with a large gap, such that two separate textblocks are found on it. That is the case, for example, in the case of the pedante.079.jpg image in baseline_reg.c, where two of the lines of text have separate text blocks. So a 1:1 correspondence of baseline locations with text segments is not required. However, in the situation you indicated with test2.png, a bogus textbox fragment is generated above the third line, which has a base close to the second line. I've added a filter to remove these small-height boxes, so in your example it is now eliminated and there are only the 3 correct line segments associated with the 3 actual baselines. A new commit implements this. |
Thank you. Understood, that makes sense for multi-column layouts. Both Here is test3.png. The runt last line returns no endpoints from the function. Friendly side note: git expects a commit message to include a blank line between the "subject" (first) row and the rest of the commit message body. Without it, |
Thank you for the git tip! The morphological opening |
As I wrote earlier, I do think it's reasonable to filter out runts. But, if a baseline is recognized as valid it should have (at least one) pair of endpoints returned. To have a line without endpoints does not make sense to me. |
* This is in relation to Issue #766. * If no textbox is found, we do not know the end points of the baseline. It is almost certainly very short, so it is removed from output. * Change order of operation: for each baseline, save all textboxes that describe text at that y-location. There can be multiple textboxes for each baseline if the line of text has large horizontal breaks. * As a result of this change, all reported baselines have x-value endpoints of text that can optionally be returned.
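The changed order of operation described in this commit message can be illustrated roughly as follows. This is only a sketch of the idea, not the code in baseline.c: the box format `(x, top, w, h)`, the `slack` parameter, and the rule used to decide that a box sits on a baseline are all assumptions made for the example.

```python
def attach_boxes_to_baselines(baseline_ys, boxes, slack=2):
    """Collect x-extents of textboxes per baseline; drop baselines with none.

    baseline_ys -- y-locations of the detected baselines
    boxes       -- textboxes as (x, top, w, h) tuples (assumed format)
    slack       -- hypothetical vertical tolerance, in pixels

    Returns a list of (y, [(x_left, x_right), ...]) entries.
    """
    result = []
    for y in baseline_ys:
        # Assumed matching rule: the baseline y lies within the vertical
        # extent of the box, widened by `slack` pixels on each side.
        segments = sorted(
            (x, x + w - 1)
            for (x, top, w, h) in boxes
            if top - slack <= y <= top + h - 1 + slack
        )
        if segments:
            # A line with large horizontal breaks keeps several segments.
            result.append((y, segments))
        # Otherwise no textbox was found, the endpoints are unknown,
        # and the baseline is removed from the output.
    return result


baselines = [100, 140, 180]
boxes = [(10, 80, 200, 22), (300, 81, 150, 21), (10, 120, 400, 22)]
lines = attach_boxes_to_baselines(baselines, boxes)
print(lines)
```

With these made-up inputs, the first baseline keeps two segments, the second keeps one, and the third is dropped because no textbox covers it, so every reported baseline has at least one pair of x-value endpoints.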
I agree with you. Code modified appropriately. |
I tested on 300+ images and it works much better. Most of the mismatches occurred on dual-column pages as intended, but there were a handful of cases where visually I could see no reason for multiple endpoint segments: test4.png. The baseline elimination branch was taken quite often. I guess you could chalk it up to bad typesetting that there are a lot of widows(*) in this document. Also, in pages featuring dialogue-like texts, such as plays or interviews, where each response is just a sentence or two, it's quite common to have many widows on a page. On some pages I had as many as 4 baselines eliminated because of this. But I can always tweak the threshold and, for a morphology-only solution, this is a reasonable compromise. Finally, I use an obvious hack to easily get the vertical extent of each line (sans ascender/descender), which is to combine the baselines returned from running (*) That's the proper typographic term for a line consisting of a single word at the end of a paragraph, I learned. |
Here's a suggestion for a heuristic that might better handle widows:
I think this would cover a high percentage of cases for widows, in most typical page layouts. Would you be interested in a PR for consideration once we wrap things up here? It shouldn't be that difficult to implement. |
Thank you for the investigation. You make a good suggestion to reduce the number of widows (are 'orphans' just widows on a new page?). The implementation could be a bit tricky if it were to revise pixFindBaselines(), because the morphological filter with the opening will destroy small widows. Also, I would not want to change the interface. Let me think about this. I might make a second function with at least one extra parameter, that would share much of pixFindBaselines() but would be gentler on the widows. If you could, please put up a couple of pages with lots of widows for experimentation. |
Here's a synthetic page at 10pt, a4paper, 300dpi. Multiple widows are not detected as baselines but apparently not due to the new "short baseline removal" (SBR). test5.png And a real-world image in a dialog/interview format which triggers 5 SBRs. I can't swear this or previous test images are really at 300DPI. test6.png "Orphans" - yes, exactly! Note that Update: but... the widows do align with one another. There's enough signal for that in pages like test6. |
Those are good test images. test5 is 300 ppi on A4; test6 is 200 ppi on 11 x 8.5. You can tell from the height. At 300 ppi, an 11.69 inch tall page is h = 3508. At 200 ppi, an 11 inch tall page is 2200. |
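The arithmetic behind that estimate is just page height in inches times resolution. Here is a small sketch of it, assuming only two paper sizes and a handful of common scan resolutions; the names `page_height_px` and `guess_resolution` and the candidate lists are made up for illustration.

```python
MM_PER_INCH = 25.4

# Portrait page heights in inches (A4 is 297 mm tall, US Letter is 11 in).
PAGE_HEIGHTS_IN = {"A4": 297 / MM_PER_INCH, "US Letter": 11.0}
COMMON_PPI = (150, 200, 300, 400, 600)  # assumed candidate resolutions


def page_height_px(height_in, ppi):
    """Pixel height of a page that is height_in inches tall at ppi."""
    return round(height_in * ppi)


def guess_resolution(height_px):
    """Return the (paper, ppi) pair whose pixel height is closest."""
    candidates = [
        (abs(page_height_px(h_in, ppi) - height_px), name, ppi)
        for name, h_in in PAGE_HEIGHTS_IN.items()
        for ppi in COMMON_PPI
    ]
    _diff, name, ppi = min(candidates)
    return name, ppi


print(page_height_px(PAGE_HEIGHTS_IN["A4"], 300))  # 3508
print(page_height_px(11.0, 200))                   # 2200
print(guess_resolution(3508))                      # ('A4', 300)
print(guess_resolution(2200))                      # ('US Letter', 200)
```

A guess like this only holds if the scan covers the full page; a cropped image would need the resolution from metadata instead.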
See Commit 4e64214. |
This is another big improvement for my use-case (mostly clean scans, but many short lines). I tested on 200 images or so and though I frequently get split baselines, these seem primarily to be the page header being detected as multiple columns (left page num, and right caption), which is by design. test7.png is a real-world example in which two ordinary full-width lines are detected as multiple columns by |
Thanks! I can 'fix' this by changing l.304 to read Note that if One thing that has me puzzled is that before changing bh in l.304, running this test script:
and getting this output:
it generates this image for junk1.png, and the last page image with the underlines doesn't have the two bogus ones. Don't know why. |
There are obvious limits to what's attainable using a classical morphological approach. It's up to you to say when you think you've reached them. I think you may just have 😃 . One can push this further by using more global information, like inter-line mutual alignment and so on, but users have everything required to implement a custom solution and so it's reasonable for leptonica to stick to basic implementations. Most developers who need higher accuracy for computer vision tasks would today reach for AI models, anyway. Do you feel like this is a good place to stop? The fixes to |
Yes, I agree this is a good place to stop. As you say, morphological filtering methods without using feedback on things like interline distance remain brittle. Thank you for your interest and help. I hope you find it useful. |
You've been great, I really appreciate all the attention you gave to this. 👍 |
* remove components up to 3 pixels high (instead of up to 2) at 4x reduction
Leptonica is a wonderfully useful library, thank you so much for all your hard and valuable work.
I am using pixFindBaselines and passing in the optional ppta arg in order to get the approximate endpoints for each baseline. Recently, I discovered that the number of entries in ppta (the endpoint pairs) may be smaller than the number of baselines returned in na. This was surprising to me.
I suspect the culprit code is here:
leptonica/src/baseline.c
Lines 245 to 255 in d5ea1db
This does not guarantee that an item will be added to pta for each baseline, and it sometimes fails in practice on real-world images.
Best Wishes,