pixFindBaselines() returns |baselines| != |baseline_endpoint_pairs| #766

Closed
AnonymousCoward128746 opened this issue Jan 12, 2025 · 22 comments

@AnonymousCoward128746

AnonymousCoward128746 commented Jan 12, 2025

Leptonica is a wonderfully useful library, thank you so much for all your hard and valuable work.

I am using pixFindBaselines and passing in the optional ppta arg in order to get the approximate endpoints for each baseline. Recently, I discovered that the number of entries in ppta (the endpoint pairs) may be smaller than the number of baselines returned in na. This was surprising to me.

I suspect the culprit code is here:

leptonica/src/baseline.c

Lines 245 to 255 in d5ea1db

    nloc = numaGetCount(naloc);
    nbox = boxaGetCount(boxa3);
    for (i = 0; i < nbox; i++) {
        boxaGetBoxGeometry(boxa3, i, &bx, &by, &bw, &bh);
        for (j = 0; j < nloc; j++) {
            numaGetIValue(naloc, j, &locval);
            if (L_ABS(locval - (by + bh)) > 25)
                continue;
            ptaAddPt(pta, bx, locval);
            ptaAddPt(pta, bx + bw, locval);
            break;

This does not guarantee that an item will be added to pta for each baseline, and it sometimes fails in practice on real-world images.
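
To make the failure mode concrete, here is a minimal standalone sketch that mirrors the quoted loop with plain arrays instead of Leptonica types. The `TextBox` struct and the function name are hypothetical, introduced only for illustration, and the tolerance is passed in rather than hard-coded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a text-block bounding box (not a Leptonica type). */
typedef struct { int x, y, w, h; } TextBox;

/*
 * Mirrors the quoted loop: each box contributes one endpoint pair to the
 * first baseline within `tol` of its bottom edge (y + h).
 * pairs_per_baseline[j] receives the number of pairs attached to baseline j.
 * Returns the total number of pairs added.
 */
int match_boxes_to_baselines(const TextBox *boxes, int nbox,
                             const int *baselines, int nloc,
                             int tol, int *pairs_per_baseline)
{
    int total = 0;
    assert(nbox >= 0 && nloc >= 0);
    for (int j = 0; j < nloc; j++)
        pairs_per_baseline[j] = 0;
    for (int i = 0; i < nbox; i++) {
        int bottom = boxes[i].y + boxes[i].h;
        for (int j = 0; j < nloc; j++) {
            if (abs(baselines[j] - bottom) > tol)
                continue;          /* this baseline is too far from the box */
            pairs_per_baseline[j]++;
            total++;
            break;                 /* at most one baseline per box */
        }
    }
    return total;
}
```

Because the outer loop runs over boxes, a baseline gets a pair only if some box bottom happens to land near it. With three baselines and only two boxes (for example, if two text lines were merged into one tall box), one baseline necessarily ends up with no endpoints.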

Best Wishes,

@AnonymousCoward128746 AnonymousCoward128746 changed the title pixFindBaselines() results dubiously returns |baselines| != |baseline_endpoint_pairs| pixFindBaselines() dubiously returns |baselines| != |baseline_endpoint_pairs| Jan 12, 2025
@DanBloomberg
Owner

I agree that this could/should be more robust.

Can you post an image you use (with text lines essentially horizontal), that reproduces the problem?

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 13, 2025

Sorry, I should have attached one to begin with. This is a cleaned, redacted image, but it demonstrates the issue as reported. The three baselines are correctly identified, but the returned endpoints don't include the endpoints for the middle line. test1.png

Here is some JSON test output I hacked into my code. ys are the baseline y coordinates, and endpoints is a list of tuples (x0,y0,x1,y1) for each pair of endpoints returned by the function. As you can see, 3 baselines, but only 2 endpoint tuples.

{
  "ys": [
    1750.0,
    1792.0,
    1831.0
  ],
  "endpoints": [
    [
      124.0,
      1750.0,
      1164.0,
      1750.0
    ],
    [
      124.0,
      1831.0,
      1164.0,
      1831.0
    ]
  ]
}
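
A check like the one I ran by eye on this output can be sketched in a few lines of C. The `EndpointPair` struct and the function name below are hypothetical helpers, not part of Leptonica; the sketch assumes the y values of the pairs are copied from the baselines, so exact comparison is adequate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one returned endpoint pair (x0, y0, x1, y1). */
typedef struct { double x0, y0, x1, y1; } EndpointPair;

/*
 * Counts baselines that have no endpoint pair at the same y value.
 * If `first_missing` is not NULL it receives the index of the first such
 * baseline, or -1 when every baseline is covered.
 */
int count_uncovered_baselines(const double *ys, int nys,
                              const EndpointPair *pairs, int npairs,
                              int *first_missing)
{
    int missing = 0;
    assert(nys >= 0 && npairs >= 0);
    if (first_missing != NULL)
        *first_missing = -1;
    for (int i = 0; i < nys; i++) {
        int found = 0;
        for (int k = 0; k < npairs; k++) {
            if (pairs[k].y0 == ys[i]) {   /* y values are whole numbers here */
                found = 1;
                break;
            }
        }
        if (!found) {
            if (first_missing != NULL && *first_missing < 0)
                *first_missing = i;
            missing++;
        }
    }
    return missing;
}
```

Fed the values above, it reports one uncovered baseline, at index 1 (y = 1792).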

DanBloomberg added a commit that referenced this issue Jan 14, 2025
@DanBloomberg
Owner

DanBloomberg commented Jan 14, 2025

Thanks for finding this problem.
It should now be fixed.
I also added a couple more tests in baseline_reg.

@AnonymousCoward128746
Author

The changes still keep the conditional that tries (IIUC) to match up the detected endpoints to the detected baseline, and this match up might still fail, so in theory this could still happen for some inputs, right?

I'll be able to run the new code over my corpus within the next few days and I'll let you know if this fix generalizes properly.

Thank you.

@DanBloomberg
Owner

You are correct that it may still be theoretically possible for lines to be joined after they've initially been located in naloc.
However, I believe this will now be a rare event, because the only morphological operators used to join line segments are strictly horizontal. It was the small vertical closing at 4x reduction that was responsible for joining those two lines.

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 14, 2025

I confirm that the issue with the previous test image is fixed.

But, for this test2.png the function returns more than one endpoint pair for the middle line, so again the number of endpoint pairs does not match the number of baselines.

@AnonymousCoward128746 AnonymousCoward128746 changed the title pixFindBaselines() dubiously returns |baselines| != |baseline_endpoint_pairs| pixFindBaselines() returns |baselines| != |baseline_endpoint_pairs| Jan 14, 2025
@DanBloomberg
Owner

That's actually the design, because sometimes you can have a textline with a large gap, such that two separate textblocks are found on it. That is the case, for example, with the pedante.079.jpg image in baseline_reg.c, where two of the lines of text have separate text blocks. So a 1:1 correspondence of baseline locations with text segments is not required.

However, in the situation you indicated with test2.png, a bogus textbox fragment is generated above the third line, which has a base close to the second line. I've added a filter to remove these small-height boxes, so in your example it is now eliminated and there are only the 3 correct line segments associated with the 3 actual baselines.
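
As a rough illustration of that kind of filter (a sketch only, not the code in the commit): the `TextBox` struct, the function name, and the `min_h` threshold are all hypothetical, and the sketch works on a plain array rather than a Boxa.

```c
#include <assert.h>

/* Hypothetical stand-in for a text-block bounding box (not a Leptonica type). */
typedef struct { int x, y, w, h; } TextBox;

/*
 * Compacts `boxes` in place, keeping only boxes strictly taller than `min_h`.
 * Returns the number of boxes kept; their relative order is preserved.
 */
int drop_short_boxes(TextBox *boxes, int n, int min_h)
{
    int kept = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (boxes[i].h > min_h)
            boxes[kept++] = boxes[i];
    }
    return kept;
}
```

Running the filter before boxes are matched to baselines means a thin fragment can no longer claim a baseline that belongs to a neighboring line.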

A new commit implements this.

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 14, 2025

Thank you. Understood, that makes sense for multi-column layouts.

Both test1.png and test2.png pass now.

Here is test3.png. The runt last line returns no endpoints from the function.

Friendly side note: git expects a commit message to include a blank line between the "subject" (first) row and the rest of the commit message body. Without it, git log output suffers.

@DanBloomberg
Owner

Thank you for the git tip!

The morphological opening o30.1 in line 222 is what removes runt lines that are less than 120 pixels long (we're working at 4x reduction). I'm OK with this. Note if you were to change that to o20.1, the endpoints of last line would be returned.
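
The arithmetic behind those numbers, as a tiny sketch (the function name is hypothetical): a horizontal opening with a brick N pixels wide removes runs narrower than N in the reduced image, which corresponds to N times the reduction factor at full resolution.

```c
#include <assert.h>

/*
 * Approximate full-resolution length below which a horizontal run is removed
 * by an opening with a `sel_width`-pixel-wide brick applied at
 * `reduction`x reduction.
 */
int min_surviving_length(int sel_width, int reduction)
{
    assert(sel_width > 0 && reduction > 0);
    return sel_width * reduction;
}
```

So o30.1 at 4x reduction gives the 120 pixel cutoff, and o20.1 would lower it to 80 pixels.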

@AnonymousCoward128746
Author

As I wrote earlier, I do think it's reasonable to filter out runts. But if a baseline is recognized as valid, it should have at least one pair of endpoints returned. To have a line without endpoints does not make sense to me.

DanBloomberg added a commit that referenced this issue Jan 15, 2025
* This is in relation to Issue #766.
* If no textbox is found, we do not know the end points of the baseline.
  It is almost certainly very short, so it is removed from output.
* Change order of operation: for each baseline, save all textboxes that
  describe text at that y-location.  There can be multiple textboxes
  for each baseline if the line of text has large horizontal breaks.
* As a result of this change, all reported baselines have x-value
  endpoints of text that can optionally be returned.
@DanBloomberg
Owner

I agree with you. Code modified appropriately.
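
The changed order of operations described in the commit message can be sketched as follows. This is a simplified model on plain arrays, not the actual code in baseline.c; the struct and function names and the `tol` parameter are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a text box and an output segment. */
typedef struct { int x, y, w, h; } TextBox;
typedef struct { int x0, x1, y; } Segment;

/*
 * For each baseline, emits one segment per box whose bottom edge lies within
 * `tol` of it. Baselines with no such box are dropped: `baselines` is
 * compacted in place and the new count is returned. Segments are written to
 * `segs` (capacity max_segs, at least nloc * nbox) and their count to *nsegs.
 */
int attach_boxes(int *baselines, int nloc,
                 const TextBox *boxes, int nbox, int tol,
                 Segment *segs, int max_segs, int *nsegs)
{
    int kept = 0;
    assert(nsegs != NULL);
    assert(max_segs >= nloc * nbox);
    *nsegs = 0;
    for (int j = 0; j < nloc; j++) {
        int found = 0;
        for (int i = 0; i < nbox; i++) {
            if (abs(baselines[j] - (boxes[i].y + boxes[i].h)) > tol)
                continue;
            Segment s = { boxes[i].x, boxes[i].x + boxes[i].w, baselines[j] };
            segs[(*nsegs)++] = s;
            found++;
        }
        if (found > 0)                   /* keep only baselines with text */
            baselines[kept++] = baselines[j];
    }
    return kept;
}
```

With the outer loop running over baselines, every baseline that survives has at least one segment, and a line with a large horizontal break simply yields several segments at the same y.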

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 15, 2025

I tested on 300+ images and it works much better. Most of the mismatches occurred on dual-column pages as intended, but there were a handful of cases where visually I could see no reason for multiple endpoint segments: test4.png.

The baseline elimination branch was taken quite often. I guess you could chalk it up to bad typesetting that there are a lot of widows(*) in this document. But also, in pages featuring dialogue-like texts, such as plays or interviews, where each response is just a sentence or two, it's quite common to have many widows on a page. On some pages I had as many as 4 baselines eliminated because of this. But I can always tweak the threshold and, for a morphology-only solution, this is a reasonable compromise.

Finally, I use an obvious hack to easily get the vertical extent of each line (sans ascender/descender), which is to combine the baselines returned from running pixFindBaselines on the image and its top-down mirrored version. Surprisingly, mirroring the image vertically often changes the number of detected baselines. What are your thoughts on this?
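
For reference, the combining step looks roughly like this. It is a simplified sketch of my hack on plain integer arrays; the function names and the `max_gap` parameter are hypothetical. Because the two runs can return different counts, each baseline is paired with the nearest mapped-back row above it, and left unpaired if none is close enough.

```c
#include <assert.h>

/* Maps a row found in the top-down mirrored image back to the original. */
int unflip_row(int y_flipped, int height)
{
    assert(height > 0 && y_flipped >= 0 && y_flipped < height);
    return height - 1 - y_flipped;
}

/*
 * For one baseline of the original image, returns the closest mapped-back
 * row from the mirrored run that lies above it (smaller y) and within
 * `max_gap` rows, or -1 if there is none.
 */
int top_for_baseline(int baseline, const int *flipped, int nflipped,
                     int height, int max_gap)
{
    int best = -1;
    for (int k = 0; k < nflipped; k++) {
        int top = unflip_row(flipped[k], height);
        if (top >= baseline || baseline - top > max_gap)
            continue;              /* below the baseline, or too far above */
        if (best < 0 || top > best)
            best = top;            /* prefer the row nearest the baseline */
    }
    return best;
}
```

The line's vertical extent is then the span from the returned row down to the baseline; a return value of -1 marks a baseline that the mirrored run did not confirm.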

(*) That's the proper typographic term for a line consisting of a single word at the end of a paragraph, I learned.

@AnonymousCoward128746
Author

Here's a suggestion for a heuristic that might better handle widows:

  1. If sufficient lines are found on the page, the median of the y-coord differences between baselines should provide a good estimate on line spacing ("leading"). And you can verify it by ensuring the variance of the samples is small.
  2. If a short line to be pruned falls within a leading+/-delta from a baseline, and one of the endpoints matches one of the endpoints from the same neighbor lines, you could use a second, more lenient threshold for pruning false positives.

I think this would cover a high percentage of cases for widows, in most typical page layouts.
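
A rough sketch of the two steps, to show the idea rather than a finished implementation. All names and tolerances are hypothetical; the variance check from step 1 is omitted for brevity, and step 2 compares only left endpoints, which is a simplification of "one of the endpoints matches".

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Step 1: median gap between consecutive baselines (ys sorted ascending).
 * Returns -1 if fewer than 3 baselines are given or allocation fails.
 */
int median_leading(const int *ys, int n)
{
    if (n < 3)
        return -1;
    int *gaps = malloc((size_t)(n - 1) * sizeof *gaps);
    if (gaps == NULL)
        return -1;
    for (int i = 0; i + 1 < n; i++) {
        gaps[i] = ys[i + 1] - ys[i];
        assert(gaps[i] >= 0);      /* input must be sorted */
    }
    qsort(gaps, (size_t)(n - 1), sizeof *gaps, cmp_int);
    int med = gaps[(n - 1) / 2];   /* upper median for an even count */
    free(gaps);
    return med;
}

/*
 * Step 2: a short line is a widow candidate if it sits one leading
 * (+/- delta) from a neighboring baseline and its left edge lines up with
 * that neighbor's left edge to within x_tol.
 */
int is_widow_candidate(int y_short, int x0_short,
                       int y_neighbor, int x0_neighbor,
                       int leading, int delta, int x_tol)
{
    assert(leading > 0);
    int dy = abs(y_short - y_neighbor);
    return abs(dy - leading) <= delta && abs(x0_short - x0_neighbor) <= x_tol;
}
```

A short line that passes `is_widow_candidate` would then be pruned with the second, more lenient length threshold instead of the default one.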

Would you be interested in a PR for consideration once we wrap things up here? It shouldn't be that difficult to implement.

@DanBloomberg
Owner

Thank you for the investigation.

You make a good suggestion to reduce the number of widows (are 'orphans' just widows on a new page?).

The implementation could be a bit tricky if it were to revise pixFindBaselines(), because the morphological filter with the opening will destroy small widows. Also, I would not want to change the interface.

Let me think about this. I might make a second function with at least one extra parameter, that would share much of pixFindBaselines() but would be gentler on the widows.

If you could, please put up a couple of pages with lots of widows for experimentation.

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 15, 2025

Here's a synthetic page at 10pt, a4paper, 300dpi. Multiple widows are not detected as baselines but apparently not due to the new "short baseline removal" (SBR). test5.png

And a real-world image in a dialog/interview format which triggers 5 SBRs. I can't swear this or previous test images are really at 300DPI. test6.png

"Orphans" - yes, exactly!

Note that test6.png is a case where matching the nearest endpoints on top of checking line spacing won't work, because body lines are indented in this format. But there are always trade-offs.

Update: but... the widows do align with one another. There's enough signal for that in pages like test6.

@DanBloomberg
Owner

Those are good test images. test5 is 300 ppi on A4; test6 is 200 ppi on 11 x 8.5. You can tell from the height. At 300 ppi, an 11.69 inch tall page is h = 3508. At 200 ppi, an 11 inch tall page is 2200.
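
The same check as a small sketch (the function name is hypothetical): divide the pixel height by the physical page height in inches and round to the nearest integer.

```c
#include <assert.h>

/* Rounds pixel height / page height in inches to the nearest integer ppi. */
int estimate_ppi(int height_px, double page_height_in)
{
    assert(height_px > 0 && page_height_in > 0.0);
    return (int)(height_px / page_height_in + 0.5);
}
```

Here `estimate_ppi(3508, 11.69)` gives 300 for the A4 page and `estimate_ppi(2200, 11.0)` gives 200 for the letter-size page.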

@DanBloomberg
Owner

See Commit 4e64214.
Added test images 5 & 6.
Generalized to add a parameter specifying the min block width retained.
All saved baselines should have at least one text block referenced to them (i.e., sitting on them).

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 18, 2025

This is another big improvement for my use-case (mostly clean scans, but many short lines). I tested on 200 images or so and though I frequently get split baselines, these seem primarily to be the page header being detected as multiple columns (left page num, and right caption), which is by design.

test7.png is a real-world example in which two ordinary full-width lines are detected as multiple columns by pixFindBaselinesGen(minw=1).

@DanBloomberg
Owner

Thanks!

I can 'fix' this by changing l.304 to read bh > 12. Both the extra components were 3 pixels high at 4x reduction.

Note that if minw < 12, there is no final horizontal opening in the morphological command on l.269.

One thing that has me puzzled is that before changing bh in l.304, running this test script:

    pix0 = pixRead("test7.png");
    pix1 = pixConvertTo1(pix0, 128);
    pixWrite("baseline4.png", pix1, IFF_PNG);
    pixa1 = pixaCreate(0);
    na1 = pixFindBaselinesGen(pix1, 1, &pta1, pixa1);
    numaWriteStderr(na1);
    ptaWriteStream(stderr, pta1, 1);
    pix2 = pixaDisplayTiledInColumns(pixa1, 1, 1.0, 30, 3);
    pixWrite("/tmp/junk1.png", pix2, IFF_PNG);
    pixDestroy(&pix0);
    pixDestroy(&pix1);
    pixDestroy(&pix2);
    numaDestroy(&na1);
    pixaDestroy(&pixa1);
    ptaDestroy(&pta1);

and getting this output:

Numa Version 1
Number of numbers = 2
  [0] = 610.000000
  [1] = 1274.000000


 Pta Version 1
 Number of pts = 8; format = integer
   (620, 610)
   (640, 610)
   (136, 610)
   (624, 610)
   (984, 1274)
   (1084, 1274)
   (132, 1274)
   (1068, 1274)

it generates this image for junk1.png, and the last page image (the one with the underlines) doesn't have the two bogus ones. I don't know why.

@AnonymousCoward128746
Author

AnonymousCoward128746 commented Jan 18, 2025

There are obvious limits to what's attainable using a classical morphological approach. It's up to you to say when you think you've reached them. I think you may just have 😃 .

One can push this further by using more global information, like inter-line mutual alignment and so on, but users have everything required to implement a custom solution and so it's reasonable for leptonica to stick to basic implementations. Most developers who need higher accuracy for computer vision tasks would today reach for AI models, anyway.

Do you feel like this is a good place to stop? The fixes to pixFindBaselines and the addition of pixFindBaselinesGen are already a marked improvement.

@DanBloomberg
Owner

Yes, I agree this is a good place to stop. As you say, morphological filtering methods without using feedback on things like interline distance remain brittle.

Thank you for your interest and help. I hope you find it useful.

@AnonymousCoward128746
Author

You've been great, I really appreciate all the attention you gave to this. 👍

DanBloomberg added a commit that referenced this issue Jan 18, 2025
* remove components up to 3 pixels high (instead of up to 2) at 4x reduction