pixFindBaselines() returns |baselines| != |baseline_endpoint_pairs| #766

AnonymousCoward128746 · 2025-01-12T22:33:31Z

Leptonica is a wonderfully useful library, thank you so much for all your hard and valuable work.

I am using pixFindBaselines and passing in the optional ppta arg in order to get the approximate endpoints for each baseline. Recently, I discovered that the number of entries in ppta (the endpoint pairs) may be smaller than the number of baselines returned in na. This was surprising to me.

I suspect the culprit code is here:

leptonica/src/baseline.c

Lines 245 to 255 in d5ea1db

    
    nloc = numaGetCount(naloc);
    nbox = boxaGetCount(boxa3);
    for (i = 0; i < nbox; i++) {
        boxaGetBoxGeometry(boxa3, i, &bx, &by, &bw, &bh);
        for (j = 0; j < nloc; j++) {
            numaGetIValue(naloc, j, &locval);
            if (L_ABS(locval - (by + bh)) > 25)
                continue;
            ptaAddPt(pta, bx, locval);
            ptaAddPt(pta, bx + bw, locval);
            break;

This does not guarantee that an item will be added to pta for each baseline, and it sometimes fails in practice on real-world images.

Best Wishes,

DanBloomberg · 2025-01-13T00:18:47Z

I agree that this could/should be more robust.

Can you post an image you use (with text lines essentially horizontal), that reproduces the problem?

AnonymousCoward128746 · 2025-01-13T02:51:38Z

Sorry, I should have attached one to begin with. This is a cleaned, redacted image, but it demonstrates the issue as reported. The three baselines are correctly identified, but the returned endpoints don't include the endpoints for the middle line. test1.png

Here is some JSON test output I hacked into my code. ys are the baseline y coordinates, and endpoints is a list of tuples (x0,y0,x1,y1) for each pair of endpoints returned by the function. As you can see, 3 baselines, but only 2 endpoint tuples.

{
  "ys": [
    1750.0,
    1792.0,
    1831.0
  ],
  "endpoints": [
    [
      124.0,
      1750.0,
      1164.0,
      1750.0
    ],
    [
      124.0,
      1831.0,
      1164.0,
      1831.0
    ]
  ]
}
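The mismatch in this output can be checked mechanically. Below is a minimal sketch in plain C (no Leptonica calls), assuming the baseline y-values and endpoint pairs have already been copied into ordinary arrays. `EndpointPair` and `count_pairs_on_baseline()` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical container for one endpoint pair (x0, y0) - (x1, y1). */
typedef struct {
    double x0, y0, x1, y1;
} EndpointPair;

/* Count the endpoint pairs that sit on the baseline at height y.
 * In the baseline.c excerpt both points of a pair are added with the
 * baseline location as their y-value, so an exact comparison is used. */
size_t count_pairs_on_baseline(double y, const EndpointPair *pairs,
                               size_t npairs)
{
    size_t i, count = 0;
    for (i = 0; i < npairs; i++) {
        if (pairs[i].y0 == y && pairs[i].y1 == y)
            count++;
    }
    return count;
}
```

Called once per entry in "ys" with the two tuples from "endpoints", this would report one pair for 1750, none for 1792, and one for 1831, which is the missing middle line described above.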

DanBloomberg · 2025-01-14T04:52:05Z

Thanks for finding this problem.
It should now be fixed.
I also added a couple more tests in baseline_reg.

AnonymousCoward128746 · 2025-01-14T05:07:19Z

The changes still keep the conditional that tries (IIUC) to match up the detected endpoints to the detected baseline, and this match up might still fail, so in theory this could still happen for some inputs, right?

I'll be able to run the new code over my corpus within the next few days and I'll let you know if this fix generalizes properly.

Thank you.

DanBloomberg · 2025-01-14T06:03:52Z

You are correct that it may still be theoretically possible to join the lines after they've initially been located in naloc.
However, I believe this will now be a rare event, because the only morphological operators used to join line segments are strictly horizontal. It was the small vertical closing at 4x reduction that was responsible for joining those two lines.

AnonymousCoward128746 · 2025-01-14T16:01:07Z

I confirm that the issue with the previous test image is fixed.

But, for this test2.png the function returns more than one endpoint pair for the middle line, so again the number of endpoint pairs does not match the number of baselines.

DanBloomberg · 2025-01-14T20:42:39Z

That's actually the design, because sometimes you can have a textline with a large gap, such that two separate textblocks are found on it. That is the case, for example, with the pedante.079.jpg image in baseline_reg.c, where two of the lines of text have separate text blocks. So a 1:1 correspondence of baseline locations with text segments is not required.

However, in the situation you indicated with test2.png, a bogus textbox fragment is generated above the third line, which has a base close to the second line. I've added a filter to remove these small-height boxes, so in your example it is now eliminated and there are only the 3 correct line segments associated with the 3 actual baselines.

A new commit implements this.

AnonymousCoward128746 · 2025-01-14T21:16:44Z

Thank you. Understood, that makes sense for multi-column layouts.

Both test1.png and test2.png pass now.

Here is test3.png. The runt last line returns no endpoints from the function.

Friendly side note: git expects a commit message to include a blank line between the "subject" (first) row and the rest of the commit message body. Without it, git log output suffers.

DanBloomberg · 2025-01-14T21:47:08Z

Thank you for the git tip!

The morphological opening o30.1 in line 222 is what removes runt lines that are less than 120 pixels long (we're working at 4x reduction). I'm OK with this. Note if you were to change that to o20.1, the endpoints of the last line would be returned.
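The 120-pixel figure is the opening width multiplied by the reduction factor. A small sketch of that arithmetic follows; `min_line_length_fullres()` is a hypothetical helper, not a Leptonica function.

```c
/* Shortest horizontal run, measured in full-resolution pixels, that can
 * survive a horizontal opening of width open_w applied to an image that
 * has been reduced by the given factor. Shorter runs are erased. */
int min_line_length_fullres(int open_w, int reduction)
{
    return open_w * reduction;
}
```

With o30.1 at 4x reduction this gives 120 pixels; with o20.1 it gives 80, which is why the shorter last line of test3.png would then be kept.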

AnonymousCoward128746 · 2025-01-14T22:01:16Z

As I wrote earlier, I do think it's reasonable to filter out runts. But, if a baseline is recognized as valid it should have (at least one) pair of endpoints returned. To have a line without endpoints does not make sense to me.

* This is in relation to Issue #766.
* If no textbox is found, we do not know the end points of the baseline. It is almost certainly very short, so it is removed from output.
* Change order of operation: for each baseline, save all textboxes that describe text at that y-location. There can be multiple textboxes for each baseline if the line of text has large horizontal breaks.
* As a result of this change, all reported baselines have x-value endpoints of text that can optionally be returned.
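The reordering described in this commit message can be sketched in plain C. This is an illustration under stated assumptions, not the code in baseline.c: `TextBox` and `attach_boxes_to_baselines()` are hypothetical names, and the tolerance is passed in by the caller (the excerpt at the top of this issue uses 25).

```c
#include <stdlib.h>

/* Hypothetical box type: upper-left corner, width and height. */
typedef struct { int x, y, w, h; } TextBox;

/* Outer loop over baselines, inner loop over boxes.
 * Every box whose bottom edge is within tol pixels of baseline ys[i]
 * contributes one entry {x_left, x_right, y} to pairs.
 * Baselines with no matching box are dropped: ys is compacted in place.
 * pairs must have room for ny * nbox entries.
 * Returns the number of baselines kept; *npairs receives the pair count. */
int attach_boxes_to_baselines(int *ys, int ny,
                              const TextBox *boxes, int nbox, int tol,
                              int (*pairs)[3], int *npairs)
{
    int i, j, kept = 0, np = 0;

    for (i = 0; i < ny; i++) {
        int found = 0;
        for (j = 0; j < nbox; j++) {
            int bottom = boxes[j].y + boxes[j].h;
            if (abs(ys[i] - bottom) > tol)
                continue;
            pairs[np][0] = boxes[j].x;
            pairs[np][1] = boxes[j].x + boxes[j].w;
            pairs[np][2] = ys[i];
            np++;
            found = 1;   /* no break: a baseline may own several boxes */
        }
        if (found)
            ys[kept++] = ys[i];
    }
    *npairs = np;
    return kept;
}
```

Because the outer loop runs over baselines, every baseline that survives has at least one pair, and a line with a large horizontal gap simply yields more than one.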

DanBloomberg · 2025-01-15T00:21:29Z

I agree with you. Code modified appropriately.

AnonymousCoward128746 · 2025-01-15T01:15:18Z

I tested on 300+ images and it works much better. Most of the mismatches occurred on dual-column pages as intended, but there were a handful of cases where visually I could see no reason for multiple endpoint segments: test4.png.

The baseline elimination branch was taken quite often. I guess you could chalk it up to bad typesetting that there are a lot of widows(*) in this document. But also, in pages featuring dialogue-like texts, such as plays or interviews, where each response is just a sentence or two, it's quite common to have many widows on a page. On some pages I had as many as 4 baselines eliminated because of this. But I can always tweak the threshold and, for a morphology-only solution, this is a reasonable compromise.

Finally, I use an obvious hack to easily get the vertical extent of each line (sans ascender/descender), which is to combine the baselines returned from running pixFindBaselines on the image and its top-down mirrored version. Surprisingly, mirroring the image vertically often changes the number of detected baselines. What are your thoughts on this?

(*) That's the proper typographic term for a line consisting of a single word at the end of a paragraph, I learned.

AnonymousCoward128746 · 2025-01-15T01:54:19Z

Here's a suggestion for a heuristic that might better handle widows:

If sufficient lines are found on the page, the median of the y-coord differences between baselines should provide a good estimate on line spacing ("leading"). And you can verify it by ensuring the variance of the samples is small.
If a short line to be pruned falls within a leading+/-delta from a baseline, and one of the endpoints matches one of the endpoints from the same neighbor lines, you could use a second, more lenient threshold for pruning false positives.

I think this would cover a high percentage of cases for widows, in most typical page layouts.
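A rough sketch of the heuristic in plain C, to make the idea concrete. All names are hypothetical, the same delta is reused for the vertical and horizontal checks purely for brevity, and the variance check on the spacing samples is left out.

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Median gap between consecutive baselines (the "leading").
 * ys must be sorted ascending. For an even number of gaps the upper
 * median is returned. Returns -1 if n < 2 or allocation fails. */
int median_leading(const int *ys, int n)
{
    int i, m, *gaps;

    if (n < 2)
        return -1;
    gaps = malloc((size_t)(n - 1) * sizeof *gaps);
    if (!gaps)
        return -1;
    for (i = 0; i < n - 1; i++)
        gaps[i] = ys[i + 1] - ys[i];
    qsort(gaps, (size_t)(n - 1), sizeof *gaps, cmp_int);
    m = gaps[(n - 1) / 2];
    free(gaps);
    return m;
}

/* Nonzero if a short line (y, x0..x1) lies one leading (+/- delta) away
 * from a neighboring baseline (nb_y, nb_x0..nb_x1) and shares its left
 * or right endpoint with it, i.e. it is plausibly a widow. */
int is_plausible_widow(int y, int x0, int x1,
                       int nb_y, int nb_x0, int nb_x1,
                       int leading, int delta)
{
    int spacing_ok = abs(abs(y - nb_y) - leading) <= delta;
    int edge_ok = abs(x0 - nb_x0) <= delta || abs(x1 - nb_x1) <= delta;
    return spacing_ok && edge_ok;
}
```

A short line that passes `is_plausible_widow()` would then be tested against the second, more lenient length threshold instead of the default one.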

Would you be interested in a PR for consideration once we wrap things up here? It shouldn't be that difficult to implement.

DanBloomberg · 2025-01-15T03:20:46Z

Thank you for the investigation.

You make a good suggestion to reduce the number of widows (are 'orphans' just widows on a new page?).

The implementation could be a bit tricky if it were to revise pixFindBaselines(), because the morphological filter with the opening will destroy small widows. Also, I would not want to change the interface.

Let me think about this. I might make a second function with at least one extra parameter, that would share much of pixFindBaselines() but would be gentler on the widows.

If you could, please put up a couple of pages with lots of widows for experimentation.

AnonymousCoward128746 · 2025-01-15T06:15:54Z

Here's a synthetic page at 10pt, a4paper, 300dpi. Multiple widows are not detected as baselines but apparently not due to the new "short baseline removal" (SBR). test5.png

And a real-world image in a dialog/interview format which triggers 5 SBRs. I can't swear this or previous test images are really at 300DPI. test6.png

"Orphans" - yes, exactly!

Note that test6.png is a case where matching the nearest endpoints on top of checking line spacing won't work, because body lines are indented in this format. But there are always trade-offs.

Update: but... the widows do align with one another. There's enough signal for that in pages like test6.

DanBloomberg · 2025-01-15T07:10:00Z

Those are good test images. test5 is 300 ppi on A4; test6 is 200 ppi on 11 x 8.5. You can tell from the height. At 300 ppi, an 11.69 inch tall page is h = 3508. At 200 ppi, an 11 inch tall page is 2200.
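That height check is a single division. A tiny sketch, where `estimate_ppi()` is a hypothetical helper and the physical page height has to be guessed by the caller:

```c
/* Estimate the scan resolution from the image height in pixels and an
 * assumed physical page height in inches, rounded to the nearest integer. */
int estimate_ppi(int height_px, double page_height_in)
{
    return (int)(height_px / page_height_in + 0.5);
}
```

For the two pages discussed here, a height of 3508 on A4 (11.69 in) gives 300, and a height of 2200 on an 11 inch page gives 200.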

DanBloomberg · 2025-01-18T06:53:31Z

See Commit 4e64214.
Added test images 5 & 6.
Generalized to add a parameter specifying the min block width retained.
All saved baselines should have at least one text block referenced to them (i.e., sitting on them).

AnonymousCoward128746 · 2025-01-18T08:58:08Z

This is another big improvement for my use-case (mostly clean scans, but many short lines). I tested on 200 images or so and though I frequently get split baselines, these seem primarily to be the page header being detected as multiple columns (left page num, and right caption), which is by design.

test7.png is a real-world example in which two ordinary full-width lines are detected as multiple columns by pixFindBaselinesGen(minw=1).

DanBloomberg · 2025-01-18T17:56:56Z

Thanks!

I can 'fix' this by changing l.304 to read bh > 12. Both the extra components were 3 pixels high at 4x reduction.

Note that if minw < 12, there is no final horizontal opening in the morphological command on l.269.

One thing that has me puzzled is that before changing bh in l.304, running this test script:

    pix0 = pixRead("test7.png");
    pix1 = pixConvertTo1(pix0, 128);
    pixWrite("baseline4.png", pix1, IFF_PNG);
    pixa1 = pixaCreate(0);
    na1 = pixFindBaselinesGen(pix1, 1, &pta1, pixa1);
    numaWriteStderr(na1);
    ptaWriteStream(stderr, pta1, 1);
    pix2 = pixaDisplayTiledInColumns(pixa1, 1, 1.0, 30, 3);
    pixWrite("/tmp/junk1.png", pix2, IFF_PNG);
    pixDestroy(&pix0);
    pixDestroy(&pix1);
    pixDestroy(&pix2);
    numaDestroy(&na1);
    pixaDestroy(&pixa1);
    ptaDestroy(&pta1);

and getting this output:

Numa Version 1
Number of numbers = 2
  [0] = 610.000000
  [1] = 1274.000000


 Pta Version 1
 Number of pts = 8; format = integer
   (620, 610)
   (640, 610)
   (136, 610)
   (624, 610)
   (984, 1274)
   (1084, 1274)
   (132, 1274)
   (1068, 1274)

it generates this image for junk1.png, and the last page image with the underlines doesn't have the two bogus ones. Don't know why.

AnonymousCoward128746 · 2025-01-18T18:34:18Z

There are obvious limits to what's attainable using a classical morphological approach. It's up to you to say when you think you've reached them. I think you may just have 😃 .

One can push this further by using more global information, like inter-line mutual alignment and so on, but users have everything required to implement a custom solution and so it's reasonable for leptonica to stick to basic implementations. Most developers who need higher accuracy for computer vision tasks would today reach for AI models, anyway.

Do you feel like this is a good place to stop? The fixes to pixFindBaselines and the addition of pixFindBaselinesGen are already a marked improvement.

DanBloomberg · 2025-01-18T19:32:11Z

Yes, I agree this is a good place to stop. As you say, morphological filtering methods without using feedback on things like interline distance remain brittle.

Thank you for your interest and help. I hope you find it useful.

AnonymousCoward128746 · 2025-01-18T19:50:13Z

You've been great, I really appreciate all the attention you gave to this. 👍

DanBloomberg added a commit that referenced this issue Jan 14, 2025

Modify line solidification in pixFindBaselines() to prevent joining l…

6d32545

…ines * This fixes Issue #766

AnonymousCoward128746 closed this as completed Jan 18, 2025

DanBloomberg added a commit that referenced this issue Jan 18, 2025

Final teak to pixFindBaselinesGen() for Issue #766

cfcc172

* remove components up to 3 pixels high (instead of up to 2) at 4x reduction

DanBloomberg added a commit that referenced this issue Jan 18, 2025

Remove default value for minw in pixFindBaselinesGen(); Issue #766

00a3c6b
