Better filter for bogus textboxes in pixFindBaselines()
* Remove bogus textblocks that are really just part of a real textblock
  but have very small height and are above or below the actual textblock.
* Continue to allow more than one textbox for each baseline.
  This is because large gaps between textblocks in a line make it
  difficult to join safely.
* Could also add a minimal vertical closing (c1.2) to filter in order to
  join bogus textboxes; not done yet because it may not be necessary.
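
The filtering rule described above can be sketched without Leptonica. This is a minimal illustration, not the library code: `TextBox`, `Point`, `baseline_endpoints`, and the two named constants are hypothetical stand-ins for the BOXA, PTA, and NUMA handling in pixFindBaselines(). The thresholds (height of 8 or less, distance greater than 24) are the ones used in this commit.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for Leptonica's BOX and PTA entries. */
typedef struct { int x, y, w, h; } TextBox;
typedef struct { int x, y; } Point;

/* Thresholds taken from the commit; the names are illustrative only. */
enum { MIN_BOX_HEIGHT = 8, MAX_BASELINE_DIST = 24 };

/* For each box taller than MIN_BOX_HEIGHT, find the first baseline whose
 * y-location is within MAX_BASELINE_DIST of the box bottom, and emit the
 * left and right end points of that baseline segment.
 * pts must have room for 2 * nbox entries.
 * Returns the number of points written. */
static int baseline_endpoints(const TextBox *boxes, int nbox,
                              const int *baselines, int nloc,
                              Point *pts)
{
    int npts = 0;
    for (int i = 0; i < nbox; i++) {
        const TextBox *b = &boxes[i];
        if (b->h <= MIN_BOX_HEIGHT)
            continue;  /* very short box: treated as bogus */
        for (int j = 0; j < nloc; j++) {
            if (abs(baselines[j] - (b->y + b->h)) > MAX_BASELINE_DIST)
                continue;  /* box bottom too far from this baseline */
            pts[npts++] = (Point){ b->x, baselines[j] };
            pts[npts++] = (Point){ b->x + b->w, baselines[j] };
            break;  /* at most one baseline per box */
        }
    }
    return npts;
}
```

Two boxes on the same text line each contribute their own pair of end points, which matches the decision to allow more than one textbox per baseline. A box of height 8 is dropped, and a box whose bottom is more than 24 pixels from every baseline contributes nothing.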
DanBloomberg committed Jan 14, 2025
1 parent 6d32545 commit f9ef244
Showing 4 changed files with 42 additions and 16 deletions.
Binary file added prog/baseline2.png
14 changes: 14 additions & 0 deletions prog/baseline_reg.c
@@ -131,6 +131,20 @@ L_REGPARAMS *rp;
     pixDestroy(&pix2);
     pixDestroy(&pix5);
     numaDestroy(&na);
+    ptaDestroy(&pta);
+
+    /* Another test for baselines, with bogus short 'textblock' */
+    pixadb = pixaCreate(6);
+    pix1 = pixRead("baseline2.png");
+    na = pixFindBaselines(pix1, &pta, pixadb);
+    regTestCompareValues(rp, 3, numaGetCount(na), 0);  /* 11 */
+    pix2 = pixaDisplayTiledInRows(pixadb, 32, 1500, 1.0, 0, 30, 2);
+    regTestWritePixAndCheck(rp, pix2, IFF_PNG);  /* 12 */
+    pixDisplayWithTitle(pix2, 1400, 500, NULL, rp->display);
+    pixaDestroy(&pixadb);
+    pixDestroy(&pix1);
+    pixDestroy(&pix2);
+    numaDestroy(&na);
     ptaDestroy(&pta);

     return regTestCleanup(rp);
34 changes: 21 additions & 13 deletions src/baseline.c
@@ -242,19 +242,27 @@ PTA *pta;
         *ppta = pta;
     }
     if (pta) {
-        nloc = numaGetCount(naloc);
-        nbox = boxaGetCount(boxa3);
-        for (i = 0; i < nbox; i++) {
-            boxaGetBoxGeometry(boxa3, i, &bx, &by, &bw, &bh);
-            for (j = 0; j < nloc; j++) {
-                numaGetIValue(naloc, j, &locval);
-                if (L_ABS(locval - (by + bh)) > 25)
-                    continue;
-                ptaAddPt(pta, bx, locval);
-                ptaAddPt(pta, bx + bw, locval);
-                break;
-            }
-        }
+        nloc = numaGetCount(naloc);
+        nbox = boxaGetCount(boxa3);
+        /* For each textbox, find the corresponding baseline.
+         * There may be more than one textbox to a baseline.
+         * Bogus textboxes of very small height may have been
+         * generated, and these are removed. Bogus textboxes can
+         * also be eliminated if the bottom is too far from any of
+         * the baselines. Note that the boxes are an expansion from
+         * 4x reduction, so box parameters are multiples of 4. */
+        for (i = 0; i < nbox; i++) {
+            boxaGetBoxGeometry(boxa3, i, &bx, &by, &bw, &bh);
+            if (bh <= 8) continue;
+            for (j = 0; j < nloc; j++) {
+                numaGetIValue(naloc, j, &locval);
+                if (L_ABS(locval - (by + bh)) > 24)
+                    continue;
+                ptaAddPt(pta, bx, locval);
+                ptaAddPt(pta, bx + bw, locval);
+                break;
+            }
+        }
     }
     boxaDestroy(&boxa3);

10 changes: 7 additions & 3 deletions version-notes.html
@@ -90,11 +90,15 @@ <h2 align=center> <IMG SRC="moller52.jpg" border=1 ALIGN_MIDDLE> </h2>
 <pre>

 1.86.0 Not released
+    * Modify pixFindBaselines() to avoid joining textboxes and to
+      ignore bogus textboxes when listing baseline end points.
+    * Modify convertToPSEmbed() to efficiently encode webp input images.
     * Modify compressFilesToPdf() to allow upscale interpolation for
       low resolution pdfs.
-    * Source files changed: pageseg.c, pdfapp.c
-    * Prog files changed: binarizefiles.c, compresspdf.c, croppdf.c,
-      misctest2.c
+    * Source files changed: baseline.c, pageseg.c, pdfapp.c, psio1.c
+    * Prog files changed: baseline_reg.c, binarizefiles.c,
+      compresspdf.c, croppdf.c, misctest2.c,
+    * Prog files added: baseline2.png

 1.85.0 Oct 16, 2024
     * Use wrapper callSystemDebug() instead of system() in programs.