Unwanted spaces in Content #528

Closed
halbano opened this issue Apr 19, 2022 · 7 comments · Fixed by #634


halbano commented Apr 19, 2022

  • PHP Version: 7.2
  • PDFParser Version: 2.2.0 (initially 2.1)

Description:

Hey guys, thanks for the hard work to keep this great library up to date.
Unfortunately, we are having a strange issue parsing a file. We tried adjusting the config as @rubenvanerk suggested in a few issues (in fact, the FontSpaceLimit setting solves several issues with character separation):

$config = new \Smalot\PdfParser\Config();
$config->setHorizontalOffset('');
$config->setFontSpaceLimit(-136);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile("../storage/app/IL-Field-Guide-final-online.pdf");

Expected output & actual output

But we cannot get rid of some unwanted spaces. For example:

"as opposed to persons i n vehicles" should be "as opposed to persons in vehicles"

And more importantly in the PDF section titles we have:

"Consensual Encounters - Tier 1 in Seizure M odel" should be "Consensual Encounters - Tier 1 in Seizure Model"

Additionally, this one is not the expected output either:

"Your Powers and the Suspe ct's R ights during a TERRY Stop" needs to be "Your Powers and the Suspect's Rights during a TERRY Stop"

PDF input

IL-Field-Guide-final-online.pdf

This may not necessarily be a library issue; we suspect that for some reason the PDF itself has a space within the affected phrases.
In any case, we are just starting to use the library. We modified a few things in the vendor folder trying to fix the issue, but we are running out of ideas now.

halbano changed the title from "Unwanted spaces on random content" to "Unwanted spaces in Content" on Apr 19, 2022
k00ni added the bug label on Apr 20, 2022
Collaborator

k00ni commented Apr 20, 2022

Hey. Just to be sure, can you try again with PDFParser v2.2.0?

Author

halbano commented Apr 20, 2022

Hey @k00ni, the same happens using version 2.2.0

Author

halbano commented Apr 21, 2022

@rubenvanerk do you have any thoughts?

@rubenvanerk
Contributor

Sorry, can't help you here.

Author

halbano commented Apr 22, 2022

@k00ni I know you may be slow responding, but please let us know if you have any ideas or suggestions we could try. I appreciate your help in advance.

Collaborator

k00ni commented Apr 25, 2022

Sorry, I can't help you here; otherwise I would have written to you already. The only idea I have is to check the part of the code that uses the FontSpaceLimit config and try to debug it. Good luck.
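
For anyone debugging along these lines, it may help to picture what a space-limit threshold does. In a PDF, a TJ text array mixes strings with numeric position adjustments (in thousandths of a text-space unit), and a sufficiently negative adjustment widens the gap between glyphs. The sketch below is a simplified model of that idea, not PdfParser's actual implementation: the function name joinTjFragments, the parameter $spaceLimit, and the adjustment value -120 are all hypothetical and chosen only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Simplified, hypothetical model of joining the fragments of a TJ array.
 * This is NOT PdfParser's code; it only illustrates a space-limit threshold.
 *
 * @param array<int, string|int|float> $items      Strings are glyph runs, numbers are position adjustments.
 * @param float                        $spaceLimit Adjustments below this value are treated as a word break.
 */
function joinTjFragments(array $items, float $spaceLimit): string
{
    $words = [''];
    foreach ($items as $item) {
        if (is_string($item)) {
            // Append the glyph run to the word currently being built.
            $words[count($words) - 1] .= $item;
        } elseif ($item < $spaceLimit) {
            // The gap is wide enough (more negative than the limit): start a new word.
            $words[] = '';
        }
        // Otherwise the adjustment is treated as kerning and ignored.
    }

    return implode(' ', $words);
}

// Made-up fragment list with an assumed adjustment of -120 inside the word "in".
$fragments = ['persons i', -120, 'n vehicles'];

echo joinTjFragments($fragments, -50.0), PHP_EOL;  // persons i n vehicles
echo joinTjFragments($fragments, -136.0), PHP_EOL; // persons in vehicles
```

Under this model, an adjustment that falls between two candidate limits is read as a space with one limit and as kerning with the other, which is why tuning the limit can fix some words while leaving others split. Dumping the raw adjustments around an affected phrase would show whether a single limit can separate the two cases in a given file.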

@cyprus1spirit

The issue with spaces is not solved in the latest version (2.2.1) either. You need to apply this workaround:

$config = new \Smalot\PdfParser\Config();
$config->setHorizontalOffset(''); // workaround for the extra spaces: insert nothing at horizontal offsets

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('test.pdf');
$pdfText = $pdf->getText();
