Keywords that end a code stream are ignored. #884

Closed
qued opened this issue May 4, 2023 · 2 comments · Fixed by #885
Labels
status: accepted type:anomaly Errors caused by deviations from the PDF Reference

Comments

@qued
Contributor

qued commented May 4, 2023

Bug report

Certain PDFs contain code where a stream of operations ends with a keyword that is not followed by any of the END_KEYWORD characters. Parsing then ends without the keyword being parsed, even though the operation stream is otherwise valid. For example, in the attached PDF, a code stream consists of b'543 0 0 738 0 0 cm\n /Im1 Do'.

The expected behavior is that when a code stream terminates in the middle of a keyword, the keyword is still parsed and, if valid, executed. Readers such as Adobe Reader, Chrome, and Firefox are able to render the images in the example PDF linked below.
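To illustrate the failure mode, here is a minimal sketch of a whitespace-delimited tokenizer. It is not pdfminer.six's actual parser: `tokenize`, `WHITESPACE`, and the `flush_at_eof` flag are hypothetical names made up for this example, and real PDF lexing has more delimiters than whitespace. It only shows how a trailing keyword gets dropped when tokens are emitted solely on seeing a delimiter.

```python
from typing import List

# Assumption: only whitespace ends a token in this simplified sketch.
WHITESPACE = b" \t\r\n\f\x00"


def tokenize(data: bytes, flush_at_eof: bool = True) -> List[bytes]:
    """Split a stream into whitespace-delimited tokens.

    With flush_at_eof=False a token is emitted only when a delimiter
    follows it, so a keyword at the very end of the stream is lost.
    """
    tokens: List[bytes] = []
    current = bytearray()
    for byte in data:
        if bytes([byte]) in WHITESPACE:
            if current:
                tokens.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    # The stream ended while a token was still pending.
    if current and flush_at_eof:
        tokens.append(bytes(current))
    return tokens


stream = b"543 0 0 738 0 0 cm\n /Im1 Do"
buggy = tokenize(stream, flush_at_eof=False)  # trailing "Do" is dropped
fixed = tokenize(stream, flush_at_eof=True)   # trailing "Do" is kept
```

In this sketch `buggy` ends with `b"/Im1"`, so the `Do` operator never runs and the image is never drawn, while `fixed` ends with `b"Do"`.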

To reproduce

from pdfminer.high_level import extract_pages

elements = [el for page in extract_pages("IRS-form-1987.pdf") for el in page]

...will produce an empty list: neither of the two pages yields any elements, even though each page contains a single image as an XObject.

Attached file used to reproduce: IRS-form-1987.pdf

@pietermarsman
Member

I can reproduce this.

from pdfminer.high_level import extract_pages

for page in extract_pages("IRS-form-1987.pdf"):
    for el in page:
        print(page, el)

Prints nothing.

@pietermarsman pietermarsman added status: accepted type:anomaly Errors caused by deviations from the PDF Reference labels Dec 31, 2023
dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Aug 1, 2024
@cole-dda

cole-dda commented Dec 25, 2024

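The fix introduces another problem.

The parser reads through a buffer. Suppose the stream is b'/IM Do' but the buffer currently holds only b'/IM D'. The keyword is then parsed as 'D'. With the old method, the parser would continue filling the buffer and parse again, yielding 'Do'.

The code may be:
screenshot_6596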
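The buffer-boundary problem described above can be reproduced with a small sketch. This is not the pdfminer.six implementation or the code from the screenshot: `tokens_from_chunks`, `bufsize`, and `flush_on_buffer_end` are hypothetical names, and whitespace is assumed to be the only delimiter. The point is that a pending keyword should be flushed only when the underlying read returns no more data, not whenever the current buffer runs out.

```python
import io
from typing import BinaryIO, Iterator

# Assumption: only whitespace ends a token in this simplified sketch.
WHITESPACE = b" \t\r\n\f\x00"


def tokens_from_chunks(
    fp: BinaryIO, bufsize: int, flush_on_buffer_end: bool
) -> Iterator[bytes]:
    """Yield whitespace-delimited tokens while reading fp in chunks."""
    current = bytearray()
    while True:
        buf = fp.read(bufsize)
        if not buf:
            break  # true end of stream
        for byte in buf:
            if bytes([byte]) in WHITESPACE:
                if current:
                    yield bytes(current)
                    current.clear()
            else:
                current.append(byte)
        if flush_on_buffer_end and current:
            # Premature flush: the end of the buffer is mistaken for
            # the end of the stream, which can split a keyword in two.
            yield bytes(current)
            current.clear()
    # Flush the pending token only once the stream is really exhausted.
    if current:
        yield bytes(current)


# A 5-byte buffer makes the first chunk b'/IM D' and the second b'o'.
wrong = list(tokens_from_chunks(io.BytesIO(b"/IM Do"), 5, True))
right = list(tokens_from_chunks(io.BytesIO(b"/IM Do"), 5, False))
```

Here `wrong` contains the split keyword `b"D"` followed by `b"o"`, while `right` contains the intact `b"Do"`.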
