
Closed
Open
Opened Jun 01, 2020 by Vidmantas (@DonutLaser)

pdftocairo doesn't fix the whole pdf

We work with a wide variety of PDFs, and more often than not we receive one that is corrupt in some way: running pdftocairo, pdftotext, or any other poppler tool on the document prints quite a few errors:

Syntax Error (678743): Missing 'endstream' or incorrect stream length
Syntax Error (1253955): Missing 'endstream' or incorrect stream length
Syntax Error: Missing 'endstream' or incorrect stream length
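To flag such documents automatically, one option is to count the "Syntax Error" lines that pdftotext prints. This is only a sketch: the helper name `count_syntax_errors` is made up for illustration, and it assumes poppler writes these messages to stderr and that they start with the text shown above.

```python
import subprocess


def count_syntax_errors(pdf_path, run=subprocess.run):
    """Count 'Syntax Error' lines printed while extracting text.

    Assumption: poppler reports these messages on stderr.
    `run` is injectable so the function can be exercised without poppler.
    """
    result = run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=False,  # a damaged file may still yield partial output
    )
    # Decode defensively; damaged files can produce unexpected bytes.
    stderr_text = result.stderr.decode("utf-8", errors="replace")
    return sum(
        1 for line in stderr_text.splitlines()
        if line.startswith("Syntax Error")
    )
```

A non-zero count would mark the file as a candidate for a repair pass.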

We use pdftocairo to repair such PDFs, and it works well for the most part, but we have noticed that sometimes a single pdftocairo pass does not repair the whole PDF. I have attached a file that reproduces this: corrupt.pdf

To reproduce:

  1. Call pdftotext corrupt.pdf -
  2. Observe the output: clearly not all of the text was extracted from the PDF.
  3. Call pdftocairo -pdf corrupt.pdf corrupt-fixed.pdf
  4. Call pdftotext again on corrupt-fixed.pdf
  5. Observe the output: the text is still not fully extracted.
  6. Call pdftocairo -pdf corrupt-fixed.pdf corrupt-fixed-twice.pdf
  7. Call pdftotext on corrupt-fixed-twice.pdf
  8. Observe that the output is now larger and does in fact contain all of the text in the PDF.
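The steps above can be scripted so the growth in extracted text is visible as numbers. This is a rough sketch, not part of poppler: the function names and the `-passN.pdf` output file names are hypothetical, and only the `pdftotext <file> -` and `pdftocairo -pdf <in> <out>` invocations come from the steps themselves.

```python
import subprocess
from pathlib import Path


def extracted_text_length(pdf_path, run=subprocess.run):
    """Return the number of bytes `pdftotext <pdf> -` writes to stdout."""
    result = run(["pdftotext", str(pdf_path), "-"],
                 capture_output=True, check=False)
    return len(result.stdout)


def repair_once(src, dst, run=subprocess.run):
    """Rewrite src as a new PDF using `pdftocairo -pdf src dst`."""
    run(["pdftocairo", "-pdf", str(src), str(dst)],
        capture_output=True, check=False)


def measure_passes(original, passes=2, run=subprocess.run):
    """Return text lengths for the original file and after each pass."""
    original = Path(original)
    current = original
    lengths = [extracted_text_length(current, run)]
    for i in range(1, passes + 1):
        # Hypothetical naming scheme: corrupt-pass1.pdf, corrupt-pass2.pdf, ...
        repaired = original.with_name(f"{original.stem}-pass{i}.pdf")
        repair_once(current, repaired, run)
        lengths.append(extracted_text_length(repaired, run))
        current = repaired
    return lengths
```

Calling `measure_passes("corrupt.pdf")` returns three lengths. If the behaviour described here reproduces, the last value should be larger than the first two.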

Expected: a single pdftocairo call on a document should produce the fully repaired PDF.
Actual: we have to call pdftocairo twice on a document to get the fully repaired PDF.

Version: 0.89.0

Reference: poppler/poppler#922