pdftocairo doesn't fix the whole pdf
We work with a variety of pdfs, and more often than not we get one that is corrupt in some way: running pdftocairo, pdftotext, or any other poppler tool on the document produces quite a few errors:
Syntax Error (678743): Missing 'endstream' or incorrect stream length
Syntax Error (1253955): Missing 'endstream' or incorrect stream length
Syntax Error: Missing 'endstream' or incorrect stream length
We use pdftocairo to fix such pdfs, and it works well for the most part, but we have noticed that pdftocairo sometimes doesn't fix the whole pdf on the first pass. I have attached a file that reproduces this: corrupt.pdf
pdftotext corrupt.pdf -
- Observe the output: clearly, not all of the text was extracted from the pdf.
pdftocairo -pdf corrupt.pdf corrupt-fixed.pdf
- Call pdftotext again, this time on the fixed pdf (corrupt-fixed.pdf)
- Observe the output: the text is still not fully extracted.
pdftocairo -pdf corrupt-fixed.pdf corrupt-fixed-twice.pdf
- Call pdftotext on corrupt-fixed-twice.pdf
- Observe that the output is now longer and in fact contains all the text in the pdf.
Expected: we should only have to call pdftocairo on a document once to get the fully fixed pdf
Actual: we have to call pdftocairo twice on a document to get the fully fixed pdf
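To make the comparison easier to repeat, here is a rough sketch that wraps the steps above in a small shell helper. It uses only the pdftotext and pdftocairo invocations already shown; the function names (extracted_bytes, compare_passes) and the output file names (fixed-once.pdf, fixed-twice.pdf) are made up for illustration. It compares byte counts of the extracted text, which is only a rough stand-in for reading the output by eye.

```shell
#!/bin/sh
# Hypothetical helper: number of bytes of text pdftotext extracts from a pdf.
# stderr is discarded so the "Syntax Error" lines do not clutter the output.
extracted_bytes() {
    pdftotext "$1" - 2>/dev/null | wc -c | tr -d '[:space:]'
}

# Hypothetical helper: run pdftocairo twice, as in the steps above, and
# print the extracted text size for the original and for each pass.
#   $1 = input pdf, $2 = directory for the intermediate files
compare_passes() {
    input=$1
    outdir=$2
    pdftocairo -pdf "$input" "$outdir/fixed-once.pdf" 2>/dev/null
    pdftocairo -pdf "$outdir/fixed-once.pdf" "$outdir/fixed-twice.pdf" 2>/dev/null
    echo "original: $(extracted_bytes "$input")"
    echo "once: $(extracted_bytes "$outdir/fixed-once.pdf")"
    echo "twice: $(extracted_bytes "$outdir/fixed-twice.pdf")"
}
```

After sourcing the file, something like `compare_passes corrupt.pdf /tmp` should print three sizes. If the behaviour described above reproduces, the "twice" figure would be larger than the "once" figure.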