pdftocairo doesn't repair the whole PDF in a single pass
We work with a wide variety of PDFs, and more often than not we receive one that is corrupt in some way: running pdftocairo, pdftotext, or any other Poppler tool on the document prints quite a few errors, for example:
Syntax Error (678743): Missing 'endstream' or incorrect stream length
Syntax Error (1253955): Missing 'endstream' or incorrect stream length
Syntax Error: Missing 'endstream' or incorrect stream length
We use pdftocairo to repair such PDFs, and it works well for the most part, but we have noticed that a single pdftocairo run sometimes doesn't repair the whole PDF. I have attached a file that reproduces this: corrupt.pdf
To reproduce:
- Call
pdftotext corrupt.pdf -
- Observe the output: clearly, not all of the text was extracted from the PDF.
- Call
pdftocairo -pdf corrupt.pdf corrupt-fixed.pdf
- Call pdftotext again, this time on corrupt-fixed.pdf
- Observe the output: the text is still not fully extracted.
- Call
pdftocairo -pdf corrupt-fixed.pdf corrupt-fixed-twice.pdf
- Call pdftotext on corrupt-fixed-twice.pdf
- Observe that the output is now longer and in fact contains all of the text in the PDF.
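The steps above can be condensed into a small script for checking other files. This is only a sketch: the function names extracted_bytes and compare_passes and the default of 3 passes are illustrative, not part of Poppler, and the byte count of the pdftotext output is used as a rough proxy for how much text was recovered.

```shell
#!/bin/sh
# Sketch: report how much text pdftotext extracts from a PDF before and
# after each successive pdftocairo pass.

# Print the number of bytes of text pdftotext extracts from a PDF.
# Syntax errors are written to stderr and discarded here.
extracted_bytes() {
    pdftotext "$1" - 2>/dev/null | wc -c | tr -d ' '
}

# Feed pdftocairo its own output up to max_passes times (illustrative
# default: 3) and print the extracted text size after each pass.
compare_passes() {
    input=$1
    max_passes=${2:-3}
    workdir=$(mktemp -d) || return 1

    current=$input
    echo "pass 0: $(extracted_bytes "$current") bytes"

    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        next="$workdir/pass-$pass.pdf"
        pdftocairo -pdf "$current" "$next" 2>/dev/null
        # Stop if this pass produced no output file at all.
        [ -f "$next" ] || break
        echo "pass $pass: $(extracted_bytes "$next") bytes"
        current=$next
        pass=$((pass + 1))
    done

    rm -rf "$workdir"
}
```

After sourcing the script, running compare_passes corrupt.pdf prints one line per pass. If a single pass were enough, the sizes for pass 1 and pass 2 would be expected to match; with the attached file the steps above suggest they differ.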
Expected: calling pdftocairo once on a document should be enough to get a fully repaired PDF
Actual: pdftocairo has to be called twice on the document to get a fully repaired PDF
Version: 0.89.0