pdftocairo -pdf output breaks extracted text
Submitted by nop..@..il.com
Assigned to poppler-bugs
Link to original bug (#106444)
Description
Created attachment 139431 PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.
Under Ubuntu 16.04, processing certain PDFs with pdftocairo -pdf (both version 0.41.0 (pkg) and 0.64.0 (src)) produces a PDF whose extracted text appears as question mark symbols, which suggests a text encoding problem. The rendered image output appears correct.
I first observed the problem while programmatically processing the text layer rendered by pdf.js, then confirmed it in the output of pdftotext. The same happens when copying text from other PDF viewers.
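For anyone trying to reproduce this, here is a minimal Python sketch of the check described above. It assumes pdftocairo and pdftotext are on PATH. The function names (garbled_ratio, extract_text, roundtrip_ratio) and the choice to count both "?" and U+FFFD as garbled are my own illustration, not part of poppler.

```python
import subprocess


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are '?' or U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return bad / len(visible)


def extract_text(pdf_path: str) -> str:
    """Run `pdftotext <pdf> -` and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def roundtrip_ratio(src_pdf: str, out_pdf: str) -> float:
    """Rewrite src_pdf with `pdftocairo -pdf`, then score the extracted text."""
    subprocess.run(["pdftocairo", "-pdf", src_pdf, out_pdf], check=True)
    return garbled_ratio(extract_text(out_pdf))
```

Comparing garbled_ratio(extract_text(original)) with roundtrip_ratio(original, output) should show whether the rewritten file lost its text mapping. A document that legitimately contains many question marks would need a different heuristic.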
Interestingly, when the same PDF is processed on a Mac with pdftocairo (0.64.0), the text extracted from the output PDF appears correct. I am not sure whether it is relevant, but in the attached example I do observe some differences in the font encoding, as shown below.
pdffonts from original PDF:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
FFXDHY+ArialMT                       TrueType          MacRoman         yes yes no      10  0
EESSLH+Helvetica                     TrueType          WinAnsi          yes yes yes      9  0
pdffonts after processing on Ubuntu:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DFUWOB+ArialMT                       CID TrueType      Identity-H       yes yes yes      5  0
pdffonts after processing on Mac:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DFUWOB+ArialMT                       TrueType          WinAnsi          yes yes yes      5  0
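To compare the font listings above across machines without reading them by eye, a small parser may help. This is only a sketch: parse_pdffonts is a hypothetical helper, and it assumes the raw pdffonts output keeps its fixed-width layout, with the dashed ruler line marking the column boundaries. That matters because a type such as "CID TrueType" contains a space, so splitting on whitespace would misalign the fields.

```python
import re
from typing import Dict, List


def parse_pdffonts(output: str) -> List[Dict[str, str]]:
    """Parse raw pdffonts output into one dict per font.

    Column boundaries are taken from the dashed ruler line, so values
    that contain spaces (e.g. "CID TrueType") stay in one field.
    """
    lines = output.splitlines()
    ruler_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("---")), None
    )
    if ruler_idx is None or ruler_idx == 0:
        return []  # no ruler, or no header line above it

    # Each run of dashes gives the (start, end) span of one column.
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    header = lines[ruler_idx - 1]
    names = [header[a:b].strip() for a, b in spans]

    rows = []
    for line in lines[ruler_idx + 1:]:
        if not line.strip():
            continue
        rows.append({n: line[a:b].strip() for n, (a, b) in zip(names, spans)})
    return rows
```

With this, the "type", "encoding" and "uni" fields of the Ubuntu and Mac outputs can be diffed directly. Very long font names may overflow the name column in real output, in which case this slicing approach would need adjusting.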
Attachment 139431, "PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.":
cairo-optimized-pdf-extract-text-bug-report.zip