Issue #617 (Closed)

Created May 08, 2018 by Bugzilla Migration User (@bugzilla-migration)

pdftocairo -pdf output breaks extracted text

Submitted by nop..@..il.com

Assigned to poppler-bugs

Link to original bug (#106444)

Description

Created attachment 139431: PDFs (original and outputs from Ubuntu and Mac), plus extracted text from the original and from the Ubuntu-optimized PDF.

On Ubuntu 16.04, processing certain PDFs with pdftocairo -pdf (both version 0.41.0 from the package and version 0.64.0 built from source) produces a PDF whose extracted text appears as question marks, which suggests a text encoding problem. The rendered output looks correct.

I first noticed the problem while programmatically processing the text layer rendered by pdf.js, and then confirmed it in the output of pdftotext. Copying text from other PDF viewers shows the same behavior.
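
To make the symptom measurable, here is a minimal Python sketch that runs the same two tools and reports how much of the extracted text consists of question marks. The function names are hypothetical, and counting both "?" and U+FFFD is an assumption, since the report does not say which character the viewers show.

```python
import subprocess
from typing import Tuple


def question_mark_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are '?' or U+FFFD.

    Assumption: broken extraction shows up as one of these two characters.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch in "?\ufffd")
    return bad / len(visible)


def extract_text(pdf_path: str) -> str:
    """Return the text of a PDF using pdftotext ('-' sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def roundtrip_ratios(src_pdf: str, out_pdf: str) -> Tuple[float, float]:
    """Run 'pdftocairo -pdf' and compare the ratios before and after."""
    subprocess.run(["pdftocairo", "-pdf", src_pdf, out_pdf], check=True)
    return (
        question_mark_ratio(extract_text(src_pdf)),
        question_mark_ratio(extract_text(out_pdf)),
    )
```

A ratio near zero for the original and near one for the pdftocairo output would match the behavior described here. Legitimate question marks in the document are counted too, so only the difference between the two values is meaningful.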

Interestingly, when the same PDF is processed on a Mac with pdftocairo (0.64.0), the text extracted from the output PDF is correct. I am not sure whether it is relevant, but in the attached example I do see some differences in the font encoding, as shown below.

pdffonts from original PDF:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
FFXDHY+ArialMT                       TrueType          MacRoman         yes yes no      10  0
EESSLH+Helvetica                     TrueType          WinAnsi          yes yes yes      9  0

pdffonts after processing on Ubuntu:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DFUWOB+ArialMT                       CID TrueType      Identity-H       yes yes yes      5  0

pdffonts after processing on Mac:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DFUWOB+ArialMT                       TrueType          WinAnsi          yes yes yes      5  0
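
The listings above can also be compared mechanically. The following is a rough sketch, not part of poppler: it assumes font names and encodings contain no spaces (true for the rows shown here), and the helper names are hypothetical. It parses pdffonts output and reports fonts whose encoding differs between two runs, ignoring the six-letter subset prefix.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class FontRow(NamedTuple):
    name: str
    type: str
    encoding: str
    emb: str
    sub: str
    uni: str


def parse_pdffonts(output: str) -> List[FontRow]:
    """Parse pdffonts text output into rows, skipping the two header lines."""
    rows = []
    for line in output.splitlines():
        tokens = line.split()
        # A data row has at least: name, type, encoding, emb, sub, uni, obj, gen.
        if len(tokens) < 8 or tokens[0] == "name" or tokens[0].startswith("---"):
            continue
        # The type may span several tokens (e.g. "CID TrueType"), so the
        # fixed-size fields are taken from the right-hand side.
        rows.append(
            FontRow(
                name=tokens[0],
                type=" ".join(tokens[1:-6]),
                encoding=tokens[-6],
                emb=tokens[-5],
                sub=tokens[-4],
                uni=tokens[-3],
            )
        )
    return rows


def base_name(font_name: str) -> str:
    """Strip a subset prefix such as 'DFUWOB+' from a font name."""
    return re.sub(r"^[A-Z]{6}\+", "", font_name)


def encoding_changes(
    a: List[FontRow], b: List[FontRow]
) -> Dict[str, Tuple[str, str]]:
    """Map base font name to (encoding in a, encoding in b) where they differ."""
    enc_a = {base_name(r.name): r.encoding for r in a}
    enc_b = {base_name(r.name): r.encoding for r in b}
    return {
        name: (enc_a[name], enc_b[name])
        for name in enc_a
        if name in enc_b and enc_a[name] != enc_b[name]
    }
```

Fed the Ubuntu and Mac listings above, this would flag ArialMT as Identity-H on Ubuntu versus WinAnsi on Mac. Whether that difference causes the broken text is not established in this report.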

Attachment 139431, "PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.":
cairo-optimized-pdf-extract-text-bug-report.zip
