Closed
Open
Opened Jan 05, 2021 by Renkema (@renkema)

pdfinfo -struct-text and nested nonstructural marked content

I ran into a problem with pdfinfo -struct-text that I believe is a bug.

The attached file demonstrates the problem. Extracting the textual content gives this result:

$ pdfinfo -struct-text poppler-bug.pdf
Document
  P (block)
    "xxxyyy"

where the expected output would be:

Document
  P (block)
    "xxxyyyzzz"

The relevant part of the attached PDF is the page content stream:

3 0 obj
<< /Length 215 >>
stream
1 0 0 1 48.272 46.73 cm
/P <</MCID 0>> BDC 1 0 0 1 -48.272 -46.73 cm
BT
/F1 9.96264 Tf
1 0 0 1 48.272 46.73 Tm [(xxx)]TJ
/Span << >> BDC
1 0 0 1 64.046 46.73 Tm [(yyy)]TJ
EMC
1 0 0 1 79.82 46.73 Tm [(zzz)]TJ
ET
EMC

endstream

As you can see, the problem is triggered by a nonstructural marked-content sequence (the /Span) nested inside a marked-content item (the one tagged /P). This nesting is explicitly allowed by the PDF specification (see §14.7.4.1, p.560), yet it cuts pdfinfo's text extraction short: the text shown after the nested sequence's EMC ("zzz") is dropped.
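To make the expected behaviour concrete, here is a minimal Python sketch that walks the content stream above and attributes each shown string to the nearest enclosing marked-content sequence that carries an MCID. It is only an illustration under simplifying assumptions: the tokenizer handles just the constructs in this sample (no string escapes, hex strings, or inline images), and the names `text_by_mcid` and `track_nesting` are made up for this example. The `track_nesting=False` mode merely mimics the observed symptom by forgetting all open sequences at the first EMC; it is not a claim about how poppler is actually implemented.

```python
import re

# Page content stream copied from the report above.
CONTENT = r"""
1 0 0 1 48.272 46.73 cm
/P <</MCID 0>> BDC 1 0 0 1 -48.272 -46.73 cm
BT
/F1 9.96264 Tf
1 0 0 1 48.272 46.73 Tm [(xxx)]TJ
/Span << >> BDC
1 0 0 1 64.046 46.73 Tm [(yyy)]TJ
EMC
1 0 0 1 79.82 46.73 Tm [(zzz)]TJ
ET
EMC
"""

# Simplified tokens: property dictionaries, literal strings without
# escapes or nested parentheses, and bare operands/operators.
TOKEN = re.compile(r"<<.*?>>|\((.*?)\)|[^\s\[\]()<>]+", re.S)


def text_by_mcid(content, track_nesting=True):
    """Return {mcid: text} for the text shown inside each marked-content item."""
    stack = []          # one entry per open BDC: its MCID, or None if it has none
    props = None        # most recent property dictionary, consumed by BDC
    strings = []        # string operands waiting for a text-showing operator
    result = {}
    for match in TOKEN.finditer(content):
        tok = match.group(0)
        if tok.startswith("<<"):
            props = tok
        elif tok.startswith("("):
            strings.append(match.group(1))
        elif tok == "BDC":
            found = re.search(r"/MCID\s+(\d+)", props or "")
            stack.append(int(found.group(1)) if found else None)
            props = None
        elif tok == "EMC":
            if track_nesting:
                stack.pop()      # close only the innermost sequence
            else:
                stack.clear()    # hypothetical flat handling: forget everything
        elif tok in ("TJ", "Tj"):
            # Nearest enclosing sequence that has an MCID, if any.
            mcid = next((e for e in reversed(stack) if e is not None), None)
            if mcid is not None:
                result[mcid] = result.get(mcid, "") + "".join(strings)
            strings = []
    return result


print(text_by_mcid(CONTENT))                       # {0: 'xxxyyyzzz'}
print(text_by_mcid(CONTENT, track_nesting=False))  # {0: 'xxxyyy'}
```

With a proper stack, the EMC that closes the /Span returns to the enclosing /P item, so "zzz" is still counted as part of MCID 0.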

This is on Debian unstable, with the following version:

$ pdfinfo -v
pdfinfo version 20.09.0
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Edited Jan 05, 2021 by Renkema
Reference: poppler/poppler#1025