pdfinfo -struct-text and nested nonstructural marked content
I ran into a problem with pdfinfo -struct-text, which I think is caused by a bug.
The attached file demonstrates the problem. Extracting the textual content gives this result:
$ pdfinfo -struct-text poppler-bug.pdf
Document
P (block)
"xxxyyy"
where the expected output would be:
Document
P (block)
"xxxyyyzzz"
The relevant part of the attached PDF is the page content stream:
3 0 obj
<< /Length 215 >>
stream
1 0 0 1 48.272 46.73 cm
/P <</MCID 0>> BDC 1 0 0 1 -48.272 -46.73 cm
BT
/F1 9.96264 Tf
1 0 0 1 48.272 46.73 Tm [(xxx)]TJ
/Span << >> BDC
1 0 0 1 64.046 46.73 Tm [(yyy)]TJ
EMC
1 0 0 1 79.82 46.73 Tm [(zzz)]TJ
ET
EMC
endstream
As you see, the problem is caused by a nonstructural marked content sequence
(the /Span) inside a marked content item (marked with /P). This is
explicitly allowed by the specification (see §14.7.4.1, p.560), yet somehow
cuts short pdfinfo’s text extraction function.
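To make the expected behaviour concrete, here is a rough Python sketch of how I would expect nested marked content to be handled. It is my own illustration, not poppler's code: the function name extract_mcid_text and the regex tokenizer are made up for this report, and the tokenizer is far too simple for real PDFs (no escaped or nested parentheses, no hex strings). The point is only that each BDC pushes onto a stack and each EMC pops one level, so the EMC that closes the inner /Span does not end the outer /P item.

```python
import re

# The page content stream from the attached file, copied from above.
STREAM = """
1 0 0 1 48.272 46.73 cm
/P <</MCID 0>> BDC 1 0 0 1 -48.272 -46.73 cm
BT
/F1 9.96264 Tf
1 0 0 1 48.272 46.73 Tm [(xxx)]TJ
/Span << >> BDC
1 0 0 1 64.046 46.73 Tm [(yyy)]TJ
EMC
1 0 0 1 79.82 46.73 Tm [(zzz)]TJ
ET
EMC
"""

# Simplified tokens: a literal string, an /MCID entry, or a marked-content operator.
TOKEN_RE = re.compile(
    r"\((?P<text>[^()]*)\)|/MCID\s+(?P<mcid>\d+)|\b(?P<op>BDC|BMC|EMC)\b"
)


def extract_mcid_text(stream, mcid):
    """Collect the text shown inside the marked content item with this MCID.

    Hypothetical helper for illustration only; it assumes simple literal
    strings and ignores everything except BDC/BMC/EMC and /MCID entries.
    """
    pending_mcid = None  # MCID seen in the property list before the next BDC
    stack = []           # one entry per open sequence: its MCID, or None
    pieces = []
    for match in TOKEN_RE.finditer(stream):
        if match.group("mcid") is not None:
            pending_mcid = int(match.group("mcid"))
        elif match.group("op") in ("BDC", "BMC"):
            # Nonstructural sequences (no MCID) are pushed as None.
            stack.append(pending_mcid)
            pending_mcid = None
        elif match.group("op") == "EMC":
            # EMC closes only the innermost open sequence.
            if stack:
                stack.pop()
        elif match.group("text") is not None:
            # Text belongs to the item as long as it is open at any depth.
            if mcid in stack:
                pieces.append(match.group("text"))
    return "".join(pieces)


if __name__ == "__main__":
    print(extract_mcid_text(STREAM, 0))
```

Run on the stream above, this prints xxxyyyzzz, which matches the output I expected from pdfinfo -struct-text.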
This is on Debian unstable, with the following version:
$ pdfinfo -v
pdfinfo version 20.09.0
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC