Poppler::Page::text not working correctly with RawOrderLayout
I am trying to get the plain text from a document, in content order. Page::text(QRectF{}, Page::PhysicalLayout) works reasonably well and is able to extract the complete contents. For Page::RawOrderLayout, the results are fairly broken:
- The first, trivial document returns the contents without spaces between words.
- The second, slightly more complex document does not return any text at all.
When using pdftotext with -raw, with -layout, or in its default mode, the content is correct.
The missing spaces are likely caused by implementation differences in TextOutputDev between TextPage::getText (used by Poppler::Page::text) and TextPage::dump (used by pdftotext) - the latter has some code to insert spaces:
https://gitlab.freedesktop.org/poppler/poppler/-/blame/master/poppler/TextOutputDev.cc?ref_type=heads&page=6#L5391
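
To illustrate the kind of space insertion I mean, here is a minimal, self-contained sketch. It is not Poppler's code: the Word struct, the joinRawOrder function and the lineTolerance value are all made up for this example, and I am only assuming that the real logic works roughly along these lines (a space between consecutive words on the same line, a line break otherwise).

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a word in raw (content stream) order.
// Not a Poppler type; the fields are assumptions for this sketch.
struct Word {
    std::string text;
    double xMin;
    double xMax;
    double baseline;
};

// Concatenate words in the order given. Consecutive words whose baselines
// differ by less than lineTolerance are separated by a space; otherwise a
// newline is emitted. The last word is always followed by a newline.
std::string joinRawOrder(const std::vector<Word> &words, double lineTolerance = 0.5)
{
    std::string out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        out += words[i].text;
        if (i + 1 == words.size()) {
            out += '\n';
            break;
        }
        const bool sameLine =
            std::fabs(words[i + 1].baseline - words[i].baseline) < lineTolerance;
        out += sameLine ? ' ' : '\n';
    }
    return out;
}
```

Without the separator step, the words would simply be concatenated, which matches what I see for the first document with Page::RawOrderLayout.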