libpoppler cannot recreate pdftotext output, because physical_layout is not handled correctly
Submitted by dum..@..gmx.fr
Assigned to poppler-bugs
Link to original bug (#103798)
Description
Dear maintainer, this bug concerns poppler 0.48.0 up to at least 0.60.1
in file .../gcc/poppler-page.cpp
the function
ustring page::text(const rectf &r, text_layout_enum layout_mode) const
when called with physical_layout as layout_mode, incorrectly creates a TextOutputDev whose second parameter (which should be true for physical_layout) is always set to gFalse, because the corresponding code in lines 272 and 273 (poppler 0.60.1) is
const GBool use_raw_order = (layout_mode == raw_order_layout);
TextOutputDev td(0, gFalse, 0, use_raw_order, gFalse);
By contrast, pdftotext.cc creates its TextOutputDev with the second parameter set to gTrue when invoked with the -layout command-line option.
The effect is that the text produced inside a program using libpoppler differs from the more faithful text (which has, for example, blank lines where required) produced by invoking pdftotext with the -layout option.
Would the following be a solution?
const GBool use_raw_order = (layout_mode == raw_order_layout);
const GBool use_physical_layout = !use_raw_order;
TextOutputDev td(0, use_physical_layout, 0, use_raw_order, gFalse);
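To make the difference concrete, here is a minimal, self-contained sketch of the two flag mappings. It does not use poppler itself: the enum is only a stand-in for text_layout_enum, and LayoutFlags, current_flags and proposed_flags are hypothetical names introduced for illustration. It assumes, as described above, that the second and fourth TextOutputDev parameters are the physical-layout and raw-order flags.

```cpp
// Stand-in for the layout enum used by page::text(); not the real poppler header.
enum text_layout_enum { physical_layout, raw_order_layout };

// Hypothetical helper type: the two layout-related flags handed to TextOutputDev.
struct LayoutFlags {
    bool physical_layout;  // second constructor parameter
    bool raw_order;        // fourth constructor parameter
};

// Mapping as in the 0.60.1 code quoted above:
// the physical-layout flag is always false, whatever the caller asked for.
constexpr LayoutFlags current_flags(text_layout_enum mode) {
    return { false, mode == raw_order_layout };
}

// Mapping suggested in this report:
// physical layout is enabled whenever raw order is not requested.
constexpr LayoutFlags proposed_flags(text_layout_enum mode) {
    return { mode != raw_order_layout, mode == raw_order_layout };
}
```

With the current mapping, a caller asking for physical_layout gets both flags false, which matches neither pdftotext -layout nor pdftotext -raw. With the suggested mapping, physical_layout yields (true, false) and raw_order_layout is unchanged.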
I would be grateful if this could be fixed. The alternative, which I do not relish, would appear to be compiling virtually all of the poppler source code into my program, just to give it access to TextOutputDev so that it can be constructed with gTrue as the second parameter. This does not appear to be what libpoppler is meant for.