Some text ignored in "poppler_page_get_text_layout"
In Evince issue https://gitlab.gnome.org/GNOME/evince/-/issues/1943,
a user provided two mostly identical PDFs: test-bad.pdf and test-good.pdf. In the bad one, some items are not properly selected during search. I ran Evince in debug mode to show the borders of the different elements found, and in the "bad" PDF (right), we can see that some elements are not selected.
I used pdftotext -bbox-layout to try to identify the differences between the two PDFs, but the only thing I found was the following (some elements were also ordered differently, but I reordered them to reduce the diff, since they were otherwise identical):
@@ -62,9 +62,10 @@
</block>
</flow>
<flow>
- <block xMin="-0.242505" yMin="376.978798" xMax="38.611765" yMax="464.909000">
- <line xMin="-0.242505" yMin="376.978798" xMax="38.611765" yMax="464.909000">
- <word xMin="-0.242505" yMin="376.978798" xMax="38.611765" yMax="464.909000">DRAFT</word>
+ <block xMin="-0.666505" yMin="312.566050" xMax="38.187765" yMax="529.340000">
+ <line xMin="-0.666505" yMin="312.566050" xMax="38.187765" yMax="529.340000">
+ <word xMin="-0.666505" yMin="441.409798" xMax="38.187765" yMax="529.340000">DRAFT</word>
+ <word xMin="-0.666505" yMin="312.566050" xMax="38.187765" yMax="432.891747">VERSION</word>
</line>
</block>
</flow>
which told me very little about the issue. I did further debugging in poppler and arrived at poppler_page_get_text_layout_for_area. There, getSelectionWords provides completely different results for the two PDFs, with 10 lines for the "good" one and 5 lines for the "bad" one. I do not understand how such a small difference in the layout can produce such different results, especially when some text that does not change at all is also left unselected. I would be happy to debug this further, but I would like some pointers on how layout changes affect text selection, since I am clearly missing something.
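
For anyone who wants to reproduce the comparison, the manual reordering I did before diffing the pdftotext -bbox-layout output can be sketched in Python. This is only a rough helper, not part of poppler or Evince: the names local_name, extract_words and diff_words are hypothetical, and it assumes the output is well-formed XML with word elements carrying xMin, yMin, xMax and yMax attributes, as in the diff above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}word'."""
    return tag.rsplit("}", 1)[-1]


def extract_words(xhtml_text):
    """Return (text, (xMin, yMin, xMax, yMax)) for every <word> element.

    The result is sorted by position, so the order in which the
    elements appear in the file no longer matters.
    """
    root = ET.fromstring(xhtml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "word":
            continue
        # Round the coordinates so tiny float noise does not show up as a diff.
        box = tuple(
            round(float(elem.get(key)), 3)
            for key in ("xMin", "yMin", "xMax", "yMax")
        )
        words.append((elem.text or "", box))
    return sorted(words, key=lambda w: (w[1][1], w[1][0], w[0]))


def diff_words(good_xhtml, bad_xhtml):
    """Return the words found only in the good file and only in the bad file."""
    good = set(extract_words(good_xhtml))
    bad = set(extract_words(bad_xhtml))
    return sorted(good - bad), sorted(bad - good)
```

Assuming the two outputs were written with something like pdftotext -bbox-layout test-good.pdf good.html (and the same for test-bad.pdf), calling diff_words on the contents of both files should list only the DRAFT and VERSION words shown in the diff above, if my manual comparison was right.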