[TextOutputDev] Better detect fakebold words (!1530) · Merge requests · poppler / poppler

Draft TODO:

TLDR: Improves fakebold detection for "upright" and rotated documents significantly ~~, also for some rotated texts~~.

Currently, the fakebold detection code only handles multiple instances of identical words, but fails in several cases:

- the last instance includes, e.g., a trailing colon or dot
- the bold word is "quoted", i.e. the first instance includes the opening quote and the last instance the closing quote
- a non-bold and a bold word are joined with a hyphen

To handle these cases, the matching logic is extended to also recognize partial matches between overlapping words.

To avoid a performance penalty, the first check is a bounding box overlap test, as it is fast and a necessary condition for every fakebold case.
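As a rough illustration of such a pre-filter, here is a minimal sketch of an axis-aligned overlap test. The `Box` struct and the `boxesOverlap` name are hypothetical and chosen for this example only; they are not the types or functions used in TextOutputDev, and rotated text would need the boxes expressed in a common orientation first.

```cpp
// Hypothetical axis-aligned bounding box (illustration only,
// not poppler's actual word representation).
struct Box
{
    double xMin, yMin, xMax, yMax;
};

// Returns true if the two boxes share an area of non-zero size.
// Boxes that merely touch at an edge or corner do not count.
bool boxesOverlap(const Box &a, const Box &b)
{
    const bool overlapX = a.xMin < b.xMax && b.xMin < a.xMax;
    const bool overlapY = a.yMin < b.yMax && b.yMin < a.yMax;
    return overlapX && overlapY;
}
```

The test costs four comparisons per pair, so the more expensive text comparison only needs to run for word pairs that pass it.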

For the remaining pairs, a number of possible matches is evaluated:

Identical words are discarded as before, while for partial matches the overlapping part is discarded from the second word.
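The discard rule can be sketched as follows. This is an illustration under stated assumptions, not the code from this merge request: `stripDuplicate` and `kMinOverlap` are hypothetical names, words are treated as plain byte strings, and the minimum overlap of two characters is an arbitrary choice to avoid matching on a single shared character. It is meant to run only on pairs that already passed the bounding box test.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>

// Assumed minimum number of shared characters for a partial match.
constexpr std::size_t kMinOverlap = 2;

// Returns what remains of `second` once the part duplicating `first`
// is removed, or std::nullopt if the words do not match at all.
// An empty result means `second` is dropped entirely.
std::optional<std::string> stripDuplicate(const std::string &first,
                                          const std::string &second)
{
    if (first.empty() || second.empty()) {
        return std::nullopt;
    }
    // Identical words: the second instance is discarded as a whole.
    if (first == second) {
        return std::string();
    }
    // Partial match: find the longest suffix of `first` that is also
    // a prefix of `second`, and remove that prefix from `second`.
    const std::size_t maxLen = std::min(first.size(), second.size());
    for (std::size_t n = maxLen; n >= kMinOverlap; --n) {
        if (first.compare(first.size() - n, n, second, 0, n) == 0) {
            return second.substr(n);
        }
    }
    return std::nullopt;
}
```

Under these assumptions, `stripDuplicate("Note", "Note:")` leaves the trailing colon, a pair like `"bold` followed by `bold"` leaves only the closing quote, and a hyphenated `plain-bold` followed by `bold` drops the second word entirely. Whether the hyphenated case actually appears in this word order in real documents is an assumption of the sketch.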

Edited Apr 29, 2024 by StefanBruens
