[TextOutputDev] Better detect fakebold words
Draft TODO:
- Handle rotated pages/blocks correctly
- Add test cases (!1536 (merged))
- Make "fuzzy" matches more strict (better position checks)
- More real-world testing
TLDR: Significantly improves fakebold detection for "upright" and rotated documents.
Currently, the fakebold detection code only handles multiple instances of identical words, but fails in several cases:
- the last instance includes e.g. a trailing colon or dot
- the bold word is "quoted", i.e. the first instance includes the opening quote, the last instance the closing quote.
- a non-bold and a bold word are joined with a hyphen
To handle these cases, the matching logic is extended to also accept partial matches between overlapping words.
To avoid a performance penalty, the first check is a bounding box overlap test, as it is fast and a requirement for every fakebold case.
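As a rough illustration of such a pre-filter, here is a minimal sketch of an axis-aligned overlap test. The `BBox` struct and the `boxesOverlap` name are hypothetical and not taken from TextOutputDev; the sketch also assumes upright boxes, so rotated text would first need its boxes expressed in a common orientation.

```cpp
#include <algorithm>

// Hypothetical axis-aligned bounding box; the word type used by
// TextOutputDev stores its extents differently.
struct BBox
{
    double xMin, yMin, xMax, yMax;
};

// Returns true if the two boxes share a region of non-zero area.
// Boxes that merely touch along an edge do not count as overlapping.
bool boxesOverlap(const BBox &a, const BBox &b)
{
    const double w = std::min(a.xMax, b.xMax) - std::max(a.xMin, b.xMin);
    const double h = std::min(a.yMax, b.yMax) - std::max(a.yMin, b.yMin);
    return w > 0.0 && h > 0.0;
}
```

Only a handful of comparisons are needed per pair, so the more expensive text comparison runs only for words that could actually be drawn on top of each other.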
For the remaining pairs, several possible matches are evaluated:
- same words, as previously
- check if one word is the prefix of the other
- check if the tail of one word is the prefix of the other
Identical words are discarded as before, while for partial matches the overlapping part is discarded from the second word.
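The three checks can be sketched as follows. This is a simplified illustration, not the code from this change: it works on `std::string_view` instead of the Unicode arrays poppler uses, all names (`MatchKind`, `Match`, `classify`, `remainder`) are hypothetical, and it does no position checks, so very short overlaps would still need the stricter handling mentioned in the TODO.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

enum class MatchKind { None, Identical, Prefix, TailPrefix };

// Kind of match plus the overlapping range [pos, pos + len) in the second word.
struct Match
{
    MatchKind kind;
    std::size_t pos;
    std::size_t len;
};

// Length of the longest tail of `a` that is also a prefix of `b`.
static std::size_t tailPrefixLen(std::string_view a, std::string_view b)
{
    for (std::size_t n = std::min(a.size(), b.size()); n > 0; --n) {
        if (a.substr(a.size() - n) == b.substr(0, n)) {
            return n;
        }
    }
    return 0;
}

// Classify a pair of words whose bounding boxes already overlap.
Match classify(std::string_view first, std::string_view second)
{
    if (first.empty() || second.empty()) {
        return { MatchKind::None, 0, 0 };
    }
    if (first == second) {
        return { MatchKind::Identical, 0, second.size() };
    }
    // One word is the prefix of the other, e.g. "Word" and "Word:".
    if (second.substr(0, first.size()) == first) {
        return { MatchKind::Prefix, 0, first.size() };
    }
    if (first.substr(0, second.size()) == second) {
        return { MatchKind::Prefix, 0, second.size() };
    }
    // The tail of one word is the prefix of the other (quoted or hyphenated words).
    if (const std::size_t n = tailPrefixLen(first, second)) {
        return { MatchKind::TailPrefix, 0, n };
    }
    if (const std::size_t n = tailPrefixLen(second, first)) {
        return { MatchKind::TailPrefix, second.size() - n, n };
    }
    return { MatchKind::None, 0, 0 };
}

// What is left of the second word after the overlapping part is discarded.
std::string remainder(std::string_view second, const Match &m)
{
    std::string out(second);
    out.erase(m.pos, m.len);
    return out;
}
```

In this sketch, `classify("Word", "Word:")` reports a prefix match and `remainder` leaves `":"`, while a quoted pair such as `"Word` and `Word"` is a tail/prefix match that leaves only the closing quote.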