
[TextOutputDev] Better detect fakebold words

Draft TODO:

  • Handle rotated pages/blocks correctly
  • Add test cases (!1536 (merged))
  • Make "fuzzy" matches more strict (better position checks)
  • More real-world testing

TLDR: Significantly improves fakebold detection for "upright" and rotated documents, and also for some rotated texts.


Currently, the fakebold detection code only handles multiple instances of identical words, but fails in several cases:

  • the last instance includes e.g. a trailing colon or dot
  • the bold word is "quoted", i.e. the first instance includes the opening quote, the last instance the closing quote.
  • a non-bold and a bold word are joined with a hyphen

To handle these cases, the matching logic is extended from exact word equality to partial matches.

To avoid a performance penalty, the first check is a bounding box overlap test, as it is fast and a requirement for every fakebold case.
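As an illustration of why the overlap test is a cheap first filter, here is a minimal sketch of an axis-aligned bounding box check. The `Box` struct and `boxesOverlap` function are hypothetical names chosen for this example and are not taken from TextOutputDev; the actual code may store and compare word extents differently, in particular for rotated text.

```cpp
// Hypothetical axis-aligned box for a word; not poppler's actual data structure.
struct Box
{
    double xMin, yMin, xMax, yMax;
};

// True if the two boxes share a region of non-zero area.
// Four comparisons, no string access, so it is cheap to run on every candidate pair.
bool boxesOverlap(const Box &a, const Box &b)
{
    return a.xMin < b.xMax && b.xMin < a.xMax && a.yMin < b.yMax && b.yMin < a.yMax;
}
```

Boxes that only touch along an edge are treated as non-overlapping here, which is an assumption of this sketch.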

For the remaining pairs, a number of possible matches is evaluated:

  1. same words, as previously
  2. check if one word is the prefix of the other
  3. check if the tail of one word is the prefix of the other

Identical words are discarded as before, while for partial matches the overlapping part is discarded from the second word.
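The three match kinds and the discard rule can be sketched on plain strings as follows. This is only an illustration under stated assumptions: `Match`, `MatchResult`, `tailPrefixLength` and `classify` are hypothetical names, the sketch works on `std::string` rather than on the word objects used by TextOutputDev, and it omits the position checks on the overlapping characters that a stricter "fuzzy" match would need.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical match kinds mirroring the three cases listed above.
enum class Match { None, Identical, Prefix, TailPrefix };

struct MatchResult
{
    Match kind;
    std::string secondRemainder; // what is left of the second word after discarding the overlap
};

// Length of the longest tail of `a` that is also a prefix of `b` (0 if none).
std::size_t tailPrefixLength(const std::string &a, const std::string &b)
{
    const std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t len = maxLen; len > 0; --len) {
        if (a.compare(a.size() - len, len, b, 0, len) == 0) {
            return len;
        }
    }
    return 0;
}

MatchResult classify(const std::string &first, const std::string &second)
{
    if (first.empty() || second.empty()) {
        return { Match::None, second };
    }
    // 1. Same words: the second instance is discarded completely.
    if (first == second) {
        return { Match::Identical, "" };
    }
    // 2a. First word is a prefix of the second, e.g. "word" and "word:".
    if (first.size() <= second.size() && second.compare(0, first.size(), first) == 0) {
        return { Match::Prefix, second.substr(first.size()) };
    }
    // 2b. Second word is a prefix of the first: nothing of the second word remains.
    if (first.compare(0, second.size(), second) == 0) {
        return { Match::Prefix, "" };
    }
    // 3. Tail of the first word is a prefix of the second, e.g. "\"word" and "word\"".
    const std::size_t len = tailPrefixLength(first, second);
    if (len > 0) {
        return { Match::TailPrefix, second.substr(len) };
    }
    return { Match::None, second };
}
```

With these definitions, `classify("word", "word:")` leaves `":"` as the remainder of the second word, and `classify("\"word", "word\"")` leaves only the closing quote. Note that a tail match of a single character is accepted by this sketch, so it is only meaningful for pairs that already passed the bounding box test.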

Edited by StefanBruens
