pdftohtml loses some double lls in duplicate check
Submitted by Chris Faulhaber
Assigned to poppler-bugs
Problem: In some PDF documents, two adjacent l characters overlap slightly, and pdftohtml drops the second one. For example, "called" becomes "cal ed", "all" becomes "al ", and "eventually" becomes "eventual y".
Reason: In HtmlOutputDev.cc, class HtmlPage, method coalesce, there is a section of code that discards duplicate text for "fake boldface, drop shadows." The overlapping l characters trigger this duplicate check and are therefore removed from the output.
The debug output shows:
x=139.68000..143.016000 y=626.076000..641.844000 size=15 'l'
x=142.80000..146.136000 y=626.076000..641.844000 size=15 'l'
Due to my inexperience with the project, I can't say what the best solution is. Here are a few options I've considered. If you'd like to suggest a preferred approach, I will implement it and submit a patch. Note, however, that I have no test documents that contain actual duplicate text.
Solution 1: Decrease the fudge factor from 0.2 to 0.1. This may not be reliable and could cause the duplicates which this code was originally meant to discard to resurface. It will, however, let the lls through in my test documents.
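To illustrate why the fudge factor matters for this pair, here is a minimal sketch. It is not the code from HtmlOutputDev.cc: the Box struct and the looksLikeDuplicate function are hypothetical names, and the sketch assumes the tolerance is the fudge factor multiplied by the character height (yMax - yMin), applied to the starting x position and to the y and right-edge differences.

```cpp
#include <cmath>

// Hypothetical stand-in for one text fragment's bounding box.
struct Box {
    double xMin, xMax, yMin, yMax;
};

// Assumed shape of the duplicate test (not copied from poppler): the second
// box counts as a duplicate when it starts within fudge * height of the
// first box and its other edges differ by less than the same tolerance.
// The same-text comparison is omitted; both fragments here are 'l'.
bool looksLikeDuplicate(const Box &a, const Box &b, double fudge) {
    const double tol = (a.yMax - a.yMin) * fudge;
    return b.xMin < a.xMin + tol &&
           std::fabs(b.yMin - a.yMin) < tol &&
           std::fabs(b.yMax - a.yMax) < tol &&
           std::fabs(b.xMax - a.xMax) < tol;
}

// The two boxes from the debug output above.
const Box firstL{139.68, 143.016, 626.076, 641.844};
const Box secondL{142.80, 146.136, 626.076, 641.844};
```

Under that assumption the height is 15.768, so a factor of 0.2 gives a tolerance of about 3.15. The second l starts 3.12 to the right of the first, so looksLikeDuplicate(firstL, secondL, 0.2) returns true and the character would be dropped. With 0.1 the tolerance is about 1.58 and the same call returns false. The margin at 0.2 is narrow, which is consistent with only some documents being affected.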
Solution 2: Make the duplicate check a command-line option. Documents that have both lls and duplicate text will still exhibit errors, though.
Solution 3: Use a different algorithm for determining duplicate text. Perhaps the duplicate check shouldn't drop a character that starts more than halfway across the bounding box of the previous character. In this example, 141.348 is the halfway point of the first character, and 142.8 is beyond that. It seems unlikely for fake boldface or drop shadows to be offset so far from the starting point of their host character.