pdftohtml: generates XML with mismatched <b>/<i> tags
Input: Section_4.pdf
Output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="21.06.1">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="23" family="TimesNewRomanPSMT" color="#000000"/>
<fontspec id="1" size="23" family="TimesNewRomanPS" color="#000000"/>
<fontspec id="2" size="23" family="TimesNewRomanPS" color="#000000"/>
<fontspec id="3" size="23" family="TimesNewRomanPS" color="#000000"/>
<text top="114" left="108" width="114" height="22" font="0">Section 4.3 </text>
<text top="114" left="222" width="194" height="22" font="1"><b>S<i>ecurities, Act, Etc</b>.</i></text>
<text top="114" left="416" width="6" height="22" font="0"> </text>
</page>
</pdf2xml>
The problem appears to be in coalesce, which decides whether to combine two strings. In the given input, the "S" of "Securities" is bold only, the rest ("ecurities, Act, Etc") is bold and italic, and the final "." is italic but no longer bold.
Because all of these characters have the same font family name (TimesNewRomanPS), which is checked in hfont1->isEqualIgnoreBold(*hfont2), the strings ("S" and "ecurities, Act, Etc.") are combined. The resulting markup has mismatched tags: <b> is closed while <i> is still open.
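To make the expected behavior concrete, here is a minimal standalone sketch of one way to emit properly nested tags for runs whose bold and italic ranges overlap. It is not poppler's code: the `Run` struct and `emitNested` function are hypothetical names invented for this illustration, and the strategy (close every open tag on each style change, then reopen what the next run needs) is only one possible fix.

```cpp
#include <string>
#include <vector>

// Hypothetical style run (not a poppler type): text plus its style flags.
struct Run {
    std::string text;
    bool bold;
    bool italic;
};

// Emit runs so that <b> and <i> are always properly nested.
// <b> is always the outer tag and <i> the inner one. On every style
// change, all open tags are closed (inner first) and the tags needed
// by the next run are reopened. XML escaping of the text is omitted.
std::string emitNested(const std::vector<Run> &runs) {
    std::string out;
    bool bold = false;
    bool italic = false;
    for (const Run &r : runs) {
        if (r.bold != bold || r.italic != italic) {
            if (italic) out += "</i>";
            if (bold) out += "</b>";
            if (r.bold) out += "<b>";
            if (r.italic) out += "<i>";
            bold = r.bold;
            italic = r.italic;
        }
        out += r.text;
    }
    // Close whatever is still open at the end of the line.
    if (italic) out += "</i>";
    if (bold) out += "</b>";
    return out;
}
```

For the three runs in this report ("S" bold, "ecurities, Act, Etc" bold and italic, "." italic), this sketch produces `<b>S</b><b><i>ecurities, Act, Etc</i></b><i>.</i>`, which is well-formed. It closes and reopens <b> more often than strictly necessary, trading compact output for simplicity.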