Bug in PdfToHtml
There is a bug in PdfToHtml (version 0.83).
When I execute
pdftohtml -xml input.pdf output.xml
everything works as expected. The output looks like this:
....
<text top="363" left="81" width="413" height="61" font="1">Bla Bla</text>
<text top="422" left="81" width="514" height="40" font="2">Bla Bla</text>
<text top="1131" left="765" width="72" height="16" font="3">Bla Bla</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="4" size="8" family="PalatinoLinotype" color="#ffffff"/>
<fontspec id="5" size="11" family="PalatinoLinotype" color="#000000"/>
<fontspec id="6" size="11" family="Arial" color="#000000"/>
<fontspec id="7" size="6" family="TimesNewRomanPSMT" color="#000000"/>
<fontspec id="8" size="9" family="TimesNewRomanPSMT" color="#000000"/>
<text top="104" left="81" width="5" height="14" font="4">Bla Bla</text>
<text top="144" left="81" width="33" height="18" font="5">Bla Bla</text>
.....
But when I run the same command with the additional parameter -stdout, stray "link to page N" text is written between the pages of the XML output:
....
<text top="363" left="81" width="413" height="61" font="1">Bla Bla</text>
<text top="422" left="81" width="514" height="40" font="2">Bla Bla</text>
<text top="1131" left="765" width="72" height="16" font="3">Bla Bla</text>
</page>
link to page 7 link to page 9 link to page 11 link to page 13 link to page 13
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="4" size="8" family="PalatinoLinotype" color="#ffffff"/>
<fontspec id="5" size="11" family="PalatinoLinotype" color="#000000"/>
<fontspec id="6" size="11" family="Arial" color="#000000"/>
<fontspec id="7" size="6" family="TimesNewRomanPSMT" color="#000000"/>
<fontspec id="8" size="9" family="TimesNewRomanPSMT" color="#000000"/>
<text top="104" left="81" width="5" height="14" font="4">Bla Bla</text>
<text top="144" left="81" width="33" height="18" font="5">Bla Bla</text>
.....
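
Until this is fixed, one possible workaround is to filter the stray lines out of the -stdout output. The following is only a minimal Python sketch, not a fix for PdfToHtml itself. It assumes the stray text always sits on a line of its own, as in the excerpt above; the function names strip_link_lines and pdf_to_clean_xml are my own and purely illustrative.

```python
import re
import subprocess

# Matches a line made up only of repeated "link to page N" fragments,
# which is how the stray text appears in the -stdout output above.
LINK_LINE = re.compile(r"^\s*(?:link to page \d+\s*)+$")


def strip_link_lines(text: str) -> str:
    """Return text with lines consisting only of 'link to page N' removed."""
    kept = [line for line in text.splitlines() if not LINK_LINE.match(line)]
    return "\n".join(kept)


def pdf_to_clean_xml(pdf_path: str) -> str:
    """Run pdftohtml with -xml -stdout and filter the stray lines (hypothetical helper)."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-stdout", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return strip_link_lines(result.stdout)
```

Calling pdf_to_clean_xml("input.pdf") should then give XML without the stray lines. Text inside a <text> element that happens to contain "link to page" is left alone, because only lines consisting entirely of such fragments are dropped.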