Improving XML output
When I execute
pdftohtml -xml
with this document: https://gitlab.freedesktop.org/poppler/poppler/uploads/53bbf4ef6dea96782381ffc79765093e/Manual_InCD.pdf
I get a lot of <text> nodes that contain nothing but a single space.
I think it should be easy to check whether a node contains only whitespace before writing it to the XML output, and to skip it in that case. The output for page 1 shows the problem:
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="FKOLKI+Arial,Bold" color="#000000"/>
<fontspec id="1" size="18" family="FKOLKI+Arial,Bold" color="#000000"/>
<fontspec id="2" size="14" family="FKOMCM+Arial" color="#000000"/>
<image top="461" left="463" width="374" height="102" src="manual-1_1.png"/>
<text top="1119" left="108" width="4" height="15" font="0"><b> </b></text>
<text top="191" left="712" width="131" height="23" font="1"><b>User Manual </b></text>
<text top="215" left="837" width="5" height="18" font="2"> </text>
<text top="234" left="837" width="5" height="18" font="2"> </text>
<text top="253" left="837" width="5" height="18" font="2"> </text>
<text top="272" left="837" width="5" height="18" font="2"> </text>
<text top="291" left="837" width="5" height="18" font="2"> </text>
<text top="310" left="837" width="5" height="18" font="2"> </text>
<text top="329" left="837" width="5" height="18" font="2"> </text>
<text top="348" left="837" width="5" height="18" font="2"> </text>
<text top="367" left="837" width="5" height="18" font="2"> </text>
<text top="386" left="837" width="5" height="18" font="2"> </text>
<text top="405" left="837" width="5" height="18" font="2"> </text>
<text top="424" left="837" width="5" height="18" font="2"> </text>
<text top="443" left="837" width="5" height="18" font="2"> </text>
<text top="548" left="837" width="5" height="18" font="2"> </text>
<text top="564" left="837" width="5" height="18" font="2"> </text>
<text top="583" left="837" width="5" height="18" font="2"> </text>
<text top="602" left="837" width="5" height="18" font="2"> </text>
<text top="621" left="837" width="5" height="18" font="2"> </text>
<text top="639" left="837" width="5" height="18" font="2"> </text>
<text top="658" left="837" width="5" height="18" font="2"> </text>
<text top="677" left="837" width="5" height="18" font="2"> </text>
<text top="696" left="837" width="5" height="18" font="2"> </text>
<text top="715" left="837" width="5" height="18" font="2"> </text>
<text top="734" left="837" width="5" height="18" font="2"> </text>
<text top="753" left="837" width="5" height="18" font="2"> </text>
<text top="772" left="837" width="5" height="18" font="2"> </text>
<text top="791" left="837" width="5" height="18" font="2"> </text>
<text top="810" left="837" width="5" height="18" font="2"> </text>
<text top="829" left="837" width="5" height="18" font="2"> </text>
<text top="848" left="837" width="5" height="18" font="2"> </text>
<text top="867" left="837" width="5" height="18" font="2"> </text>
<text top="886" left="837" width="5" height="18" font="2"> </text>
<text top="905" left="837" width="5" height="18" font="2"> </text>
<text top="942" left="641" width="202" height="23" font="1"><b>Ahead Software AG </b></text>
</page>
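Until a check like that exists in pdftohtml itself, the same filtering can be done as a post-processing step. The following is only a sketch of a workaround, not a patch for poppler: the function name `strip_blank_text_nodes` is made up for illustration, and it assumes the generated XML is small enough to parse in memory with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def strip_blank_text_nodes(xml_string: str) -> str:
    """Remove <text> elements whose content is only whitespace.

    Hypothetical helper for post-processing `pdftohtml -xml` output.
    """
    root = ET.fromstring(xml_string)
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "text":
                continue
            # itertext() also covers text nested in <b>, <i>, etc.
            content = "".join(child.itertext())
            if not content.strip():
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Small excerpt modeled on the page above.
sample = (
    '<page number="1">'
    '<text top="1119" left="108" width="4" height="15" font="0"><b> </b></text>'
    '<text top="191" left="712" width="131" height="23" font="1"><b>User Manual </b></text>'
    '<text top="215" left="837" width="5" height="18" font="2"> </text>'
    '</page>'
)
cleaned = strip_blank_text_nodes(sample)
print(cleaned)
```

On this excerpt only the "User Manual" node should remain. Note that the bold-wrapped blank node is dropped as well, because the check looks at the combined text of the element and its children rather than at the element's direct text alone.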