pdftohtml -xml fails to extract text that is extracted by pdftotext
Submitted by Petter Reinholdtsen
Assigned to poppler-bugs
Description
When I convert http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf to XML using
pdftohtml -xml -noframes 1.8116520!offentligjournal02052012.pdf
I get a content-less XML file. I find this rather strange, as the PDF is searchable using xpdf, okular and evince. Any idea where the text went? Is there anything I can do to get access to the text as XML?
This is the output I get:
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
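To check output like the above for empty pages automatically, something along these lines might help. It is only a minimal sketch: it assumes the usual pdftohtml -xml layout, where extracted text appears as <text> elements inside each <page>. The function name empty_pages and the sample string are made up for illustration, and the <text> line on page 2 is hypothetical, not taken from the PDF in this report.

```python
import xml.etree.ElementTree as ET


def empty_pages(xml_text):
    """Return the numbers of all <page> elements that contain no non-blank <text>."""
    root = ET.fromstring(xml_text)
    empty = []
    for page in root.iter("page"):
        # A page counts as non-empty if any <text> child carries visible characters.
        has_text = any(
            "".join(t.itertext()).strip() for t in page.iter("text")
        )
        if not has_text:
            empty.append(int(page.get("number")))
    return empty


# Page 1 mirrors the empty pages in this report; page 2 is a hypothetical
# page with one extracted line, included only to show the contrast.
sample = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
<text top="100" left="50" width="80" height="12" font="0">hypothetical line</text>
</page>
</pdf2xml>"""

print(empty_pages(sample))
```

Run against the real output above, a check like this would list all 20 pages as empty, which is the symptom being reported.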
This problem has also been reported to Debian as http://bugs.debian.org/676238