pdftohtml: add image and font extraction
Submitted by Joshua Richardson
Assigned to poppler-bugs
Description
Created attachment 49314 patches to resolve this enhancement
Instead of generating one large background image per page, pdftohtml now generates smaller images, just large enough to capture the graphical elements, and only on pages that actually contain graphical elements.
In addition, the images may optionally be generated at a higher resolution, so that they can be viewed at a larger size, while still showing up at the same size in the generated HTML. We added a new "-dpi" switch for this.
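To illustrate the idea behind the "-dpi" switch, here is a minimal sketch of how the pixel size of an image can be decoupled from its size in the HTML. This is not code from the patches: the function names, the 72-units-per-inch baseline, and the use of width/height attributes on the img tag are all assumptions made for illustration.

```python
def image_sizes(width_pt, height_pt, dpi=72.0):
    """Return (pixel_size, display_size) for a region measured in PDF points.

    Hypothetical helper, not taken from the patches. Assumes 1 pt = 1/72 inch
    and that the HTML display size is one CSS pixel per point.
    """
    # The rendered bitmap grows with the requested resolution.
    pixels = (round(width_pt * dpi / 72.0), round(height_pt * dpi / 72.0))
    # The size shown in the HTML does not depend on dpi at all.
    shown = (round(width_pt), round(height_pt))
    return pixels, shown


def img_tag(src, width_pt, height_pt, dpi=72.0):
    """Build an <img> tag that pins the display size regardless of dpi."""
    _, (w, h) = image_sizes(width_pt, height_pt, dpi)
    return f'<img src="{src}" width="{w}" height="{h}"/>'
```

Under these assumptions, a 200 x 100 pt region rendered at 144 dpi yields a 400 x 200 pixel file, but the tag still asks the browser to draw it at 200 x 100, so the extra pixels only become visible when the reader zooms in.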
Finally, there is a new "-embedfonts" option that makes the generated HTML use extracted fonts. (For now, you'll have to use another utility, such as MuPDF's pdfextract, to actually extract those fonts.)
These features were all built in parallel, so the easiest way to merge them back into the public poppler repository will probably be to apply all the patches together, in order. The tarball also includes other patches from the public repository that landed while we were developing, in the hope that this makes it easier to work out how to apply the patches in order without conflicts. To avoid confusion: the only patches relevant to this enhancement bug are the ones authored by Joshua Richardson or Stephen Reichling.
Attachment 49314, "patches to resolve this enhancement":
img-font-extract-patches.tgz