pdftohtml -xml: Type 3 fonts get no size / height in the result
I am getting a zero-height font and bounding box when converting a PDF with embedded Type 3 fonts. The root cause of the incorrect height is clear: it is called out in a comment in the code, and these fonts do not include an embedded 'm' character, so the current hack fails. See https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/utils/HtmlOutputDev.cc#L306 :
...
// This is a hack which makes it possible to deal with some Type 3
// fonts. The problem is that it's impossible to know what the
// base coordinate system used in the font is without actually
// rendering the font. This code tries to guess by looking at the
// width of the character 'm' (which breaks if the font is a
// subset that doesn't contain 'm').
for (code = 0; code < 256; ++code) {
    if ((name = ((Gfx8BitFont *)font)->getCharName(code)) && name[0] == 'm' && name[1] == '\0') {
        break;
    }
}
if (code < 256) {
    w = ((Gfx8BitFont *)font)->getWidth(code);
    if (w != 0) {
        // 600 is a generic average 'm' width -- yes, this is a hack
        fontSize *= w / 0.6;
    }
}
...
Should I extend this hack to check more characters, so it has a better chance of setting fontSize, or would it be better to derive fontSize from the glyphs themselves?
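To make the first option concrete, here is a minimal standalone sketch of what an expanded hack might look like. It is not poppler code: GlyphInfo, ReferenceGlyph, kReferenceGlyphs and estimateType3Scale are hypothetical names, and every typical width other than the 0.6 for 'm' (taken from the existing hack) is an untested placeholder that would need tuning. The idea is to collect a width ratio for every reference glyph the subset does contain and take the median, so one odd glyph does not dominate.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical stand-in for what getCharName(code) / getWidth(code)
// return for one character code in the quoted snippet.
struct GlyphInfo {
    const char *name; // may be nullptr when the code has no glyph name
    double width;     // advance width as reported by the font
};

// Hypothetical reference entry: glyph name and an assumed typical width.
struct ReferenceGlyph {
    std::string_view name;
    double typicalWidth;
};

// 0.6 for 'm' comes from the existing hack. All other values are
// placeholder guesses for illustration only, not measured data.
constexpr std::array<ReferenceGlyph, 6> kReferenceGlyphs = { {
    { "m", 0.6 },
    { "n", 0.45 },
    { "o", 0.45 },
    { "e", 0.45 },
    { "a", 0.45 },
    { "zero", 0.45 },
} };

// Returns the median of width / typicalWidth over all reference glyphs
// present in the font, or std::nullopt if none of them is present.
std::optional<double> estimateType3Scale(const std::vector<GlyphInfo> &glyphs)
{
    std::vector<double> ratios;
    for (const GlyphInfo &g : glyphs) {
        if (g.name == nullptr || g.width <= 0) {
            continue; // unnamed code or unusable width
        }
        for (const ReferenceGlyph &ref : kReferenceGlyphs) {
            if (ref.name == std::string_view(g.name)) {
                assert(ref.typicalWidth > 0);
                ratios.push_back(g.width / ref.typicalWidth);
                break;
            }
        }
    }
    if (ratios.empty()) {
        return std::nullopt;
    }
    std::sort(ratios.begin(), ratios.end());
    const std::size_t mid = ratios.size() / 2;
    if (ratios.size() % 2 == 1) {
        return ratios[mid];
    }
    return (ratios[mid - 1] + ratios[mid]) / 2.0;
}
```

In the quoted code this would replace the single 'm' lookup: fill the vector from the 256 codes, and multiply fontSize by the result only when a value is returned. When only 'm' is present the behaviour matches the current hack (w / 0.6). It still fails for subsets that contain none of the reference glyphs, which is where measuring the glyphs themselves would be needed.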