pdftotext: page width/height do not match the content coordinates on rotated pages
First of all, thank you for creating poppler, it's awesome. We've been using poppler to extract text from PDFs. There is a small thing (I'm not 100% sure whether it is a bug) that I would like to draw to your attention. In pdftotext, when processing a rotated page, the page width and height do not seem to match the content coordinates (i.e., block, line, word). This behavior has been in pdftotext for many years. Is it intended, or should it be changed? I've provided our customisation below for your reference.
The following block is an example. The page height is 617.569, but the block's yMax is 756.526, which lies outside the page.
<page width="831.606000" height="617.569000">
<flow>
<block xMin="39.594500" yMin="103.476200" xMax="208.965220" yMax="756.526200">
<line xMin="49.654500" yMin="103.476200" xMax="208.328117" yMax="112.116200">
<word xMin="49.654500" yMin="103.476200" xMax="60.796260" yMax="112.116200">St.</word>
<word xMin="69.219300" yMin="103.476200" xMax="91.915620" yMax="112.116200">Luke,</word>
<word xMin="100.354500" yMin="103.476200" xMax="121.380010" yMax="112.116200">after</word>
<word xMin="129.754500" yMin="103.476200" xMax="152.905860" yMax="112.116200">being</word>
<word xMin="161.164500" yMin="103.476200" xMax="184.989342" yMax="112.116200">much</word>
<word xMin="193.264500" yMin="103.476200" xMax="208.328117" yMax="112.116200">im-</word>
</line>
<line xMin="40.474500" yMin="111.886200" xMax="208.357140" yMax="120.526200">
<word xMin="40.474500" yMin="111.886200" xMax="72.018180" yMax="120.526200">pressed</word>
<word xMin="76.694500" yMin="111.886200" xMax="95.959076" yMax="120.526200">with</word>
<word xMin="100.434500" yMin="111.886200" xMax="114.159768" yMax="120.526200">the</word>
<word xMin="119.164500" yMin="111.886200" xMax="152.601259" yMax="120.526200">manner</word>
<word xMin="157.234500" yMin="111.886200" xMax="173.199151" yMax="120.526200">and</word>
<word xMin="177.844500" yMin="111.886200" xMax="208.357140" yMax="120.526200">success</word>
</line>
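The mismatch in the excerpt above can be checked numerically. The following is only a minimal sketch: `BBox` and `fitsInPage` are hypothetical names introduced for illustration and are not part of poppler. It assumes the coordinates have their origin at the top-left corner of the page, and it shows that the block overflows the reported page size but fits once width and height are exchanged.

```cpp
// Hypothetical helper types, not part of poppler.
struct BBox {
    double xMin, yMin, xMax, yMax;
};

// True when the box lies entirely inside a page of the given size
// (assumes the origin is at the top-left corner of the page).
constexpr bool fitsInPage(const BBox &b, double pageWidth, double pageHeight)
{
    return b.xMin >= 0.0 && b.yMin >= 0.0 && b.xMax <= pageWidth && b.yMax <= pageHeight;
}

// Values copied from the <page> and <block> elements in the excerpt.
constexpr BBox block{39.594500, 103.476200, 208.965220, 756.526200};
constexpr double reportedWidth = 831.606000;
constexpr double reportedHeight = 617.569000;

// As reported, the block overflows the page vertically (756.5262 > 617.569).
static_assert(!fitsInPage(block, reportedWidth, reportedHeight), "block should overflow the reported page size");

// With width and height exchanged, the same block fits.
static_assert(fitsInPage(block, reportedHeight, reportedWidth), "block should fit the swapped page size");
```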
We've been swapping the page width and height when the page is rotated by 90 or 270 degrees so that they match the content coordinates.
void printDocBBox(FILE *f, PDFDoc *doc, TextOutputDev *textOut, int first, int last)
{
    double xMin, yMin, xMax, yMax;
    const TextFlow *flow;
    const TextBlock *blk;
    const TextLine *line;

    fprintf(f, "<doc>\n");
    for (int page = first; page <= last; ++page) {
        double wid = useCropBox ? doc->getPageCropWidth(page) : doc->getPageMediaWidth(page);
        double hgt = useCropBox ? doc->getPageCropHeight(page) : doc->getPageMediaHeight(page);

        //----------------------------------------//
        // Veridian CUSTOMISATION [START]
        //----------------------------------------//
        // displayPage rotates the page back to 0 degrees, but the width/height
        // functions do not take the rotation into account, which creates a
        // discrepancy between the page size and the content coordinates.
        int rot = doc->getPageRotate(page) % 360;
        if (rot == 90 || rot == 270) {
            double tmp = wid;
            wid = hgt;
            hgt = tmp;
        }
        //----------------------------------------//
        // Veridian CUSTOMISATION [END]
        //----------------------------------------//

        fprintf(f, " <page width=\"%f\" height=\"%f\">\n", wid, hgt);
        doc->displayPage(textOut, page, resolution, resolution, 0, !useCropBox, useCropBox, false);
        for (flow = textOut->getFlows(); flow; flow = flow->getNext()) {
            fprintf(f, " <flow>\n");
            for (blk = flow->getBlocks(); blk; blk = blk->getNext()) {
                blk->getBBox(&xMin, &yMin, &xMax, &yMax);
                fprintf(f, " <block xMin=\"%f\" yMin=\"%f\" xMax=\"%f\" yMax=\"%f\">\n", xMin, yMin, xMax, yMax);
                for (line = blk->getLines(); line; line = line->getNext()) {
                    printLine(f, line);
                }
                fprintf(f, " </block>\n");
            }
            fprintf(f, " </flow>\n");
        }
        fprintf(f, " </page>\n");
    }
    fprintf(f, "</doc>\n");
}
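The swap itself can be isolated into a small function, which makes it easy to test without a PDF at hand. This is only a sketch of the idea: `PageSize` and `effectivePageSize` are hypothetical names, not poppler API. It additionally normalises negative rotation values, on the assumption that the rotation passed in might not already be in the range 0 to 359.

```cpp
#include <utility>

// Hypothetical helper type, not part of poppler.
struct PageSize {
    double width;
    double height;
};

// Returns the page size as seen after the page has been rotated upright.
// `rotate` is the page rotation in degrees (a multiple of 90).
inline PageSize effectivePageSize(double width, double height, int rotate)
{
    // Normalise to [0, 360) so that, for example, -90 is treated like 270.
    int rot = ((rotate % 360) + 360) % 360;
    if (rot == 90 || rot == 270) {
        // A quarter turn exchanges the horizontal and vertical extents.
        std::swap(width, height);
    }
    return {width, height};
}
```

With the values from the example page, `effectivePageSize(831.606, 617.569, 90)` yields a width of 617.569 and a height of 831.606, which encloses the block whose yMax is 756.5262. Rotations of 0 and 180 leave the size unchanged.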