Super/subscripts break correspondence between poppler_page_get_text and poppler_page_get_text_layout (glib)
According to the glib documentation, the position in the array returned by poppler_page_get_text_layout() represents an offset in the text returned by poppler_page_get_text(). However, this isn't the case if the PDF contains superscripts or subscripts.
I can't program in C, so the Python code below is the closest example I can give (I've also tried another glib wrapper, with the same results). The code is run on this example PDF: popplertest.pdf.
As can be seen, poppler_page_get_text_layout() returns very small regions for the super- and subscripts that have no counterpart in the text returned by poppler_page_get_text(), so all subsequent text has the wrong offset relative to poppler_page_get_text_layout(). Simply removing all such small regions is not a general solution, since in some PDFs they might, for example, be larger than punctuation in a footnote.
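The expected correspondence can be written as a small check. This is only a sketch, separate from the reproduction script: `layout_surplus` and `StubPage` are made-up names, and it assumes a page object whose get_text() returns a str and whose get_text_layout() returns an (ok, rectangles) pair, which is how the PyGObject binding behaves in this report. The stub exists only so the sketch runs without Poppler installed.

```python
def layout_surplus(page):
    """Return len(rectangles) - len(text) for a page; 0 is the expected value.

    Assumption: `page` behaves like a PyGObject Poppler.Page, i.e.
    get_text() -> str and get_text_layout() -> (ok, list_of_rectangles).
    """
    text = page.get_text()
    ok, rects = page.get_text_layout()
    if not ok:
        # Assumed meaning of a False flag: no layout rectangles available.
        rects = []
    return len(rects) - len(text)


class StubPage:
    """Hypothetical stand-in for Poppler.Page, for illustration only."""

    def __init__(self, text, n_rects):
        self._text = text
        self._n_rects = n_rects

    def get_text(self):
        return self._text

    def get_text_layout(self):
        return True, [None] * self._n_rects


# Counts taken from this report: 26 characters, 28 rectangles.
print(layout_surplus(StubPage("x" * 26, 28)))  # 2
```

With a real page, `layout_surplus(document.get_page(0))` should be 0 if the documented correspondence holds; any other value means at least one rectangle has no matching character (or the reverse).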
# PyGObject
## pip install PyGObject # version 3.40.1
## poppler 21.07.0_1 / gir-1.0 / Poppler-0.18.gir
import gi
gi.require_version("Poppler", "0.18")
from gi.repository import Poppler
fileuri = "file:///Users/x/Downloads/popplertest.pdf"
document = Poppler.Document.new_from_file(fileuri)
page = document.get_page(0)
ptext = Poppler.Page.get_text(page)
ptlayout = Poppler.Page.get_text_layout(page)[1]
len(ptext) # 26
len(ptlayout) # 28
help(ptlayout[0])
# Header row; widths match the fixed-width columns printed below.
header = f"{'':4s} {'x1':>8s} {'x2':>8s} {'y1':>8s} {'y2':>8s} {'x2-x1':>8s} {'y2-y1':>8s}"
print(header)
# Pair each character of the text with the rectangle at the same index.
for i in range(len(ptext)):
    r = ptlayout[i]
    print(f"{repr(ptext[i]):4s} {r.x1:8.1f} {r.x2:8.1f} {r.y1:8.1f} {r.y2:8.1f} {r.x2 - r.x1:8.4f} {r.y2 - r.y1:8.2f}")
print(header)
# Rectangles left over once every character has been paired.
for r in ptlayout[len(ptext):]:
    print(f"{'':4s} {r.x1:8.1f} {r.x2:8.1f} {r.y1:8.1f} {r.y2:8.1f} {r.x2 - r.x1:8.4f} {r.y2 - r.y1:8.2f}")
>>>
           x1       x2       y1       y2    x2-x1    y2-y1
'T'     290.0    300.4    172.5    187.7  10.3413    15.20
'e'     300.4    307.6    172.5    187.7   7.1874    15.20
's'     307.6    313.9    172.5    187.7   6.3783    15.20
't'     313.9    320.2    172.5    187.7   6.2888    15.20
'\n'    320.2    320.2    187.7    187.7   0.0000     0.00
'A'     142.7    150.9    248.8    258.5   8.1306     9.63
'B'     150.9    158.5    248.8    258.5   7.6800     9.63
'C'     158.5    166.4    248.8    258.5   7.8327     9.63
'0'     166.4    166.4    248.8    258.5  -0.0003     9.63
'1'     166.4    170.6    252.5    259.5   4.2345     7.08
' '     170.6    174.8    252.5    259.5   4.2345     7.08
'D'     174.8    179.0    252.5    259.5   4.1150     7.08
'E'     179.0    187.2    248.8    258.5   8.2822     9.63
'F'     187.2    194.6    248.8    258.5   7.3789     9.63
'\n'    194.6    201.7    248.8    258.5   7.0778     9.63
'G'     201.7    201.7    258.5    258.5   0.0000     0.00
'H'     125.8    134.3    262.4    272.0   8.5091     9.63
'I'     134.3    142.4    262.4    272.0   8.1306     9.63
'2'     142.4    146.4    262.4    272.0   3.9153     9.63
'3'     146.4    146.4    262.4    272.0   0.0001     9.63
' '     146.4    150.6    260.4    267.5   4.2345     7.08
'J'     150.6    154.8    260.4    267.5   4.2345     7.08
'K'     154.8    158.9    260.4    267.5   4.1150     7.08
'L'     158.9    164.5    262.4    272.0   5.5724     9.63
'\n'    164.5    172.9    262.4    272.0   8.4316     9.63
'1'     172.9    179.7    262.4    272.0   6.7767     9.63
           x1       x2       y1       y2    x2-x1    y2-y1
        179.7    179.7    272.0    272.0   0.0000     0.00
        302.4    307.8    691.5    701.2   5.4240     9.63
>>>
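Reading the output above, the two surplus rectangles appear to be the ones with (near) zero width but full line height, at indices 8 and 19, each directly before a run of smaller (7.08-high) rectangles. The sketch below locates such rectangles so the point where the offsets diverge can be found. It is a diagnostic only, not a fix: `Rect` and `zero_width_rects` are made-up names, the `eps` threshold is arbitrary, and whether this pattern holds for other PDFs is untested. The sample data is the "ABC..." line copied from the output, rounded to one decimal as printed.

```python
from collections import namedtuple

# Hypothetical stand-in for Poppler.Rectangle, fields in the order printed above.
Rect = namedtuple("Rect", "x1 x2 y1 y2")


def zero_width_rects(rects, eps=1e-3):
    """Indices of rectangles with (near) zero width but non-zero height.

    Newline rectangles in the output above have zero width AND zero height,
    so requiring a non-zero height leaves them out.
    """
    return [
        i
        for i, r in enumerate(rects)
        if abs(r.x2 - r.x1) < eps and abs(r.y2 - r.y1) > eps
    ]


# Layout rectangles 5..15 from the output above (the "ABC..." line).
SAMPLE = [
    Rect(142.7, 150.9, 248.8, 258.5),
    Rect(150.9, 158.5, 248.8, 258.5),
    Rect(158.5, 166.4, 248.8, 258.5),
    Rect(166.4, 166.4, 248.8, 258.5),  # zero width, full height
    Rect(166.4, 170.6, 252.5, 259.5),
    Rect(170.6, 174.8, 252.5, 259.5),
    Rect(174.8, 179.0, 252.5, 259.5),
    Rect(179.0, 187.2, 248.8, 258.5),
    Rect(187.2, 194.6, 248.8, 258.5),
    Rect(194.6, 201.7, 248.8, 258.5),
    Rect(201.7, 201.7, 258.5, 258.5),  # zero width and zero height
]

print(zero_width_rects(SAMPLE))  # [3]
```

Since Poppler.Rectangle exposes the same x1, x2, y1, y2 attributes, `zero_width_rects(ptlayout)` should work on the real list as well; index 3 in the sample corresponds to index 8 in the full output.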