pdftotext mistakenly recognizes a numbered paragraph as 2 separate text columns
When converting a PDF to text with the -bbox option, some words are not output in the correct order.
pdftotext -bbox in.pdf out.txt
The page starts like this:
(e) However, it was not big enough and they
both got wet;
(f) Day had broken cold and grey, exceedingly
cold and grey, when the man turned aside from the main
Yukon trail and climbed the high earth- bank, where a dim
and little-travelled trail led eastward through the fat
spruce timberland. It was a steep bank, and he paused for
breath at the top, excusing the act to himself by looking
at his watch.
Result:
<page width="612.000000" height="792.000000">
<word xMin="144.000000" yMin="72.488246" xMax="157.293044" yMax="85.747089">(e)</word>
<word xMin="72.000000" yMin="86.288247" xMax="105.087550" yMax="99.547091">both</word>
<word xMin="108.237573" yMin="86.288247" xMax="118.214643" yMax="99.547091">got</word>
<word xMin="121.089283" yMin="86.288247" xMax="166.383594" yMax="99.547091">wet;</word>
<word xMin="180.000000" yMin="72.488246" xMax="228.735813" yMax="85.747089">However,</word>
<word xMin="231.969677" yMin="72.488246" xMax="278.643293" yMax="85.747089">it</word>
<word xMin="281.887310" yMin="72.488246" xMax="327.868256" yMax="85.747089">was</word>
<word xMin="330.946415" yMin="72.488246" xMax="341.043258" yMax="85.747089">not</word>
<word xMin="344.277216" yMin="72.488246" xMax="366.241527" yMax="85.747089">big</word>
<word xMin="369.596988" yMin="72.488246" xMax="398.873472" yMax="85.747089">enough</word>
<word xMin="402.107336" yMin="72.488246" xMax="432.777301" yMax="85.747089">and</word>
<word xMin="524.767264" yMin="72.488246" xMax="539.509278" yMax="85.747089">they</word>
<word xMin="144.000000" yMin="112.088243" xMax="155.965485" yMax="125.347086">(f)</word>
<word xMin="180.000000" yMin="112.088243" xMax="203.926853" yMax="125.347086">there</word>
<word xMin="206.691778" yMin="112.088243" xMax="214.678609" yMax="125.347086">is</word>
...
</page>
Expected result: words 2-4 (the ones with yMin="86.288247") should appear between the words with yMin="72.488246" and the words with yMin="112.088243".
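To make the expected order concrete, here is a minimal Python sketch that re-sorts the words by line (yMin) and then by xMin. It is only an illustration of the order I expect, not how pdftotext works internally. The names parse_words, reading_order and y_tolerance are made up for this example, and the regular expression assumes the attribute order shown in the output above.

```python
import re

# Matches one <word> element as printed by "pdftotext -bbox" above.
# Assumption: attributes always appear in this order.
WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="([\d.]+)" '
    r'xMax="([\d.]+)" yMax="([\d.]+)">(.*?)</word>'
)

def parse_words(bbox_text):
    """Return (xMin, yMin, xMax, yMax, text) tuples in file order."""
    return [
        (float(x0), float(y0), float(x1), float(y1), text)
        for x0, y0, x1, y1, text in WORD_RE.findall(bbox_text)
    ]

def reading_order(words, y_tolerance=2.0):
    """Group words into lines by yMin, then order each line by xMin."""
    lines = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Same line if yMin is within y_tolerance of the line's first word.
        if lines and abs(lines[-1][0][1] - word[1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [w for line in lines for w in sorted(line, key=lambda w: w[0])]

# A few words copied from the output above, in the order pdftotext emitted them.
sample = """
<word xMin="144.000000" yMin="72.488246" xMax="157.293044" yMax="85.747089">(e)</word>
<word xMin="72.000000" yMin="86.288247" xMax="105.087550" yMax="99.547091">both</word>
<word xMin="108.237573" yMin="86.288247" xMax="118.214643" yMax="99.547091">got</word>
<word xMin="121.089283" yMin="86.288247" xMax="166.383594" yMax="99.547091">wet;</word>
<word xMin="180.000000" yMin="72.488246" xMax="228.735813" yMax="85.747089">However,</word>
<word xMin="524.767264" yMin="72.488246" xMax="539.509278" yMax="85.747089">they</word>
<word xMin="144.000000" yMin="112.088243" xMax="155.965485" yMax="125.347086">(f)</word>
"""

ordered_texts = [w[4] for w in reading_order(parse_words(sample))]
print(" ".join(ordered_texts))
```

This simple sort would of course break real multi-column pages, so it is meant only to show the order expected for this paragraph.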
This happens because pdftotext treats the first two lines as side-by-side text (two flows with one block in each) rather than as a single block (paragraph).
I guess this is because (xMax of the second line) < (xMin of the word "However," in the first line).
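The numbers behind this guess can be checked with a short Python sketch. It only computes the horizontal gap from the coordinates in the output above. Whether pdftotext really uses such a comparison to split the text into columns is my assumption, and horizontal_gap is a made-up helper name.

```python
# (xMin, xMax) of the words on the second line ("both", "got", "wet;"),
# copied from the -bbox output above.
second_line = [
    (72.000000, 105.087550),
    (108.237573, 118.214643),
    (121.089283, 166.383594),
]
# xMin of the word "However," on the first line.
however_x_min = 180.000000

def horizontal_gap(left_spans, right_x_min):
    """Distance from the right edge of left_spans to right_x_min."""
    left_x_max = max(x_max for _, x_max in left_spans)
    return right_x_min - left_x_max

gap = horizontal_gap(second_line, however_x_min)
# A positive gap means the second line ends before "However," begins,
# which is the condition guessed above.
print(f"gap = {gap:.6f}, second line ends before 'However,': {gap > 0}")
```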