
Opened Apr 23, 2020 by Vitaliy Stepanov (@vitaliy.stepanov)

pdftotext mistakenly recognizes a numbered paragraph as 2 separate text columns

When converting a PDF to text with the -bbox option, some words are emitted in the wrong order.

pdftotext -bbox in.pdf out.txt

The page starts like:

        (e)   However, it was not big enough and they
both got wet;
        (f)   Day had broken cold and grey, exceedingly
cold and grey, when the man turned aside from the main
Yukon trail and climbed the high earth- bank, where a dim
and little-travelled trail led eastward through the fat
spruce timberland. It was a steep bank, and he paused for
breath at the top, excusing the act to himself by looking
at his watch.

Result:

  <page width="612.000000" height="792.000000">
    <word xMin="144.000000" yMin="72.488246" xMax="157.293044" yMax="85.747089">(e)</word>
    <word xMin="72.000000" yMin="86.288247" xMax="105.087550" yMax="99.547091">both</word>
    <word xMin="108.237573" yMin="86.288247" xMax="118.214643" yMax="99.547091">got</word>
    <word xMin="121.089283" yMin="86.288247" xMax="166.383594" yMax="99.547091">wet;</word>
    <word xMin="180.000000" yMin="72.488246" xMax="228.735813" yMax="85.747089">However,</word>
    <word xMin="231.969677" yMin="72.488246" xMax="278.643293" yMax="85.747089">it</word>
    <word xMin="281.887310" yMin="72.488246" xMax="327.868256" yMax="85.747089">was</word>
    <word xMin="330.946415" yMin="72.488246" xMax="341.043258" yMax="85.747089">not</word>
    <word xMin="344.277216" yMin="72.488246" xMax="366.241527" yMax="85.747089">big</word>
    <word xMin="369.596988" yMin="72.488246" xMax="398.873472" yMax="85.747089">enough</word>
    <word xMin="402.107336" yMin="72.488246" xMax="432.777301" yMax="85.747089">and</word>
    <word xMin="524.767264" yMin="72.488246" xMax="539.509278" yMax="85.747089">they</word>
    <word xMin="144.000000" yMin="112.088243" xMax="155.965485" yMax="125.347086">(f)</word>
    <word xMin="180.000000" yMin="112.088243" xMax="203.926853" yMax="125.347086">there</word>
    <word xMin="206.691778" yMin="112.088243" xMax="214.678609" yMax="125.347086">is</word>
    ...
  </page>

Expected result: words 2-4 (both, got, wet;), which have yMin="86.288247", should appear after the words with yMin="72.488246" and before the words with yMin="112.088243".
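The expected ordering can be illustrated with a small Python sketch. It is not how pdftotext lays out text; it is only a post-processing workaround that assumes a single-column page where all words on a line share exactly the same yMin, as they do in the output above. The `SAMPLE` fragment reuses a few `<word>` elements copied from this report, and `reading_order` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A few <word> elements copied from the output above,
# in the order pdftotext emitted them.
SAMPLE = """<page width="612.000000" height="792.000000">
  <word xMin="144.000000" yMin="72.488246" xMax="157.293044" yMax="85.747089">(e)</word>
  <word xMin="72.000000" yMin="86.288247" xMax="105.087550" yMax="99.547091">both</word>
  <word xMin="108.237573" yMin="86.288247" xMax="118.214643" yMax="99.547091">got</word>
  <word xMin="121.089283" yMin="86.288247" xMax="166.383594" yMax="99.547091">wet;</word>
  <word xMin="180.000000" yMin="72.488246" xMax="228.735813" yMax="85.747089">However,</word>
  <word xMin="524.767264" yMin="72.488246" xMax="539.509278" yMax="85.747089">they</word>
  <word xMin="144.000000" yMin="112.088243" xMax="155.965485" yMax="125.347086">(f)</word>
</page>"""


def reading_order(page_xml):
    """Return the word texts sorted top-to-bottom, then left-to-right.

    Assumption: words on the same line have identical yMin values,
    so sorting on (yMin, xMin) is enough for a single-column page.
    """
    page = ET.fromstring(page_xml)
    words = [
        (float(w.get("yMin")), float(w.get("xMin")), w.text)
        for w in page.iter("word")
    ]
    words.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in words]


ordered = reading_order(SAMPLE)
print(" ".join(ordered))
```

With this ordering, "both got wet;" lands between the first line and the "(f)" line. The sketch works on the bare `<page>` fragment shown in this report; a complete output file may wrap the pages in further markup and need namespace handling, and a genuinely multi-column page would be scrambled by this sort.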

It happens because pdftotext recognizes the first 2 lines as side-by-side text (2 flows with one block in each of them) rather than as one simple block (paragraph). I guess this is because (xMax of the second line) < (xMin of the word "However," in the first line).
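The arithmetic behind that guess can be checked against the reported coordinates. The Python sketch below only confirms that the inequality holds for this page; it does not reproduce poppler's layout logic, and whether this gap is what actually triggers the two-flow split remains an assumption. The variable names are illustrative.

```python
# (xMin, xMax) pairs copied from the -bbox output above.
first_line = {
    "(e)": (144.000000, 157.293044),
    "However,": (180.000000, 228.735813),
}
second_line = {
    "both": (72.000000, 105.087550),
    "got": (108.237573, 118.214643),
    "wet;": (121.089283, 166.383594),
}

# Right edge of the second line ("both got wet;").
second_line_xmax = max(x_max for _, x_max in second_line.values())

# Left edge of the paragraph body in the first line.
however_xmin = first_line["However,"][0]

# Horizontal gap between the end of line 2 and the start of "However,".
gap = however_xmin - second_line_xmax
second_line_ends_before_however = second_line_xmax < however_xmin

print(f"second line xMax = {second_line_xmax}")
print(f"'However,' xMin  = {however_xmin}")
print(f"condition holds  = {second_line_ends_before_however}")
```

The second line ends roughly 13.6 points to the left of where "However," begins, so the short continuation line never overlaps horizontally with the body of the first line.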

Edited May 05, 2020 by Vitaliy Stepanov
Reference: poppler/poppler#908