WIP: Ref 888: Enhanced PDF to text parsing to merge boxed information into a single word

Currently, the Poppler pdftotext parser tends to split the contents of filled-in boxes into separate single characters, as explained in issue #888.

Fixes: #888

To fix this, logic has been added to merge consecutive single characters into one word when they lie within a distance threshold of each other.
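As an illustration only, the merging idea could be sketched as follows. This is not the code in this merge request: the `Word` class, the `merge_boxed_characters` function, and the `gap_threshold` and `line_tolerance` values are hypothetical, and the sketch assumes each extracted word arrives in reading order with a bounding box (as in the output of `pdftotext -bbox`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one extracted word and its bounding box."""
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def merge_boxed_characters(
    words: List[Word],
    gap_threshold: float = 10.0,   # assumed max horizontal gap between boxes
    line_tolerance: float = 2.0,   # assumed max vertical offset on one line
) -> List[Word]:
    """Merge runs of single-character words that sit close together on a line."""
    merged: List[Word] = []
    last_is_char_run = False  # True while the last output word is built from single chars

    for word in words:
        is_char = len(word.text) == 1
        if merged and is_char and last_is_char_run:
            prev = merged[-1]
            same_line = abs(word.y_min - prev.y_min) <= line_tolerance
            gap = word.x_min - prev.x_max
            if same_line and 0 <= gap <= gap_threshold:
                # Extend the previous run instead of emitting a new word.
                merged[-1] = Word(
                    prev.text + word.text,
                    prev.x_min,
                    word.x_max,
                    min(prev.y_min, word.y_min),
                    max(prev.y_max, word.y_max),
                )
                continue
        merged.append(word)
        last_is_char_run = is_char

    return merged


if __name__ == "__main__":
    sample = [
        Word("J", 10, 18, 100, 110),
        Word("O", 22, 30, 100, 110),
        Word("H", 34, 42, 100, 110),
        Word("N", 46, 54, 100, 110),
        Word("Name", 100, 140, 100, 110),
    ]
    print([w.text for w in merge_boxed_characters(sample)])
```

Ordinary multi-character words are never merged in this sketch, so only runs of isolated characters, such as letters written into form boxes, are affected. A suitable threshold would depend on the box spacing in the actual PDFs.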

Sample PDF: application_form_undergraduate_2.pdf

Old Parsing Output: old_parsing_output.txt

New Parsing Output: new_parsing_output.txt

Edited by Syed Osama
