WIP: Ref 888: Enhanced PDF to text parsing to merge boxed information into a single word
requested to merge syedosamaanwer/poppler:ENHANCEMENT_pdf_to_text_merge_boxed_text_spacing into master
Currently, Poppler's pdftotext parser tends to split information entered in filled boxes into multiple separate characters, as explained in issue #888. #Fixes: #888
To fix this, logic has been added to merge single characters into the same word if they lie within the spacing threshold.
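The idea can be illustrated with a minimal Python sketch. This is not the code in this merge request (Poppler itself is C++); the `Word` class, the `merge_boxed_chars` function, and the `max_gap` parameter are hypothetical names, and the sketch assumes the threshold is a maximum horizontal gap between neighbouring single characters on the same line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical stand-in for an extracted word and its horizontal extent.
    text: str
    x_min: float
    x_max: float
    from_single_chars: bool = False  # True if built only from 1-char pieces


def merge_boxed_chars(words: List[Word], max_gap: float) -> List[Word]:
    """Merge runs of single-character words whose gap is at most max_gap.

    Assumes all words lie on the same text line.
    """
    merged: List[Word] = []
    for w in sorted(words, key=lambda w: w.x_min):
        is_single = len(w.text) == 1
        prev = merged[-1] if merged else None
        if (prev is not None and is_single and prev.from_single_chars
                and w.x_min - prev.x_max <= max_gap):
            # Close enough to the previous single-character run: extend it.
            prev.text += w.text
            prev.x_max = max(prev.x_max, w.x_max)
        else:
            # Start a new word; copy so the input list is left untouched.
            merged.append(Word(w.text, w.x_min, w.x_max, is_single))
    return merged


if __name__ == "__main__":
    line = [Word("J", 0, 8), Word("O", 10, 18), Word("H", 20, 28),
            Word("N", 30, 38), Word("Smith", 60, 100)]
    print([w.text for w in merge_boxed_chars(line, max_gap=3.0)])
```

With a gap of 2 units between the boxed letters and `max_gap=3.0`, the four letters collapse into one word while the ordinary multi-character word is left alone. A smaller `max_gap` keeps the letters separate, which is why the choice of threshold matters.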
Sample PDF: application_form_undergraduate_2.pdf
Old Parsing Output: old_parsing_output.txt
New Parsing Output: new_parsing_output.txt
Edited by Albert Astals Cid