pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20

Submitted by Daniel Flipo

Assigned to poppler-bugs

Description

Created attachment 134154 PDF file with non-breaking spaces to be preserved

The fix for bug #97399 added the non-breaking spaces U+A0 and U+202F to the function UnicodeIsWhitespace, which holds the list of all spaces used to break lines into words.

As a result, these non-breaking spaces are converted into breakable U+20 spaces by pdftotext. In some cases (ties like Mr Bean, high punctuation in French, etc.) these non-breaking spaces are intentionally added and should be preserved as such in the text or html output.

A pdftotext option that removes these two spaces from UnicodeIsWhitespace would solve the issue.
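
The requested behaviour could look roughly like the sketch below. It is a Python illustration only, not poppler's code: `is_whitespace`, `split_words`, the `keep_nbsp` flag, and the abbreviated list of breaking spaces are all assumptions made for the example.

```python
# Illustrative sketch only: these names and the abbreviated space list are
# hypothetical and do not reproduce poppler's UnicodeIsWhitespace.
BREAKING_SPACES = {0x0009, 0x000A, 0x000C, 0x000D, 0x0020, 0x2002, 0x2003, 0x2009}
NON_BREAKING_SPACES = {0x00A0, 0x202F}


def is_whitespace(ch, keep_nbsp=False):
    """Return True if ch should separate words.

    With keep_nbsp=True, U+00A0 and U+202F are not treated as separators.
    """
    cp = ord(ch)
    if cp in BREAKING_SPACES:
        return True
    return cp in NON_BREAKING_SPACES and not keep_nbsp


def split_words(line, keep_nbsp=False):
    """Split a line into words at every character accepted by is_whitespace."""
    words, current = [], []
    for ch in line:
        if is_whitespace(ch, keep_nbsp):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

With the flag off, "Mr", U+A0, "Bean" is split into two words, so the tie is lost when the words are rejoined with U+20. With the flag on, it stays a single word and the non-breaking space survives.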

I attach a small PDF file with those non-breaking spaces for testing.
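
One possible way to check whether the spaces survive extraction is to count them in the text output. The helper below is a sketch; `count_nbsp` and `NBSP_NAMES` are names invented for this example.

```python
from collections import Counter

# The two non-breaking spaces discussed in this report.
NBSP_NAMES = {"\u00a0": "U+00A0", "\u202f": "U+202F"}


def count_nbsp(text):
    """Count occurrences of U+00A0 and U+202F in extracted text."""
    counts = Counter(ch for ch in text if ch in NBSP_NAMES)
    return {name: counts[ch] for ch, name in NBSP_NAMES.items()}
```

For instance, after running `pdftotext -enc UTF-8 spaces.pdf out.txt`, reading `out.txt` as UTF-8 and passing its contents to `count_nbsp` should report zero for both characters if the conversion described above takes place.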

Attachment 134154, "PDF file with non-breaking spaces to be preserved":
spaces.pdf
