pdftotext: UTF-16 text without BOM not properly extracted

Submitted by ral..@..te.com

Assigned to poppler-bugs

Description

Created attachment 134881 Sample file

Running pdftotext on the attached sample file yields no usable text. In a hex editor, I can see that the text is stored as UTF-16BE without a BOM. xpdf displays the file correctly.

Tested with version 0.48.0 (Debian Stable) and 0.57.0 (Debian Testing).
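
To illustrate the suspected failure mode, here is a minimal Python sketch. It is not poppler's code; the function name decode_pdf_bytes and the Latin-1 fallback are assumptions made purely for illustration. It shows how a decoder that relies on a leading FE FF marker would misread BOM-less UTF-16BE bytes:

```python
import codecs


def decode_pdf_bytes(raw: bytes) -> str:
    """Hypothetical decoder: treat a leading FE FF as a UTF-16BE BOM."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumed fallback: without a BOM, read the bytes as a single-byte encoding.
    return raw.decode("latin-1")


# "Text" encoded as UTF-16BE without a BOM: b'\x00T\x00e\x00x\x00t'
sample = "Text".encode("utf-16-be")

naive = decode_pdf_bytes(sample)                            # NUL bytes interleaved
with_bom = decode_pdf_bytes(codecs.BOM_UTF16_BE + sample)   # decodes cleanly
explicit = sample.decode("utf-16-be")                       # forcing UTF-16BE also works
```

With the BOM missing, every second character in the naive result is a NUL, which would match the "no usable text" symptom if the extractor behaves like this sketch.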

Attachment 134881, "Sample file":
2004.pdf
