pdftotext incorrect characters
When I extract text from a PDF file, the output contains a lot of strange characters that are nowhere near the actual text in the PDF (many Unicode control characters, for example). Unfortunately I can't share the PDF, but here is some information that could be helpful:
- The PDF was generated by Quartz on macOS, probably using the preview feature.
- I've tracked down the version in which this issue was introduced: 21.03
- The same happens with pdftohtml.
- Fonts are embedded, so it's not due to missing OS fonts.
In this version the following change was introduced: "Fix parsing text in some broken pdf files". Could this be a consequence of that bug fix? Is it expected that incorrect characters are sometimes sent to the output?
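Since I can't share the file, here is a rough way to quantify the problem when comparing builds. This is only a sketch: the helper names `control_char_ratio` and `extract_text` are my own, and it assumes `pdftotext` is on the PATH and accepts `-` as the output file to write to stdout. It counts characters in Unicode category `Cc`, ignoring ordinary whitespace and the form feed that pdftotext emits between pages.

```python
import subprocess
import unicodedata

# Whitespace control characters that are normal in pdftotext output.
ALLOWED_CONTROLS = {"\n", "\r", "\t", "\f"}


def control_char_ratio(text: str) -> float:
    """Return the fraction of characters that are unexpected control characters."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    )
    return bad / len(text)


def extract_text(pdf_path: str, pdftotext_bin: str = "pdftotext") -> str:
    """Run pdftotext on pdf_path and return its output as a string."""
    # "-" as the output file name sends the extracted text to stdout.
    result = subprocess.run(
        [pdftotext_bin, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Running `control_char_ratio(extract_text("file.pdf", pdftotext_bin=...))` once against a build older than 21.03 and once against 21.03 or later should show whether the ratio jumps between the two.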
Let me know if I can provide any more information, thanks for reading!