Text extraction should expand ligatures to their normal form
An old Bugzilla bug, "pdftotext and copy-n-paste from a document should expand ligatures such as fi to the letters f and i.", was fixed in 2012 in commit 33615643.
But I can still see such ligatures generated by pdftotext, e.g. on the following PDF file (generated by Ghostscript's ps2pdf): chartest3-gs.pdf
In short, I get "Don’t ﬀ." (with U+FB00 LATIN SMALL LIGATURE FF) instead of "Don’t ff." (with 2 letters "f").
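
To make the expected behaviour concrete, here is a minimal Python sketch of the expansion I have in mind. It is only an illustration and not how pdftotext implements (or should implement) it: the function name `expand_ligatures` is hypothetical, and limiting the expansion to the Latin ligatures U+FB00..U+FB06 is my own assumption.

```python
import unicodedata

# Assumption: only the Latin ligatures U+FB00..U+FB06 are expanded;
# every other character is passed through unchanged.
LATIN_LIGATURES = range(0xFB00, 0xFB07)


def expand_ligatures(text: str) -> str:
    """Replace each Latin ligature by its Unicode compatibility decomposition."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if ord(ch) in LATIN_LIGATURES else ch
        for ch in text
    )


# The text currently extracted from the PDF (U+FB00 written as an escape).
extracted = "Don\u2019t \ufb00."
expected = "Don\u2019t ff."
print(expand_ligatures(extracted) == expected)
```

Normalizing the whole string with NFKC would also expand the ligature, but it rewrites many unrelated characters too, which is why the sketch touches only the ligature code points.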