Map WPTypographicSymbols characters to Unicode equivalents
(Downstream issue: freelawproject/courtlistener#937)
PDFs encountered in the wild sometimes use "WP Typographic Symbols", a font shipped with WordPerfect that allowed various typographic symbols to be rendered at a time when 8-bit character sets ruled the world. This font asks the user to enter
A when they want to display “ (which would now be represented as U+201C left double quotation mark),
@ when they want to display ” (U+201D right double quotation mark), and
= when they want to display ’ (U+2019 right single quotation mark).
This causes problems with legal documents, as the legal field has favored WordPerfect for many years and produced many PDFs using this font. For example, this section of an arbitrary court decision:
pdftotext correctly sees that the document contains A, but it knows nothing about this WordPerfect font, so it cannot tell that the author's intent is to communicate “. This section converts into:
The court Areject[s] the dissent=s attempt@ (slip op. at 11) to distinguish Estelle. The court=s rejection is made without any real discussion of the salient points of the United States Supreme Court=s analysis as it fails to discuss the importance the invited-error doctrine had on the United State Supreme Court=s analysis in refusing to excuse the defendant=s procedural default.
pdftotext should instead output what the author means when they write characters using WP Typographic Symbols:
The court “reject[s] the dissent’s attempt” (slip op. at 11) to distinguish Estelle. The court’s rejection is made without any real discussion of the salient points of the United States Supreme Court’s analysis as it fails to discuss the importance the invited-error doctrine had on the United State Supreme Court’s analysis in refusing to excuse the defendant’s procedural default.
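To make the intended behavior concrete, here is a minimal Python sketch of the remapping. It is only an illustration, not the patch itself: the (font name, text) run structure, the function names, and the font-name matching rule are assumptions, and the table holds only the three characters discussed in this issue. The important point is that the substitution applies only to characters set in WP Typographic Symbols, so an ordinary A, @, or = in any other font is left alone.

```python
# Hypothetical sketch: covers only the three mappings named in this issue.
WP_TYPOGRAPHIC_SYMBOLS = {
    "A": "\u201c",  # left double quotation mark
    "@": "\u201d",  # right double quotation mark
    "=": "\u2019",  # right single quotation mark
}

_TABLE = str.maketrans(WP_TYPOGRAPHIC_SYMBOLS)


def is_wp_typographic(font_name):
    """Assumed matching rule: ignore spaces and case, and tolerate a
    PDF subset prefix such as 'ABCDEF+WPTypographicSymbols'."""
    normalized = font_name.replace(" ", "").lower()
    return "wptypographicsymbols" in normalized


def remap_runs(runs):
    """Join (font_name, text) runs, translating only the runs that are
    set in WP Typographic Symbols."""
    return "".join(
        text.translate(_TABLE) if is_wp_typographic(font) else text
        for font, text in runs
    )


# Illustrative runs modeled on the excerpt above; the body font name is made up.
example_runs = [
    ("TimesNewRoman", "The court "),
    ("WPTypographicSymbols", "A"),
    ("TimesNewRoman", "reject[s] the dissent"),
    ("WPTypographicSymbols", "="),
    ("TimesNewRoman", "s attempt"),
    ("WPTypographicSymbols", "@"),
]
print(remap_runs(example_runs))
```

Keying the substitution on the font rather than on the character is what keeps the body text intact: a global search-and-replace of A, @, and = would corrupt every ordinary use of those characters.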
I intend to clean up that patch and open a merge request which will reference this issue. I'm opening this issue first to distinguish discussion of the problem from discussion of my particular solution.