Map WPTypographicSymbols characters to Unicode equivalents
(Downstream issue: freelawproject/courtlistener#937)
PDFs encountered in the wild sometimes use a font called "WP Typographic Symbols", a font shipped with WordPerfect which allowed various typographic symbols to be rendered at a time when 8-bit character sets ruled the world. This font asks the user to enter A when they want to display “ (which would now be represented as U+201C left double quotation mark), @ when they want to display ”, = when they want to display ’, etc.
This causes problems with legal documents, as the legal field has favored WordPerfect for many years and produced many PDFs using this font. For example, this section of an arbitrary court decision:
pdftotext correctly sees that the document contains A, but it knows nothing about this WordPerfect font, so it doesn't know that the author's intent is to communicate “. This section converts into:
The court Areject[s] the dissent=s attempt@ (slip op. at 11) to
distinguish Estelle. The court=s rejection is made without any real
discussion of the salient points of the United States Supreme Court=s
analysis as it fails to discuss the importance the invited-error doctrine
had on the United State Supreme Court=s analysis in refusing to
excuse the defendant=s procedural default.
pdftotext should instead output what the author means when they write characters using WP Typographic Symbols:
The court “reject[s] the dissent’s attempt” (slip op. at 11) to
distinguish Estelle. The court’s rejection is made without any real
discussion of the salient points of the United States Supreme Court’s
analysis as it fails to discuss the importance the invited-error doctrine
had on the United State Supreme Court’s analysis in refusing to
excuse the defendant’s procedural default.
I went through the font to make a table mapping symbols in this font to their Unicode counterparts, and I created a patch special-casing this font. This patch produced the output above.
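To illustrate the idea, here is a minimal sketch in Python. It is not the patch itself, which targets pdftotext's own code. It assumes the extracted text is available as (font name, text) runs, and it includes only the three mappings named in this issue; the full table covers more of the font. The names `WP_TYPOGRAPHIC_SYMBOLS`, `is_wp_typographic`, and `runs_to_text` are hypothetical, as is the "Times New Roman" body font in the sample data.

```python
# Partial table: only the three mappings mentioned in this issue.
WP_TYPOGRAPHIC_SYMBOLS = {
    "A": "\u201c",  # left double quotation mark
    "@": "\u201d",  # right double quotation mark
    "=": "\u2019",  # right single quotation mark
}

# Build the translation table once; str.translate needs code point keys.
_TABLE = str.maketrans(WP_TYPOGRAPHIC_SYMBOLS)


def is_wp_typographic(font_name):
    """Guess whether a font name refers to WP Typographic Symbols.

    Assumption: the name may appear with or without spaces, and may carry
    a subset prefix, so spaces are dropped and a substring match is used.
    """
    return "wptypographicsymbols" in font_name.replace(" ", "").lower()


def runs_to_text(runs):
    """Join (font_name, text) runs, remapping only the special-cased font."""
    parts = []
    for font_name, text in runs:
        if is_wp_typographic(font_name):
            text = text.translate(_TABLE)
        parts.append(text)
    return "".join(parts)


# Hypothetical runs modeled on the first line of the example above.
runs = [
    ("Times New Roman", "The court "),
    ("WP Typographic Symbols", "A"),
    ("Times New Roman", "reject[s] the dissent"),
    ("WPTypographicSymbols", "="),
    ("Times New Roman", "s attempt"),
    ("WPTypographicSymbols", "@"),
]
print(runs_to_text(runs))
```

The mapping is applied per font rather than to the whole text, so an ordinary A, @, or = set in any other font passes through unchanged.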
I intend to clean up that patch and open a merge request which will reference this issue. I'm opening this issue first to distinguish discussion of the problem from discussion of my particular solution.