Glyphs in PDFs produced by Tesseract OCR render as white boxes when selected
Submitted by James R Barlow
Assigned to poppler-bugs
Created attachment 140931 Test file
Tesseract OCR uses a glyphless font (a font with a single glyph that occupies empty space) in the PDFs it produces.
When PDFs produced by Tesseract are rendered and text is selected, Poppler draws white boxes over the background image that contains the text. The Tesseract team has worked hard on PDF viewer support and compatibility - to my knowledge the Tesseract glyphless font works correctly in Acrobat, Pdfium, PDF.js, macOS Preview, Dropbox PDF Viewer, MuPDF and Ghostscript, with testing on multiple platforms, including mobile. Other PDF viewers do not attempt to render the glyphless font on top of the background.
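For triage, it may help to confirm that a given file's text layer really is meant to be invisible. The sketch below is a rough check, not part of Tesseract or Poppler. It assumes the text layer is written with PDF text rendering mode 3 (invisible) via the `Tr` operator, which is my understanding of what pdfrenderer.cpp emits, and it assumes the content stream has already been decompressed by some other tool. The function names and the sample stream (including the font name `/f-0-0`) are illustrative only, and the regex is naive: it does not skip string literals that happen to contain "Tr".

```python
import re

# Matches the PDF "Tr" (set text rendering mode) operator with its integer operand.
# Naive on purpose: it does not parse string literals or inline images.
_TR_PATTERN = re.compile(rb"(?<![\w.])(\d+)\s+Tr(?![A-Za-z])")


def text_render_modes(content_stream: bytes) -> list:
    """Return every text rendering mode set in an already-decoded content stream."""
    return [int(m.group(1)) for m in _TR_PATTERN.finditer(content_stream)]


def is_invisible_text_only(content_stream: bytes) -> bool:
    """True if the stream sets a rendering mode and every one of them is 3 (invisible)."""
    modes = text_render_modes(content_stream)
    return bool(modes) and all(mode == 3 for mode in modes)


# Hypothetical decoded content stream, shaped like an OCR text layer.
sample = b"BT\n3 Tr\n1 0 0 1 72 700 Tm\n/f-0-0 12 Tf\n[ <0048> ] TJ\nET\n"

if __name__ == "__main__":
    print(text_render_modes(sample))
    print(is_invisible_text_only(sample))
```

If the check reports only mode 3, the text is not supposed to be painted at all, so any white boxes that appear on selection come from how the viewer draws the selection rather than from the page content.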
This was first reported against Evince, which claims the issue is in Poppler. https://gitlab.gnome.org/GNOME/evince/issues/953
See that issue for screenshots as no screenshots can be added easily here.
The design notes of the glyphless font may be relevant. https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp