Some unusual PDFs hang pdftotext indefinitely with growing memory usage
I occasionally generate PDFs of data automatically from scripts.
One of these PDFs (linked below) is rather complex and takes a few minutes to load in a PDF viewer. Its content is effectively garbage, but as far as I know it's a legitimate PDF file. Unfortunately, when pdftotext is used to extract text from it, it hangs indefinitely while memory usage grows slowly. I've seen usage as high as 3 GB after about an hour.
I appreciate that this PDF is probably very hard to parse, but hanging for that long and using that much memory on a PDF that contains virtually no text seems like unexpected behavior.
You might reasonably ask - why try using pdftotext on this file? Why not simply not do that? The problem is that this issue shows up for downstream poppler users. I actually encountered this problem because KDE's file indexer (Baloo) hangs indefinitely when trying to index this file. Obviously, a file indexer is something that will try to extract text from every file, and in this case it's using poppler to do it.
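For anyone who wants to reproduce this without tying up their machine, or for a downstream caller that wants to bound the damage in the meantime, here is a minimal sketch in Python. It is only an illustration, not something Baloo or poppler does as far as I know. The helper name `run_bounded` is hypothetical, the 60-second timeout and 1 GiB memory cap are arbitrary values I picked, and the memory cap relies on `resource.setrlimit`, so it assumes a Unix-like system.

```python
import resource
import subprocess
import sys


def run_bounded(cmd, timeout_s=60, max_mem_bytes=1 << 30):
    """Run cmd with a wall-clock timeout and a soft address-space cap.

    Returns (status, stdout_bytes), where status is "ok", "failed",
    or "timeout". Hypothetical helper, Unix-only.
    """

    def limit_memory():
        # Runs in the child process just before exec.
        # Only the soft limit is lowered; the hard limit is left alone.
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            cap = max_mem_bytes
        else:
            cap = min(max_mem_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed when this expires
            preexec_fn=limit_memory,
        )
    except subprocess.TimeoutExpired:
        return ("timeout", b"")

    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # "pdftotext FILE -" writes the extracted text to stdout.
    status, text = run_bounded(["pdftotext", sys.argv[1], "-"])
    print(status, len(text))
```

Saved as a script and given the path to the downloaded test file, I would expect this to report either a timeout or a failure (if the memory cap is reached first) instead of hanging, but the actual result depends on the machine and the poppler version.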
Test file: https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
Warning: I suggest downloading this file rather than trying to open it in your browser. If you have a virus scanner or file indexer that uses poppler (or any other affected program) and it tries to open the file automatically, the program may hang as a result.
I am placing the file linked above in the public domain. If this is not legally possible in your jurisdiction, I am licensing the file to you under (your choice of) CC0, Zero-Clause BSD, or GNU Free Documentation License.
Relevant system information:
Poppler: 21.10.0-1
Source: Distribution Package
OS: Arch Linux x86_64