"pdfimages -list" gets confused with nested images
I am using pdfimages 0.86.1 that comes with Ubuntu 20.04.3.
I have a PDF file (unfortunately with copyrighted contents) that has been produced by a commercial scanning and OCR software solution.
Each scanned page yields a PDF page with 3 pictures. It seems that the software identifies the text areas and saves them as a single high-resolution monochrome image. This image is missing those areas identified as not text, for example, drawings or colour pictures. Those missing areas land in 2 separate pictures.
The result is a very small PDF file that can still be OCR'ed very accurately. I have seen another commercial OCR product that does something similar.
By the way, does this method of breaking up a scanned picture for OCR and space-saving purposes have a name? I couldn't find any open-source tool like OCRmyPDF or Ghostscript that is able to do that separation. Normally, you get one picture per page, so you do not manage to shrink the PDFs so much.
When you view the scanned PDF, you do not realise that there are 3 pictures per page. I guess the images are transparent and stacked on top of each other.
Command "pdfimages -list" shows:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     827  1169  gray    1   8  jpeg   no         4  0   101   100 43.2K 4.6%
   1     1 image     827  1169  gray    1   8  jpeg   no         6  0   101   100 11.0K 1.2%
   1     2 mask     2481  3508  -       1   1  jpeg   no         6  0   301   300 11.0K 1.0%
[... entries for other PDF pages ...]
Note that "object" value 6 appears twice (for image 1 and for image 2), which I think should not happen.
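A duplicate like this can also be spotted automatically. Here is a minimal Python sketch, assuming each data row of the "pdfimages -list" table has the 16 whitespace-separated columns shown above. The function name duplicated_objects is made up for illustration, and the sample text is just the three rows from my file.

```python
from collections import defaultdict

# The three rows from the listing above, used as sample input.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     827  1169  gray    1   8  jpeg   no         4  0   101   100 43.2K 4.6%
   1     1 image     827  1169  gray    1   8  jpeg   no         6  0   101   100 11.0K 1.2%
   1     2 mask     2481  3508  -       1   1  jpeg   no         6  0   301   300 11.0K 1.0%
"""


def duplicated_objects(listing):
    """Return {(object, generation): [(page, num, type, enc), ...]} for
    every object ID that is listed on more than one row."""
    seen = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header and the dashed separator: data rows have
        # 16 fields and start with a page number (assumed layout).
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        page, num, kind, enc = int(fields[0]), int(fields[1]), fields[2], fields[8]
        obj_id = (int(fields[10]), int(fields[11]))
        seen[obj_id].append((page, num, kind, enc))
    return {obj: rows for obj, rows in seen.items() if len(rows) > 1}


if __name__ == "__main__":
    for obj, rows in duplicated_objects(SAMPLE).items():
        print(obj, rows)
```

For a real file, the text passed to the function would be the captured output of "pdfimages -list" for that PDF instead of SAMPLE.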
If you extract with "pdfimages -all", you get:
51.766 -000.jpg
19.231 -001.jpg
30.995 -002.ccitt
20 -002.params
[... files for other PDF pages ...]
Note that image 2 is of type CCITT, and not JPEG as stated in the table above.
I do not know much about PDF files, but I recently learned that you can dump a PDF with the command "dumppdf -a". I then searched the dump for the object IDs, and this is what I found:
<object id="6">
<stream>
<props>
<dict size="10">
<key>BitsPerComponent</key>
<value><number>8</number></value>
<key>ColorSpace</key>
<value><literal>DeviceGray</literal></value>
<key>Filter</key>
<value><list size="2">
<literal>FlateDecode</literal>
<literal>DCTDecode</literal>
</list></value>
<key>Height</key>
<value><number>1169</number></value>
<key>Length</key>
<value><number>11222</number></value>
<key>Mask</key>
<value><ref id="5" /></value>
<key>Name</key>
<value><literal>image_fg1</literal></value>
<key>Subtype</key>
<value><literal>Image</literal></value>
<key>Type</key>
<value><literal>XObject</literal></value>
<key>Width</key>
<value><number>827</number></value>
</dict>
</props>
</stream>
</object>
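Object ID 6 probably corresponds to file "-001.jpg": its Length of 11222 bytes is consistent with the 11.0K size listed for image 1 (the extracted file is larger, presumably because the FlateDecode layer has been removed).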
Note that another object is referenced like this:
<value><ref id="5" /></value>
That referenced object is defined as follows:
<object id="5">
<stream>
<props>
<dict size="10">
<key>BitsPerComponent</key>
<value><number>1</number></value>
<key>DecodeParms</key>
<value><dict size="2">
<key>Columns</key>
<value><number>2481</number></value>
<key>K</key>
<value><number>-1</number></value>
</dict></value>
<key>Filter</key>
<value><literal>CCITTFaxDecode</literal></value>
<key>Height</key>
<value><number>3508</number></value>
<key>ImageMask</key>
<value><number>True</number></value>
<key>Length</key>
<value><number>30995</number></value>
<key>Name</key>
<value><literal>image_sel1</literal></value>
<key>Subtype</key>
<value><literal>Image</literal></value>
<key>Type</key>
<value><literal>XObject</literal></value>
<key>Width</key>
<value><number>2481</number></value>
</dict>
</props>
</stream>
</object>
So that object is probably the one that generates files "-002.ccitt" and "-002.params": its Length of 30995 bytes matches the size of "-002.ccitt" exactly.
I guess that this kind of "nesting" between object ID 6 and object ID 5 is confusing "pdfimages -list", which is generating a table with incorrect information for image number 2.
But "pdfimages -all" does not get confused, for it is extracting the expected files with the expected image types.
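To find every such image/mask pair without searching the dump by hand, something like the following Python sketch might help. It assumes the "dumppdf -a" output is well-formed XML with a single root element (written as pdf here, which is an assumption) and that each image dictionary sits under object/stream/props/dict as in the excerpts above. The function name mask_references is made up, and the sample dump is a shortened version of objects 5 and 6.

```python
import xml.etree.ElementTree as ET

# Shortened version of the two objects quoted above, wrapped in an
# assumed <pdf> root element.
SAMPLE_DUMP = """\
<pdf>
<object id="5">
<stream>
<props>
<dict size="2">
<key>ImageMask</key>
<value><number>True</number></value>
<key>Name</key>
<value><literal>image_sel1</literal></value>
</dict>
</props>
</stream>
</object>
<object id="6">
<stream>
<props>
<dict size="2">
<key>Mask</key>
<value><ref id="5" /></value>
<key>Name</key>
<value><literal>image_fg1</literal></value>
</dict>
</props>
</stream>
</object>
</pdf>
"""


def mask_references(dump_xml):
    """Return {object_id: {"Mask" or "SMask": referenced_object_id}} for
    every stream whose dictionary points at another object as its mask."""
    root = ET.fromstring(dump_xml)
    result = {}
    for obj in root.iter("object"):
        props = obj.find("./stream/props/dict")
        if props is None:
            continue
        children = list(props)
        # <key> and <value> elements alternate inside <dict>.
        entries = {k.text: v for k, v in zip(children[0::2], children[1::2])}
        for name in ("Mask", "SMask"):
            value = entries.get(name)
            if value is None:
                continue
            ref = value.find("ref")  # absent when the mask is not a reference
            if ref is not None:
                result.setdefault(int(obj.get("id")), {})[name] = int(ref.get("id"))
    return result


if __name__ == "__main__":
    print(mask_references(SAMPLE_DUMP))
```

For the two objects above this reports that object 6 uses object 5 as its Mask, which is the relationship that "pdfimages -list" seems to report incorrectly.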