"pdfimages -list" gets confused with nested images

I am using pdfimages 0.86.1, the version that ships with Ubuntu 20.04.3.

I have a PDF file (unfortunately with copyrighted contents) that has been produced by a commercial scanning and OCR software solution.

Each scanned page yields a PDF page with 3 pictures. It seems that the software identifies the text areas and saves them as a single high-resolution monochrome image. This image is missing those areas identified as not text, for example, drawings or colour pictures. Those missing areas land in 2 separate pictures.

The result is a very small PDF file that can still be OCR'ed very accurately. I have seen another commercial OCR product that does a similar thing.

By the way, does this method of breaking up a scanned picture for OCR and space-saving purposes have a name? I couldn't find any open-source tool like OCRmyPDF or Ghostscript that is able to do that separation. Normally, you get one picture per page, so you do not manage to shrink the PDFs so much.

When you view the scanned PDF, you do not realise that there are 3 pictures per page. I guess the images are transparent and stacked on top of each other.

Command "pdfimages -list" shows:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     827  1169  gray    1   8  jpeg   no         4  0   101   100 43.2K 4.6%
   1     1 image     827  1169  gray    1   8  jpeg   no         6  0   101   100 11.0K 1.2%
   1     2 mask     2481  3508  -       1   1  jpeg   no         6  0   301   300 11.0K 1.0%
[... entries for other PDF pages ...]

Note that "object" value 6 appears twice (for images 1 and 2), which I think should not happen.
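A minimal sketch of how such duplicates could be spotted automatically, assuming the listing keeps the 16-column layout shown above. The function name `duplicated_objects` is only illustrative, and the rows are copied from the table rather than read from a real run.

```python
from collections import defaultdict

# Rows copied from the listing above (header and separator lines omitted).
LISTING = """\
   1     0 image     827  1169  gray    1   8  jpeg   no         4  0   101   100 43.2K 4.6%
   1     1 image     827  1169  gray    1   8  jpeg   no         6  0   101   100 11.0K 1.2%
   1     2 mask     2481  3508  -       1   1  jpeg   no         6  0   301   300 11.0K 1.0%
"""

def duplicated_objects(listing):
    """Return {(object, generation): [image numbers]} for IDs listed more than once."""
    seen = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Assumption: a data row has 16 fields and starts with a page number.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        num = int(fields[1])
        obj, gen = int(fields[10]), int(fields[11])
        seen[(obj, gen)].append(num)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}

print(duplicated_objects(LISTING))
```

For the three rows above this reports object 6 (generation 0) as shared by images 1 and 2.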

If you extract with "pdfimages -all", you get:

51.766 -000.jpg
19.231 -001.jpg
30.995 -002.ccitt
    20 -002.params
[... files for other PDF pages ...]

Note that image 2 is of type CCITT, and not JPEG as stated in the table above.
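The same comparison can be sketched in a few lines of Python. The mapping from file extension to the name used in the "enc" column is my assumption and covers only the two types seen here. The helper name `mismatches` is illustrative.

```python
import re

# Encodings reported by the listing above, keyed by image number.
listed_enc = {0: "jpeg", 1: "jpeg", 2: "jpeg"}

# File names produced by the extraction above (the .params side file is left out).
extracted = ["-000.jpg", "-001.jpg", "-002.ccitt"]

# Assumed mapping from file extension to the name used in the "enc" column.
EXT_TO_ENC = {"jpg": "jpeg", "ccitt": "ccitt"}

def mismatches(listed, files):
    """Return [(image number, listed enc, extracted enc)] where the two disagree."""
    result = []
    for name in files:
        match = re.search(r"-(\d+)\.(\w+)$", name)
        if not match:
            continue
        num, ext = int(match.group(1)), match.group(2)
        actual = EXT_TO_ENC.get(ext, ext)
        if listed.get(num) != actual:
            result.append((num, listed.get(num), actual))
    return result

print(mismatches(listed_enc, extracted))
```

With the values above, only image 2 is flagged: listed as jpeg, extracted as ccitt.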

I do not know much about PDF files, but I recently learned that you can dump a PDF with the command "dumppdf -a". I then searched the dump for the object IDs, and this is what I found:

<object id="6">
<stream>
<props>
<dict size="10">
<key>BitsPerComponent</key>
<value><number>8</number></value>
<key>ColorSpace</key>
<value><literal>DeviceGray</literal></value>
<key>Filter</key>
<value><list size="2">
<literal>FlateDecode</literal>
<literal>DCTDecode</literal>
</list></value>
<key>Height</key>
<value><number>1169</number></value>
<key>Length</key>
<value><number>11222</number></value>
<key>Mask</key>
<value><ref id="5" /></value>
<key>Name</key>
<value><literal>image_fg1</literal></value>
<key>Subtype</key>
<value><literal>Image</literal></value>
<key>Type</key>
<value><literal>XObject</literal></value>
<key>Width</key>
<value><number>827</number></value>
</dict>
</props>
</stream>
</object>

This object ID 6 probably corresponds to file "-001.jpg".

Note that another object is referenced like this:

<value><ref id="5" /></value>

That referenced object is defined as follows:

<object id="5">
<stream>
<props>
<dict size="10">
<key>BitsPerComponent</key>
<value><number>1</number></value>
<key>DecodeParms</key>
<value><dict size="2">
<key>Columns</key>
<value><number>2481</number></value>
<key>K</key>
<value><number>-1</number></value>
</dict></value>
<key>Filter</key>
<value><literal>CCITTFaxDecode</literal></value>
<key>Height</key>
<value><number>3508</number></value>
<key>ImageMask</key>
<value><number>True</number></value>
<key>Length</key>
<value><number>30995</number></value>
<key>Name</key>
<value><literal>image_sel1</literal></value>
<key>Subtype</key>
<value><literal>Image</literal></value>
<key>Type</key>
<value><literal>XObject</literal></value>
<key>Width</key>
<value><number>2481</number></value>
</dict>
</props>
</stream>
</object>

So that object is probably the one that generates files "-002.ccitt" and "-002.params".
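To follow such "Mask" references without searching the dump by hand, something like the sketch below might help. It uses only Python's standard XML parser on a trimmed copy of the two objects above. The `<pdf>` wrapper element and the helper names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two objects above, wrapped in a root element so that the
# fragment parses as one document (the wrapper name is an assumption).
DUMP = """
<pdf>
<object id="6"><stream><props><dict>
<key>Filter</key>
<value><list size="2"><literal>FlateDecode</literal><literal>DCTDecode</literal></list></value>
<key>Mask</key>
<value><ref id="5" /></value>
</dict></props></stream></object>
<object id="5"><stream><props><dict>
<key>Filter</key>
<value><literal>CCITTFaxDecode</literal></value>
</dict></props></stream></object>
</pdf>
"""

def stream_dict(obj):
    """Map each top-level <key> of an object's stream dictionary to its <value>."""
    children = list(obj.find("./stream/props/dict"))
    return {k.text: v for k, v in zip(children[0::2], children[1::2])}

def filters(value):
    """Collect the filter names stored in a Filter <value> element."""
    return [lit.text for lit in value.iter("literal")]

def mask_links(dump):
    """Return [(image id, image filters, mask id, mask filters)] for every Mask ref."""
    root = ET.fromstring(dump)
    objects = {o.get("id"): stream_dict(o) for o in root.iter("object")}
    links = []
    for obj_id, entries in objects.items():
        if "Mask" in entries:
            mask_id = entries["Mask"].find("ref").get("id")
            links.append((obj_id, filters(entries["Filter"]),
                          mask_id, filters(objects[mask_id]["Filter"])))
    return links

print(mask_links(DUMP))
```

For the fragment above it pairs object 6 (FlateDecode, DCTDecode) with its mask, object 5 (CCITTFaxDecode).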

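I guess that this kind of "nesting" between object ID 6 and object ID 5 is confusing "pdfimages -list", which generates a table with incorrect information for image number 2: it repeats the object ID (6), encoding (jpeg) and size (11.0K) of image 1, whereas object 5 is CCITT-encoded and 30995 bytes long.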

"pdfimages -all", however, does not get confused: it extracts the expected files with the expected image types.
