poppler issues
https://gitlab.freedesktop.org/poppler/poppler/-/issues
2023-11-24T13:42:01Z

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1106
pdfimages: options to add a container to JBIG2 and CCITT data
2023-11-24T13:42:01Z
Shai4she

The current extraction of an embedded JBIG2 stream does not include any header (like `0xFF 0xD8` for JPEG), at least when there is no global data. The JBIG2 output produced by [jbig2enc](https://github.com/agl/jbig2enc) adds a header to indicate that this is a JBIG2 file. Although this might not be standardized, it would be helpful to add such a header so that the output can be passed to subsequent applications like `img2pdf`.
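To make the request concrete, here is a minimal sketch of what "adding a header" could look like. It assumes the file header layout from the JBIG2 specification (an 8-byte ID string followed by a flags byte) and that the embedded stream can be treated as sequentially organized segments. The function name `wrapJbig2` is hypothetical and not part of pdfimages; a JBIG2Globals stream, if present, would have to be placed before the page data, and some decoders may also expect end-of-page or end-of-file segments that this sketch does not add.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch only: prepend a JBIG2 file header to an embedded segment stream.
// wrapJbig2 is a hypothetical helper, not a pdfimages or poppler function.
std::vector<std::uint8_t> wrapJbig2(const std::vector<std::uint8_t> &embedded)
{
    assert(!embedded.empty()); // nothing to wrap otherwise

    // 8-byte ID string that marks a standalone JBIG2 file.
    static const std::uint8_t kMagic[8] = {0x97, 0x4A, 0x42, 0x32,
                                           0x0D, 0x0A, 0x1A, 0x0A};
    // Flags: bit 0 set = sequential organization,
    //        bit 1 set = number of pages unknown (so no page-count field follows).
    const std::uint8_t kFlags = 0x03;

    std::vector<std::uint8_t> out;
    out.reserve(sizeof(kMagic) + 1 + embedded.size());
    out.insert(out.end(), kMagic, kMagic + sizeof(kMagic));
    out.push_back(kFlags);
    out.insert(out.end(), embedded.begin(), embedded.end());
    return out;
}
```

The output is the input stream with 9 extra bytes in front; whether a given consumer such as `img2pdf` accepts it without further segments is an open assumption.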
Similarly for CCITT: it seems better to have an option to wrap the CCITT data in a TIFF file without any conversion, adding only a container to facilitate subsequent processing. Unlike the `-tiff` option, it would not convert everything else to TIFF, nor perform any conversion between different types of TIFF.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1188
pdfimages -all incorrectly extracts DCT encoded CMYK images
2021-12-26T21:01:22Z
Stirling Westrup

Currently, if pdfimages (as of v21.12.0) is given a .pdf file that contains DCT-compressed CMYK images, together with the flag '-all', it extracts those images as .jpg files with mangled colors. However, if given the flag '-tiff', it correctly extracts those images as CMYK .tiff files.
pdfimages should either respond to '-all' by producing CMYK .jpg files with the correct colors (preferred) or should produce .tif files for these images.
The attached file is page-2 of a free-to-download product from Heroic Maps, which illustrates the issue.
[interior-02.pdf](/uploads/c4fa4af617f3ece16ecd4dfc0e562cba/interior-02.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/248
pdfimages failed to list/extract image
2020-01-20T12:43:03Z
Bugzilla Migration User

## Submitted by a24..@...co.jp
Assigned to **poppler-bugs**
**[Link to original bug (#91734)](https://bugs.freedesktop.org/show_bug.cgi?id=91734)**
## Description
Created attachment 117873
PDF that pdfimages failed to detect image
pdfimages failed to detect any images in the attached PDF file.
The PDF file contains one /Subtype /Image object.
It can be extracted with mutool (1.7a), and even pdftohtml, which is part of poppler, generates a PNG for that part.
$ pdfimages -v
pdfimages version 0.35.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
$ pdfimages -list sample.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
$ egrep -C 3 '/Subtype /Image' sample.pdf
12 0 obj
<<
/Subtype /Image
/ColorSpace /DeviceCMYK
/Width 32
/Height 32
**Attachment 117873**, "PDF that pdfimages failed to detect image":
[sample.pdf](/uploads/012fdd6473514a330bfa096885acd320/sample.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/726
Pdf that is made of scans has all images with the wrong dimensions
2019-02-27T22:39:29Z
mirh

[File](https://fordham.bepress.com/cgi/viewcontent.cgi?article=1005&context=phil_babich).
[pdftohtml](/uploads/529cbfb07e4f6cc9b21a0b40f12e2c13/pdftohtml.png)
[expected](/uploads/c1fefcb7b652eecb56e95ecee99532d9/expected.png)
Xpdf, Firefox, Sumatra, and a fairly old version of Evince that I found all seem to display it correctly.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/663
Clearer wording in documentation
2018-11-08T10:00:05Z
Jean

`Usage: pdfimages [options] <PDF-file> <image-root>`
isn't self-explanatory enough to understand at first read that <image-root> can EITHER be a `path/image-prefix-name` OR an `image-prefix-name` only.
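A small sketch may help show why both forms work. It assumes the `image-root-NNN.ext` naming scheme described in the pdfimages man page, where the root is used purely as a string prefix; `imageFileName` is a hypothetical helper written for illustration, not a function from poppler.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch only: build an output name as "<image-root>-NNN.<ext>".
// Because the root is just a prefix, it may or may not contain a directory part.
std::string imageFileName(const std::string &imageRoot, int imgNum,
                          const std::string &ext)
{
    assert(imgNum >= 0);
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "-%03d.", imgNum);
    return imageRoot + suffix + ext;
}
```

With this scheme, a root of `img` would give names like `img-000.png` in the current directory, while `out/scans/img` would give `out/scans/img-012.jpg`. The sketch only builds the name and does not create any directories.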
Only by coincidence, or by looking at online examples, can you figure this out.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/662
Extract images from selected type only
2018-11-07T21:36:18Z
Jean

Using pdfimages version 0.71.0
This is more of a feature request than a bug.
It'd be handy to be able to extract a specific type only, like an option `-t image/stencil/etc.` that would only extract when type matches.
It would also be useful to be able to specify which picture to extract within a page.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/452
pdfimages should extract resolution information
2018-10-11T20:19:30Z
Bugzilla Migration User

## Submitted by Torsten Bronger
Assigned to **poppler-bugs**
**[Link to original bug (#38549)](https://bugs.freedesktop.org/show_bug.cgi?id=38549)**
## Description
Currently, the dpi resolution of images extracted by pdfimages is lost. pdfimages should embed the original resolution in the extracted JPEGs. Additionally, TIFF should be an alternative to PBM/PPM since TIFFs can contain resolution information, too.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/510
pdfimages 0.62 extract image at low resolution than embedded in PDF
2018-10-11T20:17:54Z
Bugzilla Migration User

## Submitted by Valerio Messina
Assigned to **poppler-bugs**
**[Link to original bug (#104684)](https://bugs.freedesktop.org/show_bug.cgi?id=104684)**
## Description
Created attachment 136828
sample PDF with 4 pages
Using pdfimages to extract the images from the attached 4-page PDF generates tens of small useless files and 4 real images, but even those images are at very low resolution, so the text is unreadable.
$ pdfimages -all FPGA_CQFP352adapter_Aldec_orig.pdf FPGA_CQFP352adapter_Aldec
platform:
Linux64 and Win64
**Attachment 136828**, "sample PDF with 4 pages":
[FPGA_CQFP352adapter_Aldec_orig.pdf](/uploads/e864ea34e5141e468188394fe7866310/FPGA_CQFP352adapter_Aldec_orig.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/600
pdfimages extracts lots of same images with the same object number.
2018-10-11T08:57:16Z
Bugzilla Migration User

## Submitted by 石印
Assigned to **poppler-bugs**
**[Link to original bug (#99883)](https://bugs.freedesktop.org/show_bug.cgi?id=99883)**
## Description
Created attachment 129787
problem file
I have a PDF file for which pdfimages lists a large number of images with the same object number. These images are identical. There are only about a thousand pictures with different object numbers, but pdfimages lists more than 256,000 items. pdfimages then extracts every listed picture, most of which are duplicates, so the total size of the output is huge. I have uploaded the PDF, and my simple patch is below (it may not be good, but it works :D).
From 237f4e0887eff2f22d5542dfed33fa94a8c7b0ff Mon Sep 17 00:00:00 2001
From: Ryan <ryanorz@126.com>
Date: Tue, 21 Feb 2017 16:11:53 +0800
Subject: [PATCH] Fix(poppler-utils): pdfimages extract too many same pictures
with the same object number.
---
utils/ImageOutputDev.cc | 8 ++++++++
utils/ImageOutputDev.h | 2 ++
2 files changed, 10 insertions(+)
diff --git a/utils/ImageOutputDev.cc b/utils/ImageOutputDev.cc
index 5de51ad..26bf95b 100644
--- a/utils/ImageOutputDev.cc
+++ b/utils/ImageOutputDev.cc
@@ -442,6 +442,14 @@ void ImageOutputDev::writeImageFile(ImgWriter *writer, ImageFormat format, const
void ImageOutputDev::writeImage(GfxState *state, Object *ref, Stream *str,
int width, int height,
GfxImageColorMap *colorMap, GBool inlineImg) {
+ if (ref->isRef()) {
+ const Ref imageRef = ref->getRef();
+ if (refNums.find(imageRef.num) != refNums.end())
+ return;
+ else
+ refNums.insert(imageRef.num);
+ }
+
ImageFormat format;
if (dumpJPEG && str->getKind() == strDCT &&
diff --git a/utils/ImageOutputDev.h b/utils/ImageOutputDev.h
index a694bbc..89c67ac 100644
--- a/utils/ImageOutputDev.h
+++ b/utils/ImageOutputDev.h
@@ -35,6 +35,7 @@
#endif
#include <stdio.h>
+#include <set>
#include "goo/gtypes.h"
#include "goo/ImgWriter.h"
#include "OutputDev.h"
@@ -173,6 +174,7 @@ private:
int pageNum; // current page number
int imgNum; // current image number
GBool ok; // set up ok?
+ std::set<int> refNums;
};
#endif
--
2.10.2
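The idea in the patch can also be sketched on its own, outside poppler. The version below is a standalone illustration, not the patch itself: `ImageRef` and `SeenImages` are hypothetical names, and unlike the patch it keys on both the object number and the generation number, on the assumption that two references with the same number but different generations should count as different objects.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical stand-in for a PDF indirect reference.
struct ImageRef {
    int num; // object number
    int gen; // generation number
};

// Remembers which image objects have already been written.
class SeenImages {
public:
    // Returns true the first time a reference is seen, false on repeats.
    bool markSeen(const ImageRef &ref)
    {
        assert(ref.num >= 0 && ref.gen >= 0);
        // std::set::insert reports via .second whether the key was new.
        return seen.insert(std::make_pair(ref.num, ref.gen)).second;
    }

private:
    std::set<std::pair<int, int>> seen;
};
```

A caller would skip writing whenever `markSeen` returns false. One consequence of this design is that an image drawn on several pages is written only once, which may or may not be what a user of per-page extraction expects.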
**Attachment 129787**, "problem file":
[Linuxå__æ__å__å__æ__é___ä__æ__ç__v3.0_.pdf](/uploads/6679256d9842f8e250fbf39d91064ce1/Linuxå__æ__å__å__æ__é___ä__æ__ç__v3.0_.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/526
exported images do not have metadata
2018-10-11T08:54:33Z
Bugzilla Migration User

## Submitted by pdknsk
Assigned to **poppler-bugs**
**[Link to original bug (#96939)](https://bugs.freedesktop.org/show_bug.cgi?id=96939)**
## Description
In particular, ICC profiles and XMP. While XMP isn't essential, ICC profiles are needed for color accuracy.
For JPEGs, it's relatively easy to patch a JFIF header into the file with the ICC profile.
https://www.w3.org/Graphics/JPEG/jfif3.pdf
http://www.color.org/newiccspec.pdf (87)
Of course this doesn't solve the problem for other image types.
At the very least, a warning should be printed to alert the user about this.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/608
pdfimages adds a superfluous 0.5 to image ppi shown using -list (patch provided)
2018-10-11T08:20:10Z
Bugzilla Migration User

## Submitted by fre..@..et.com
Assigned to **poppler-bugs**
**[Link to original bug (#104861)](https://bugs.freedesktop.org/show_bug.cgi?id=104861)**
## Description
Adding 0.5 to a double before formatting with "%5.0f" results in a rounding error. Presumably it was originally done before truncating to int.
--- utils/ImageOutputDev.cc.old 2018-01-30 16:38:42.179170000 +0200
+++ utils/ImageOutputDev.cc 2018-01-30 16:39:13.506750000 +0200
@@ -234,13 +234,13 @@
double *mat = state->getCTM();
double width2 = mat[0] + mat[2];
double height2 = mat[1] + mat[3];
- double xppi = fabs(width*72.0/width2) + 0.5;
- double yppi = fabs(height*72.0/height2) + 0.5;
+ double xppi = fabs(width*72.0/width2);
+ double yppi = fabs(height*72.0/height2);
if (xppi < 1.0)
printf("%5.3f ", xppi);
else
printf("%5.0f ", xppi);
if (yppi < 1.0)
printf("%5.3f ", yppi);
else
printf("%5.0f ", yppi);
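The effect is easy to reproduce in isolation. The sketch below mirrors the two format strings from the patch in a hypothetical helper, `formatPpi`, and relies on the fact that the `%.0f` conversion already rounds to the nearest integer under the default rounding mode, so adding 0.5 beforehand can push a value up to the next integer.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch only: format a ppi value the way the patched code does,
// with no extra 0.5 added before formatting.
std::string formatPpi(double ppi)
{
    assert(ppi >= 0.0);
    char buf[64];
    if (ppi < 1.0) {
        std::snprintf(buf, sizeof(buf), "%5.3f", ppi);
    } else {
        // "%5.0f" rounds to the nearest integer by itself.
        std::snprintf(buf, sizeof(buf), "%5.0f", ppi);
    }
    return buf;
}
```

For a true value of 150.2, `formatPpi(150.2)` yields `  150`, whereas the old behaviour corresponds to `formatPpi(150.2 + 0.5)`, which yields `  151`.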