poppler issues
https://gitlab.freedesktop.org/poppler/poppler/-/issues
2024-03-19T15:23:55Z

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1476
Poppler::Page::text not working correctly with RawOrderLayout
2024-03-19T15:23:55Z
StefanBruens

I am trying to get the plain text from a document, in content order.
`Page::text(QRectF{}, Page::PhysicalLayout)` works reasonably well, and is able to extract the complete contents. For `Page::RawOrderLayout`, the results are fairly broken:
- The first, trivial document returns the contents without spaces between words.
- The second, slightly more complex document does not return any text at all.
When using `pdftotext`, with `-raw`, `-layout` or "default", the content is correct.
The missing spaces are likely caused by implementation differences in TextOutputDev between `TextPage::getText` (used by `Poppler::Page::text`) and `TextPage::dump` (used by pdftotext); the latter has some code to insert spaces:
https://gitlab.freedesktop.org/poppler/poppler/-/blame/master/poppler/TextOutputDev.cc?ref_type=heads&page=6#L5391

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1475
Searching for two words only works in single lines with some pdf files
2024-03-16T15:43:24Z
Nelson Benítez León

:arrow_double_down: **This is a copy of https://gitlab.gnome.org/GNOME/evince/-/issues/2001 with some added notes** :arrow_double_down:
### Summary
Searching for two words only works in single lines with some pdf files
### Description
I found that, while searching for two (or more) words, Evince will not show results where the first word is at the end of a line and the second is at the beginning of a new line.
This surely happens with files exported from LibreOffice, but these files can be correctly searched in Okular and Qoppa PDF Studio.
I attached an example pdf. Try searching in it for:
`take steps`
`refused protection`
[evince-search-sample.pdf](/uploads/042e1b9f674b9f8a4663c56e6e873f7d/evince-search-sample.pdf)
### Solution
The problem is that the search code used by Poppler's glib binding, `TextPage::findText()`, currently does not support matching across two lines when the second line falls in the next paragraph. PDF files exported from LibreOffice documents with line spacing > 1.5 are interpreted by Poppler as having each line as a paragraph of its own (due to the line spacing).
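The limitation can be illustrated with a toy model. This is only a sketch and not Poppler's code: the function name `find_phrase`, the nested-list page model, and the `cross_paragraphs` switch are all hypothetical, and matching is plain substring matching rather than Poppler's word-based search.

```python
def find_phrase(paragraphs, phrase, cross_paragraphs=True):
    """Return indices (in flattened line order) of lines where `phrase` starts.

    `paragraphs` is a list of paragraphs, each a list of line strings.
    A line break is treated as a single space when joining two lines.
    """
    # Flatten to (paragraph index, line text) pairs, preserving order.
    lines = [(p, text) for p, para in enumerate(paragraphs) for text in para]
    hits = []
    for i, (p, text) in enumerate(lines):
        if phrase in text:
            hits.append(i)
            continue
        if i + 1 < len(lines):
            next_p, next_text = lines[i + 1]
            # Only join with the next line if it is in the same paragraph,
            # unless matching across paragraph boundaries is enabled.
            if cross_paragraphs or next_p == p:
                joined = text + " " + next_text
                # Count only matches that straddle the line break.
                if phrase in joined and phrase not in next_text:
                    hits.append(i)
    return hits


# Each line is its own paragraph, as with the LibreOffice export described above.
page = [["you should take"], ["steps to appeal"]]
print(find_phrase(page, "take steps", cross_paragraphs=False))
print(find_phrase(page, "take steps", cross_paragraphs=True))
```

With `cross_paragraphs=False` the phrase is missed because the two lines belong to different paragraphs; with it enabled the match is reported on the first line.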
Regardless of whether Poppler's paragraph detection code could be improved, an obvious fix is to make `TextPage::findText()` also match from the last line of a paragraph to the first line of the next paragraph; that is what the submitted MR does.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1474
Okular / Poppler slow to fully render this single-page PDF (takes 10 seconds)
2024-03-08T22:15:52Z
Jeff Fortin Tam

Potentially a bit similar to #1473, but presumably less complex and maybe caused by something different… this document: [invitation_-_sample_from_PDFjs_github_issue_3809.pdf](/uploads/19e32abfbe67d07c07367712c4aab6de/invitation_-_sample_from_PDFjs_github_issue_3809.pdf) (borrowed from https://github.com/mozilla/pdf.js/issues/3809) takes 10 seconds to fully render in Okular 23.08 on Fedora 39 with Wayland.
The image and some of the text appear within roughly 6 seconds, but the rest of the text takes up to the 10-second mark (on my stopwatch) to render.
Evince is similarly affected (except that it only displays something once fully rendered, not in real time).
What it looks like with Sysprof 46:
| Okular 23.08 | Evince 45 |
| - | - |
| ![Sysprof_46_standalone_capture_of_Okular_rendering_German__22invitation_22_sample_-_flame_graph](/uploads/93816bc091baa7a09e1cb5386ceff38a/Sysprof_46_standalone_capture_of_Okular_rendering_German__22invitation_22_sample_-_flame_graph.png) | ![Sysprof_46_standalone_capture_of_Evince_rendering_German__22invitation_22_sample_-_flame_graph](/uploads/5fa2482f1794ba163bb9b815d0405058/Sysprof_46_standalone_capture_of_Evince_rendering_German__22invitation_22_sample_-_flame_graph.png) |

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1473
Okular / Poppler very slow to render the 1st page of MagPi magazine issue 87
2024-03-08T22:06:51Z
Jeff Fortin Tam

[This magazine issue](https://magpi.raspberrypi.com/issues/87) has a publicly available PDF that can be directly downloaded [here](https://magpi.raspberrypi.com/issues/87/pdf/download).
For some reason, it seems the 1st page of that document is particularly heavy, compared to the 2nd page.
With Poppler 23.08.0 on Fedora 39 on Wayland, Evince 45 and Okular 23.08 take about 15-20+ seconds to render the first page at reasonable/normally sized window sizes.
Here is the output of Sysprof 46, showing what happens when opening and loading that document on the 1st page directly:
| Okular 23.08 | Evince 45 |
| - | - |
| ![Sysprof_46_standalone_capture_of_Okular_rendering_the_1st_page_of_MagPi_magazine_issue_87_-_flame_graph](/uploads/aa182101bf7e54b81fe3f7a9dc1f3420/Sysprof_46_standalone_capture_of_Okular_rendering_the_1st_page_of_MagPi_magazine_issue_87_-_flame_graph.png) | ![Sysprof_46_standalone_capture_of_Evince_rendering_the_1st_page_of_MagPi_magazine_issue_87_-_flame_graph](/uploads/c71440197c1b55d408db7f8c788d695f/Sysprof_46_standalone_capture_of_Evince_rendering_the_1st_page_of_MagPi_magazine_issue_87_-_flame_graph.png) |
FWIW, PDF.js, while still slow, is able to render it about twice as fast (corresponding issue [here](https://github.com/mozilla/pdf.js/issues/17785)).

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1470
Does Poppler have a collection of PDF files for testing purposes? Can you share it for testing Poppler?
2024-02-28T22:48:18Z
yuyi

I am currently conducting some detailed tests and hope to receive more test files.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1468
pdftotext should dehyphenate footmisc footnotes
2024-03-04T15:39:20Z
Alex Chalk

When I run `pdftotext file-with-hyphenation.pdf -`, it dehyphenates the text in the main document, but not footnotes created using the package `footmisc`.
n.b. `pdftotext` does the right thing for a regular hyphenated `\footnote`. Seeing as the rendered output of `footmisc` commands is the same (the difference is footmisc pulls the final output from a bibliography), perhaps the code used to print `\footnote` output can just be reused?

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1466
Build on ubuntu 22.04 with mingw fails with INT32 conflict definition
2024-02-18T17:33:52Z
Gregor Kališnik

Hi.
I am trying to cross-compile for windows (w64) with libjpeg-v9f.
Build error:
```
[ 37%] Building CXX object CMakeFiles/poppler.dir/poppler/ImageEmbeddingUtils.cc.obj
In file included from /usr/share/mingw-w64/include/winnt.h:150,
from /usr/share/mingw-w64/include/minwindef.h:163,
from /usr/share/mingw-w64/include/windef.h:9,
from /tmp/build-windows/libs/poppler/src/poppler_external-build/poppler/poppler-config.h:133,
from /tmp/build-windows/libs/poppler/src/poppler_external/poppler/Error.h:32,
from /tmp/build-windows/libs/poppler/src/poppler_external/poppler/Object.h:45,
from /tmp/build-windows/libs/poppler/src/poppler_external/poppler/ImageEmbeddingUtils.cc:27:
/usr/share/mingw-w64/include/basetsd.h:31:22: error: conflicting declaration ‘typedef int INT32’
31 | typedef signed int INT32,*PINT32;
| ^~~~~
In file included from /usr/x86_64-w64-mingw32/include/jpeglib.h:27,
from /tmp/build-windows/libs/poppler/src/poppler_external/poppler/ImageEmbeddingUtils.cc:17:
/usr/x86_64-w64-mingw32/include/jmorecfg.h:165:14: note: previous declaration as ‘typedef long int INT32’
165 | typedef long INT32;
| ^~~~~
```
cmake config used:
```
-DCMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}
-DCMAKE_BUILD_TYPE=release
-DCMAKE_INSTALL_PREFIX=${CMAKE_CURRENT_BINARY_DIR}/${TARGET_NAME}
-DENABLE_BOOST=OFF
-DBUILD_SHARED_LIBS=OFF
-DBUILD_CPP_TESTS=OFF
-DBUILD_GTK_TESTS=OFF
-DBUILD_MANUAL_TESTS=OFF
-DBUILD_QT5_TESTS=OFF
-DBUILD_QT6_TESTS=OFF
-DENABLE_CPP=OFF
-DENABLE_QT5=OFF
-DENABLE_QT6=ON
-DENABLE_ZLIB=OFF
-DENABLE_GLIB=OFF
-DENABLE_GOBJECT_INTROSPECTION=OFF
-DENABLE_LIBCURL=OFF
-DENABLE_LIBOPENJPEG=none
-DENABLE_UTILS=OFF
-DENABLE_DCTDECODER=libjpeg
-DWITH_PNG=ON
-DWITH_TIFF=OFF
-DWITH_NSS3=OFF
```
Tried with poppler versions `21.12.00` and `24.02.00`.
Adding `#include <poppler-config.h>` to the top of `ImageEmbeddingUtils.cc` solved the build issue.
Thank you.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1465
Does not show text of Apple-edited PDFs
2024-02-16T18:09:56Z
Dorla Hutch

I blackened half of the first page using PDF24 which fixed the rendering bug with the first page (text is displayed again unlike for the other pages).
SUMMARY
=======
When the PDF is opened, the hand-written annotations are visible but not the original PDF text (all is white).
Same happens with Firefox **but it is different from Chrome or the renderer that Dolphin uses** where all of the PDF text is visible.
STEPS TO REPRODUCE
==================
1. Annotate a PDF with an Apple tablet device (iPad Pro, 5th Gen, exported in GoodNotes 5)
2. Open the PDF in Okular or Firefox
OBSERVED RESULT
===============
Hand-written annotations are shown, everything else (text, lines) is not
EXPECTED RESULT
===============
Hand-written annotations are shown with everything else
Broken PDF:
===========
[2023_12_13.pdf](/uploads/9c1f29064ae072e3a6bdb399f0d2da23/2023_12_13.pdf)
[2023-12-12_not_broken.pdf](/uploads/311dba8f0d5eee766cf70bc4c64618a7/2023-12-12_not_broken.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1463
pdftocairo - Problem with unembedded CID TrueType font with Identity Encoding
2024-02-14T13:15:49Z
Hakan Usakli

Hello,
The supplied sample file displays fine in Adobe Acrobat, PDF-XChange, Foxit and many other PDF viewers.
It has unembedded CID Fonts.
The following command line, run on 64-bit Windows with Poppler version 23.11, complains about a syntax error in the fonts. The created output is unusable: question marks appear instead of glyphs.
I am guessing the fonts are defined in a gray zone of the PDF specification, but is it reasonable to expect that pdftocairo / the Poppler library could handle these types of files, "refry" and burn in (embed) the fonts properly, and pull a suitable font from the system's font directory (or C:/windows/fonts/)?
```
pdftocairo.exe -pdf "d:\temp\input.pdf" "d:\temp\output_pp.pdf"
Syntax Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array
Syntax Error: non-embedded font using identity encoding: Arial
Syntax Error: non-embedded font using identity encoding: Calibri Light
Syntax Error: non-embedded font using identity encoding: Calibri
Syntax Error: non-embedded font using identity encoding: Arial,Bold
Syntax Error: non-embedded font using identity encoding: Calibri,Bold
```
Thank you and Best Regards
[input.pdf](/uploads/5eb8b5c083a215e898b76b514143fd45/input.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1462
libc++-19: implicit instantiation of undefined template 'std::char_traits<unsigned short>'
2024-02-05T14:35:02Z
Linux User

OS: Gentoo Linux amd64 musl/clang
```
$ clang --version
clang version 19.0.0git78b4e7c5+libcxx
Target: x86_64-gentoo-linux-musl
Thread model: posix
InstalledDir: /usr/lib/llvm/19/bin
Configuration file: /etc/clang/x86_64-gentoo-linux-musl-clang.cfg
```
Compiling `poppler-9999` fails with the following error:
```bash
[269/283] /usr/lib/ccache/bin/clang++ -Dpoppler_cpp_EXPORTS -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999 -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/fofi -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/goo -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/poppler -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build/poppler -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build/cpp -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -O3 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -flto -stdlib=libc++ -Wnon-virtual-dtor -Woverloaded-virtual -std=c++17 -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -MD -MT cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o -MF cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o.d -o cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o -c /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.cpp
FAILED: cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o
/usr/lib/ccache/bin/clang++ -Dpoppler_cpp_EXPORTS -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999 -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/fofi -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/goo -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/poppler -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build/poppler -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp -I/var/tmp/portage/app-text/poppler-9999/work/poppler-9999_build/cpp -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -O3 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -flto -stdlib=libc++ -Wnon-virtual-dtor -Woverloaded-virtual -std=c++17 -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -MD -MT cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o -MF cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o.d -o cpp/CMakeFiles/poppler-cpp.dir/poppler-destination.cpp.o -c /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.cpp
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.cpp:24:
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.h:25:
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-global.h:32:
/usr/include/c++/v1/string:730:43: error: implicit instantiation of undefined template 'std::char_traits<unsigned short>'
730 | static_assert((is_same<_CharT, typename traits_type::char_type>::value),
| ^
/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-global.h:101:43: note: in instantiation of template class 'std::basic_string<unsigned short>' requested here
101 | class POPPLER_CPP_EXPORT ustring : public std::basic_string<unsigned short>
| ^
/usr/include/c++/v1/__fwd/string.h:23:29: note: template is declared here
23 | struct _LIBCPP_TEMPLATE_VIS char_traits;
| ^
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.cpp:24:
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-destination.h:25:
In file included from /var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-global.h:32:
In file included from /usr/include/c++/v1/string:625:
/usr/include/c++/v1/string_view:296:43: error: implicit instantiation of undefined template 'std::char_traits<unsigned short>'
296 | static_assert((is_same<_CharT, typename traits_type::char_type>::value),
| ^
/usr/include/c++/v1/__type_traits/is_convertible.h:28:102: note: in instantiation of template class 'std::basic_string_view<unsigned short>' requested here
28 | struct _LIBCPP_TEMPLATE_VIS is_convertible : public integral_constant<bool, __is_convertible(_T1, _T2)> {};
| ^
/usr/include/c++/v1/string:702:29: note: in instantiation of template class 'std::is_convertible<const std::basic_string<unsigned short> &, std::basic_string_view<unsigned short>>' requested here
702 | : public _BoolConstant< is_convertible<const _Tp&, basic_string_view<_CharT, _Traits> >::value &&
| ^
/usr/include/c++/v1/string:1044:27: note: in instantiation of template class 'std::__can_be_converted_to_string_view<unsigned short, std::char_traits<unsigned short>, std::basic_string<unsigned short>>' requested here
1044 | __enable_if_t<__can_be_converted_to_string_view<_CharT, _Traits, _Tp>::value &&
| ^
/usr/include/c++/v1/string:1047:93: note: while substituting prior template arguments into non-type template parameter [with _Tp = std::basic_string<unsigned short>]
1047 | _LIBCPP_METHOD_TEMPLATE_IMPLICIT_INSTANTIATION_VIS _LIBCPP_CONSTEXPR_SINCE_CXX20 explicit basic_string(const _Tp& __t)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
1048 | : __r_(__default_init_tag(), __default_init_tag()) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1049 | __self_view __sv = __t;
| ~~~~~~~~~~~~~~~~~~~~~~~
1050 | __init(__sv.data(), __sv.size());
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1051 | }
| ~
/usr/include/c++/v1/string:709:7: note: while substituting deduced template arguments into function template 'basic_string' [with _Tp = std::basic_string<unsigned short>, $1 = (no value)]
709 | class basic_string {
| ^
/var/tmp/portage/app-text/poppler-9999/work/poppler-9999/cpp/poppler-global.h:101:26: note: while declaring the implicit copy constructor for 'ustring'
101 | class POPPLER_CPP_EXPORT ustring : public std::basic_string<unsigned short>
| ^
/usr/include/c++/v1/__fwd/string.h:23:29: note: template is declared here
23 | struct _LIBCPP_TEMPLATE_VIS char_traits;
| ^
2 errors generated.
ninja: build stopped: subcommand failed.
```
The generic char_traits implementation was deprecated in LLVM 17 and removed in https://github.com/llvm/llvm-project/commit/c3668779c13596e223c26fbd49670d18cd638c40.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1461
Add option to not overwrite files
2024-01-29T22:48:29Z
kenorb

Currently, when using the "pdftotext file.pdf file.txt" syntax, the destination file is always overwritten.
It would be great to have an option to skip the conversion if the file already exists.
Otherwise the default behaviour could be very destructive.
For example when you specify the same file as destination (by mistake), it's going to be zeroed. So there should be some safer option to work with which won't erase the existing files.
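Until such an option exists, the check can be done by the caller. The following is a minimal sketch, assuming `pdftotext` is on `PATH`; the helper name `convert_without_overwrite` and its `runner` parameter are hypothetical and only for illustration. The existence check is not atomic, so it does not protect against another process creating the file in between.

```python
import os
import subprocess


def convert_without_overwrite(pdf_path, txt_path, runner=subprocess.run):
    """Run pdftotext only if the destination does not exist yet.

    Returns True if the conversion was started, False if it was skipped.
    Because the source PDF exists, passing the same path as source and
    destination is skipped too.
    """
    if os.path.exists(txt_path):
        return False
    # Plain "pdftotext input.pdf output.txt" invocation, as in the report above.
    runner(["pdftotext", pdf_path, txt_path], check=True)
    return True
```

A call such as `convert_without_overwrite("file.pdf", "file.txt")` then leaves an existing `file.txt` untouched.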
pdftotext version 22.02.0

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1460
pdfimages should return exit code 2 when it cannot open output files
2024-01-24T22:33:04Z
Fernando Herrera

This is the current behavior:
```
fer@dyckola:~$ pdfimages test-manuscript.pdf /dev/null/cannot-write-here/page-
I/O Error: Couldn't open image file '/dev/null/cannot-write-here/page--000.ppm'
fer@dyckola:~$ echo $?
0
```
But according to the man page it should be 2:
```
EXIT CODES
The Xpdf tools use the following exit codes:
0 No error.
1 Error opening a PDF file.
2 Error opening an output file.
3 Error related to PDF permissions.
99 Other error.
```

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1458
pdftotext: support tsv output in reading order
2024-01-08T18:59:48Z
Fawaz Ahmed

Hello,
I see [tsv flag](https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/831) was added to emulate tesseract format.
Tesseract prints tsv in reading order, but the tsv output by pdftotext is not in reading order.
It would be helpful if the tsv output followed the `-layout` reading order when `-tsv` is set.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1457
Simplified Chinese display as a variant (Japanese) glyph
2024-02-29T18:54:22Z
Firestar-Reimu

Original issue: https://bugs.kde.org/show_bug.cgi?id=461499
Wrong display: https://imgse.com/i/xXc0Qe
Correct display: https://imgse.com/i/xjJ1mD
You can see the characters: “探”、“将”、“关”
I use Okular + poppler-data
```
$ pdffonts 1.pdf | iconv -f gbk
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
方正书宋简体 CID TrueType GBK-EUC-H no no no 227 0
方正书宋_GBK CID TrueType GBK-EUC-H no no no 64 0
方正黑体_GBK CID TrueType GBK-EUC-H no no no 102 0
方正楷体_GBK CID TrueType GBK-EUC-H no no no 65 0
DY1+ZKWGVK-1 Type 1 Custom yes no yes 66 0
DY2+ZKWGVK-2 Type 1 Custom yes no yes 67 0
DY3+ZKWGVK-3 Type 1 Custom yes no yes 228 0
DY4+ZKWGVK-4 Type 1 Custom yes no yes 229 0
DY5+ZKWGVK-5 Type 1 Custom yes no yes 230 0
DY6+ZKWGVK-6 Type 1 Custom yes no yes 69 0
DY7+ZKWGVK-7 Type 1 Custom yes no yes 104 0
DY8+ZKWGVK-8 Type 1 Custom yes no yes 131 0
DY9+ZKWGVL-9 Type 1 Custom yes no yes 219 0
DY10+ZKWGVL-10 Type 1 Custom yes no yes 211 0
DY11+ZKWGVN-11 Type 1 Custom yes no yes 183 0
DY12+ZKWGVN-12 Type 1 Custom yes no yes 203 0
DY13+ZKWGVO-13 Type 1 Custom yes no yes 194 0
DY14+ZKWGVO-14 Type 1 Custom yes no yes 182 0
DY15+ZKWGVP-15 Type 1 Custom yes no yes 172 0
DY16+ZKWGVP-16 Type 1 Custom yes no yes 163 0
DY17+ZKWGVQ-17 Type 1 Custom yes no yes 155 0
DY18+ZKWGVR-18 Type 1 Custom yes no yes 145 0
DY19+ZKWGVS-19 Type 1 Custom yes no yes 130 0
DY20+ZKWGVT-20 Type 1 Custom yes no yes 120 0
DY21+ZKWGVT-21 Type 1 Custom yes no yes 103 0
DY22+ZKWGVT-22 Type 1 Custom yes no yes 105 0
DY23+ZKWGVT-23 Type 1 Custom yes no yes 96 0
DY24+ZKWGVT-24 Type 1 Custom yes no yes 68 0
DY25+ZKWGVT-25 Type 1 Custom yes no yes 70 0
```
PDF: https://pb.nichi.co/unveil-laptop-foil
It used Noto Sans CJK SC as a substitute
https://imgse.com/i/xXjE3F
but:
1. this is not SC (simplified Chinese) glyphs
2. I set SC higher than JP in `/etc/fonts/conf.d/64-language-selector-prefer.conf`
[PDF example](https://bugsfiles.kde.org/attachment.cgi?id=163003)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1454
Unicode supplementary plane support in annotation
2024-01-15T16:48:47Z
Keyu Tao

Currently, poppler/Annot.cc still assumes each Unicode (UTF-16) character (scalar) takes 2 bytes. (https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/poppler/Annot.cc#L3042, https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/poppler/Annot.cc#L3048-3049)
This is true for BMP (Basic Multilingual Plane) characters. However, some characters, such as emoji and some rare characters in natural languages, are not in the BMP and take 4 bytes in UTF-16:
```console
>>> # use Python console as an example
>>> "a".encode(encoding="utf-16")[2:] # BOM stripped
b'a\x00'
>>> "😀".encode(encoding="utf-16")[2:]
b'=\xd8\x00\xde'
>>> "𰻝".encode(encoding="utf-16")[2:]
b'\x83\xd8\xdd\xde'
```
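For reference, combining a surrogate pair into a single scalar value can be sketched as follows. This is a standalone illustration, not Poppler code: the function name `decode_utf16be` is hypothetical, and it assumes big-endian input without a BOM (high byte first), unlike the little-endian dumps above.

```python
def decode_utf16be(data: bytes) -> list:
    """Decode big-endian UTF-16 bytes (no BOM) into Unicode scalar values."""
    scalars = []
    i = 0
    while i + 1 < len(data):
        unit = (data[i] << 8) | data[i + 1]
        i += 2
        # A high surrogate must be followed by a low surrogate.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(data):
            low = (data[i] << 8) | data[i + 1]
            if 0xDC00 <= low <= 0xDFFF:
                # Both halves are offset: 0xD800 for the high surrogate,
                # 0xDC00 for the low surrogate.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
        scalars.append(unit)
    return scalars


print([hex(c) for c in decode_utf16be("a\U0001F600".encode("utf-16-be"))])
```

The standard decoding subtracts `0xDC00` from the low surrogate as well as `0xD800` from the high one. The expression in the patch below adds the low surrogate without that offset, which may be worth double-checking.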
I have tried to add supplementary plane handling inside `HorizontalTextLayouter` constructor like this:
```diff
diff --git a/poppler/Annot.cc b/poppler/Annot.cc
index e8db39ff..8147d89f 100644
--- a/poppler/Annot.cc
+++ b/poppler/Annot.cc
@@ -3044,18 +3044,31 @@ public:
newFontNeeded = false;
} else {
Unicode uChar;
+ int charLength;
if (isUnicode) {
uChar = (unsigned char)(text->getChar(i)) << 8;
uChar += (unsigned char)(text->getChar(i + 1));
+ charLength = 2;
+ // If uChar is in supplementary plane, we need to get the next character
+ // because the font may not have the glyph for the first character.
+ if (uChar >= 0xD800 && uChar <= 0xDBFF) {
+ if (i + 3 < text->getLength()) {
+ uChar = (uChar - 0xD800) * 0x400 + ((unsigned char)(text->getChar(i + 2)) << 8) + (unsigned char)(text->getChar(i + 3)) + 0x10000;
+ charLength = 4;
+ printf("uChar: %x\n", uChar);
+ }
+ }
} else {
uChar = pdfDocEncoding[text->getChar(i) & 0xff];
+ charLength = 1;
}
const std::string auxFontName = form->getFallbackFontForChar(uChar, *font);
if (!auxFontName.empty()) {
+ printf("auxFontName: %s\n", auxFontName.c_str());
std::shared_ptr<GfxFont> auxFont = form->getDefaultResources()->lookupFont(auxFontName.c_str());
// Here we just layout one char, we don't know if the one afterwards can be layouted with the original font
- GooString auxContents = GooString(text->toStr().substr(i, isUnicode ? 2 : 1));
+ GooString auxContents = GooString(text->toStr().substr(i, charLength));
if (isUnicode) {
auxContents.prependUnicodeMarker();
}
@@ -3070,13 +3083,14 @@ public:
// we also need to allow the character if we have not layouted anything yet because otherwise we will end up in an infinite loop
// because it is assumed we at least layout one character
if (!availableWidth || *availableWidth > 0 || (isUnicode && i == 2) || (!isUnicode && i == 0)) {
- i += isUnicode ? 2 : 1;
+ i += charLength;
data.emplace_back(outputText.toStr(), auxFontName, blockWidth, charCount);
}
} else {
+ printf("auxFontName: not found\n");
error(errSyntaxError, -1, "HorizontalTextLayouter, couldn't find a font for character U+{0:04uX}", uChar);
newFontNeeded = false;
- i += isUnicode ? 2 : 1;
+ i += charLength;
}
}
// Now layout the rest of the text with the original font
```
However, this does not work (I'm testing this with Okular) as it could not find a font to show the new uChar. I'm afraid that further investigation is a bit beyond my knowledge :(

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1453
Text fails to display in Cairo backend but it's ok in Okular and Acrobat Reader
2024-01-01T12:58:10Z
Nelson Benítez León

The attached PDF (created by Acrobat Distiller 6.0 in Windows) shows fine in Okular and Acrobat Reader, but Evince and the Poppler Cairo backend (pdftocairo) fail to display very large portions of text; it seems to be something related to the embedded fonts.
[bug168518.pdf](/uploads/e7ca4a22af14e4579caeedf6f9856899/bug168518.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1452
pdfimages -png and -tiff give inverted colour output
2023-12-30T14:09:36Z
Shriramana Sharma

Please download the attachment, which is just the first page from https://archive.org/details/wg224 (to avoid a huge download). It contains black text on a white background.
Run the commands:
```
pdfimages -png p.pdf q
pdfimages -tiff p.pdf r
pdfimages -all p.pdf s
fax2tiff -o s-000.tif $(< s-000.params) s-000.ccitt
```
We can see that the files q-000.png and r-000.tif display the colours inverted, i.e. white text on a black background, whereas going to CCITT and then to TIF gives the correct output.
Please look into this and fix it. Thank you!
Attachment:
[p.pdf](/uploads/5b253d8d62460aa22c9ac8a3d8b1d00a/p.pdf)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1451
pdf to svg shows raster lines (clippaths)
2024-01-02T16:14:59Z
rvanderboom

Hey,
We would like to use `pdftocairo -svg` for our PDF-to-SVG conversion, but some material that consists of several images combined into one image via clip paths shows raster lines where the separate images meet.
Some other (paid) tools do not do this and render it as it appears in the PDF.
I have attached the material and the SVG result.
[HFD_20231216_0_005_HI.pdf](/uploads/e3e430b97bed774cf0519ec386875a4e/HFD_20231216_0_005_HI.pdf)
[HFD_20231216_0_005_HI.svg](/uploads/9a3a1197942d9fa55de206ec3a896d9b/HFD_20231216_0_005_HI.svg)

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1450
pdftocairo -pdf causes font errors that did not exist
2023-12-14T22:24:28Z
Hakan Usakli

The provided sample is a 1 page pdf without any errors in Adobe Acrobat.
[input.pdf](/uploads/e3ac808e1c5470a974e195224741cba6/input.pdf)
After processing on Windows with Poppler version 23.11.0
`pdftocairo.exe -pdf "c:\temp\input.pdf" "c:\temp\output.pdf"`
and checking the file in Adobe Acrobat, the following error message is introduced
![image](/uploads/1eeb6dc53c2d9d8fa93847e553fe49e6/image.png)
For your information and Best Regards

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1448
[skia] FTBFS on Android with Fontconfig font manager
2023-12-10T22:55:21Z
JonLiu1993

When I update the poppler port to version [23.11.0](https://github.com/microsoft/vcpkg/pull/35494), I get this error:
````
CMakeFiles/poppler.dir/poppler/GlobalParams.cc.o -c /mnt/vcpkg-ci/buildtrees/poppler/src/er-23.11.0-08ca2759be.clean/poppler/GlobalParams.cc
/mnt/vcpkg-ci/buildtrees/poppler/src/er-23.11.0-08ca2759be.clean/poppler/GlobalParams.cc:1563:5: error: use of undeclared identifier 'displayFontDir'; did you mean 'displayFontDirs'?
displayFontDir = fontDir;
^~~~~~~~~~~~~~
displayFontDirs
/mnt/vcpkg-ci/buildtrees/poppler/src/er-23.11.0-08ca2759be.clean/poppler/GlobalParams.cc:1379:20: note: 'displayFontDirs' declared here
static const char *displayFontDirs[] = { "/usr/share/ghostscript/fonts", "/usr/local/share/ghostscript/fonts", "/usr/share/fonts/default/Type1", "/usr/share/fonts/default/ghostscript", "/usr/share/fonts/type1/gsfonts", nullptr };
^
/mnt/vcpkg-ci/buildtrees/poppler/src/er-23.11.0-08ca2759be.clean/poppler/GlobalParams.cc:1563:20: error: array type 'const char *[6]' is not assignable
displayFontDir = fontDir;
````
I see that displayFontDir is declared earlier in the file, so I don't know why I'm getting this error.
https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/poppler/GlobalParams.cc?ref_type=heads#L1278
````
// The path to the font directory. Set by GlobalParams::setFontDir()
static std::string displayFontDir;
````