poppler issues
https://gitlab.freedesktop.org/poppler/poppler/-/issues
2023-09-26T23:42:41Z

https://gitlab.freedesktop.org/poppler/poppler/-/issues/904
pdftotext inserts newline when there is none
2023-09-26T23:42:41Z
Witold Baryluk

Source pdf: https://www.ne.ch/autorites/DFS/SCSP/medecin-cantonal/maladies-vaccinations/Documents/Covid-19-Statistiques/COVID19_PublicationInternet.pdf
snapshot from archive: https://web.archive.org/web/20200408054553if_/https://www.ne.ch/autorites/DFS/SCSP/medecin-cantonal/maladies-vaccinations/Documents/Covid-19-Statistiques/COVID19_PublicationInternet.pdf
[COVID19_PublicationInternet.pdf](/uploads/ea27c1aaf22537c500145735ba474fc9/COVID19_PublicationInternet.pdf)
I am using `pdftotext -layout` for this.
This happens both with version 0.71 (Debian testing) and 0.85 (Debian experimental).
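For anyone trying to reproduce this, the following is a minimal sketch of how the conversion could be scripted. It assumes `pdftotext` is on `PATH`. The helper names `build_pdftotext_cmd` and `extract_layout_text` are made up for illustration, and passing `-` as the output file relies on the documented behaviour of sending the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path):
    """Build the command line used in this report: pdftotext -layout <pdf> -"""
    # "-" as the text-file argument asks pdftotext to write to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_layout_text(pdf_path):
    """Run pdftotext in layout mode and return the extracted text.

    Hypothetical helper; raises CalledProcessError if pdftotext fails.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        check=True,
        capture_output=True,
    )
    # Decode explicitly so that accented characters (e.g. "é") survive.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_layout_text("COVID19_PublicationInternet.pdf")` on the attached file should then return the text that contains the broken lines shown in this report.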
Example of problematic conversions:
Start of the document:
![first_page_header](/uploads/e26003f21f0aeb83b3f197560c1e23e2/first_page_header.png)
Output:
```
Servicedel
asantépubli
que
Donnéesbaséessurlesdéc
l ar
ati
onsdelabo
Neuc
hât
el-CasCOVI
D-19posi
tif
s
Tableauact
uali
```
End of the document (table):
![last_page_table](/uploads/b4590461d225733dcf84f1c711ac3ddb/last_page_table.png)
Output:
```
8avri
l2020 5 518 53 3 7 63 3 7 4 14 1 37
9avri
l2020 18 536 48 3 7 58 3 7 4 14
10avri
l2020 52 3 8 63 3 8 4 15
```
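The split dates in the table output above can be flagged mechanically. This is only a rough sketch of a heuristic for this particular output, not something poppler provides: the function name `find_split_dates` and both regular expressions are assumptions chosen to match fragments such as `8avri` followed by `l2020 ...`.

```python
import re

# Assumed pattern: a line holding only a day number glued to a word fragment.
FRAGMENT = re.compile(r"^\d{1,2}[a-zéû]+$")
# Assumed pattern: the next line starts with the remaining letters and a year.
CONTINUATION = re.compile(r"^[a-zéû]*\d{4}\b")


def find_split_dates(lines):
    """Return (line_index, rejoined_token) for every suspected split date."""
    hits = []
    stripped = [line.strip() for line in lines]
    for i in range(len(stripped) - 1):
        head, tail = stripped[i], stripped[i + 1]
        if FRAGMENT.match(head) and CONTINUATION.match(tail):
            # Glue the fragment to the first token of the following line.
            hits.append((i, head + tail.split()[0]))
    return hits


sample = [
    "8avri",
    "l2020 5 518 53 3 7 63 3 7 4 14 1 37",
    "9avri",
    "l2020 18 536 48 3 7 58 3 7 4 14",
]
print(find_split_dates(sample))
```

A check like this only confirms the symptom in extracted text. It does not explain why the line break is inserted.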
Notice the new line after `avri`.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1070
pdftotext skips non-ASCII characters in PDF annotations
2023-01-31T09:32:58Z
Oliver Freyermuth

Trying to convert [main.pdf](/uploads/cb6982832c554485d41dcbed028ea897/main.pdf) with `pdftotext` produces this error:
```
Syntax Error: AnnotWidget::layoutText, cannot convert U+00EF
```
for the non-ASCII character ï (as in "naïve"), and also for other such characters. They are dropped from the text output.

https://gitlab.freedesktop.org/poppler/poppler/-/issues/318
[PATCH] Seccomp sandbox support for pdftotext
2018-10-25T18:17:41Z
Bugzilla Migration User

## Submitted by valo
Assigned to **poppler-bugs**
**[Link to original bug (#100224)](https://bugs.freedesktop.org/show_bug.cgi?id=100224)**
## Description
Created attachment 130253
seccomp support for pdftotext
Since some of the poppler tools, such as pdftotext, are used by some file managers to automatically parse PDF files for previews, I thought it might be a good idea to use some sandboxing.
This is a patch that adds a seccomp filter to pdftotext. The same approach can also be applied to the other tools that poppler provides, significantly reducing the risk of successful exploitation of vulnerabilities in poppler and in the libraries it uses.
I found this quite easy to apply and would be happy to help if you are interested in using this.
This patch can be applied to poppler 0.52.0 without further changes.
**Patch 130253**, "seccomp support for pdftotext":
[pdftotext_seccomp.patch](/uploads/92965944b5d4725bc29bd5f741d4c83c/pdftotext_seccomp.patch)