text().to_latin1() returns invalid STL string with zero bytes inside of it
I use this program to convert a PDF document to text:
#include <iostream>
#include "poppler-document.h"
#include "poppler-page.h"
using namespace std;
int main(int argc, char *argv[]) {
    if (argc < 2) {
        cerr << "usage: " << argv[0] << " file.pdf" << endl;
        return 1;
    }
    poppler::document *doc = poppler::document::load_from_file(argv[1]);
    if (!doc) {
        cerr << "cannot load " << argv[1] << endl;
        return 1;
    }
    const int pagesNbr = doc->pages();
    cout << "page count: " << pagesNbr << endl;
    for (int i = 0; i < pagesNbr; i++) {
        cout << "page " << (i+1) << " of " << pagesNbr << endl;
        poppler::page *page = doc->create_page(i);
        cout << page->text().to_latin1() << endl; // std::string: every byte is written, zero bytes included
        //cout << page->text().to_latin1().c_str() << endl; // C-string pointer into the same std::string: stops at the first zero byte
        delete page;
    }
    delete doc;
}
The document can be downloaded from here: http://sci-hub.tw/10.1016/j.cell.2018.03.006
I ran the program above, which prints text().to_latin1(), and a variation of it that prints text().to_latin1().c_str(). The results differ on page 23.
The attached screenshot compares the output of both programs in hex format.
text().to_latin1() returns a string containing a zero byte, which gets printed into the output (left side of the attached screenshot). The right side shows the corresponding C string, which treats this zero byte as a terminator.
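The difference between the two outputs can be reproduced without poppler. This is only a minimal sketch: the sample string and the helper names are made up for illustration, and nothing here is part of the poppler API.

```cpp
#include <sstream>
#include <string>

// Hypothetical sample: "abc", one zero byte, then "def" (7 bytes in total).
// The length is passed explicitly so the constructor does not stop at the zero byte.
std::string sample_with_zero() {
    return std::string("abc\0def", 7);
}

// Number of bytes written when the std::string itself is streamed.
std::size_t bytes_written_as_string(const std::string &s) {
    std::ostringstream out;
    out << s;          // writes s.size() bytes, zero bytes included
    return out.str().size();
}

// Number of bytes written when the C-string pointer is streamed.
std::size_t bytes_written_as_cstr(const std::string &s) {
    std::ostringstream out;
    out << s.c_str();  // stops at the first zero byte
    return out.str().size();
}
```

For the sample string the first helper reports 7 bytes and the second reports 3, which matches the truncation visible on the right side of the screenshot.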
The outputs should be identical. Strictly speaking, std::string can hold embedded zero bytes, but extracted page text is not expected to contain one, so I believe poppler returning it is a bug.
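To pin down where the zero byte sits in the page text, the returned string could be scanned before it is printed. A sketch under that assumption; zero_byte_offsets is a hypothetical helper, not something poppler provides.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the offset of every zero byte in s.
std::vector<std::size_t> zero_byte_offsets(const std::string &s) {
    std::vector<std::size_t> offsets;
    for (std::size_t pos = s.find('\0'); pos != std::string::npos;
         pos = s.find('\0', pos + 1)) {
        offsets.push_back(pos);
    }
    return offsets;
}
```

In the program above it would be called on the result of text().to_latin1() for each page. A non-empty result for page 23 would confirm the zero byte and give its position in the text.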
poppler-0.77.0 on FreeBSD 12 amd64, installed from the package.