Syntax errors are output to the console
I have a PDF (a few of them, actually) that makes pdftotext, or a program we wrote using the cpp interface, print a series of errors to the console. Other programs don't, so I'm wondering whether these are fatal errors or what is going on. For example, I get the following output with CAV-2-FORMB.pdf:
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
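For anyone trying to reproduce this from the cpp interface, here is a minimal sketch of how the messages could be collected instead of printed. It assumes a poppler version whose cpp frontend provides poppler::set_debug_error_function; ErrorLog and collectError are hypothetical names of mine, and the registration call is left as a comment so the snippet compiles without poppler.

```cpp
#include <map>
#include <string>

// Hypothetical container: counts how often each distinct message was reported.
struct ErrorLog {
    std::map<std::string, int> counts;
};

// Callback with the shape poppler-cpp expects for its debug_func:
//   void (*)(const std::string &, void *)
// The closure pointer is assumed to point at an ErrorLog.
void collectError(const std::string &message, void *closure) {
    auto *log = static_cast<ErrorLog *>(closure);
    ++log->counts[message];
}

// Assumed registration, done once before loading the document:
//   ErrorLog log;
//   poppler::set_debug_error_function(collectError, &log);
```

With something like this in place, the two distinct messages above would each show up once with a count rather than as repeated console lines.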
The reason I found this is that we are hashing the contents of PDFs. At first we just hashed the whole document. However, some documents our system sees are identical but are produced dynamically, so the timestamp changes but nothing else does. So we wrote the following to hash only the contents. When we ran our tests we started seeing the output above, and we confirmed that pdftotext and friends print it too.
I figured I would include our hashing code as well, as I'm wondering:
a) Is it a reliable way to detect identical PDF documents, or could different versions of poppler conceivably produce slightly different output, causing the hashes to change with poppler updates?
b) Are there better ways of hashing a document's content while ignoring the creation timestamps etc.?
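#include <cstdio>
#include <memory>
#include <string>
#include <openssl/sha.h>
#include <phpcpp.h>           // Php::Exception (assuming PHP-CPP)
#include <poppler-document.h>
#include <poppler-page.h>

// Hash only the extracted text of every page, so metadata such as
// creation timestamps does not affect the result.
std::string hashContents(poppler::document *doc) {
    // SHA-256 yields SHA256_DIGEST_LENGTH (32) bytes; SHA_DIGEST_LENGTH (20)
    // is the SHA-1 size and would make SHA256_Final overflow md.
    unsigned char md[SHA256_DIGEST_LENGTH];
    char mdString[SHA256_DIGEST_LENGTH * 2 + 1];
    const int lastPage = doc->pages();
    SHA256_CTX context;
    if (!SHA256_Init(&context)) {
        throw Php::Exception("Unable to initialize openssl context");
    }
    for (int x = 0; x < lastPage; x++) {
        // unique_ptr frees the page even if an exception is thrown below.
        std::unique_ptr<poppler::page> page(doc->create_page(x));
        if (!page) {
            throw Php::Exception("Unable to load page");
        }
        poppler::ustring pageData = page->text(page->page_rect(poppler::media_box));
        poppler::byte_array arr = pageData.to_utf8();
        // data() is safe for a page with no text, unlike &arr[0].
        if (!SHA256_Update(&context, arr.data(), arr.size())) {
            throw Php::Exception("Unable to update hash");
        }
    }
    if (!SHA256_Final(md, &context)) {
        throw Php::Exception("Unable to finalize hash");
    }
    // Hex-encode the digest; each snprintf call writes two digits plus a NUL.
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) {
        std::snprintf(&mdString[i * 2], 3, "%02x", (unsigned int)md[i]);
    }
    return mdString;
}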
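Regarding question (a), one thing we are considering is normalizing the extracted text before hashing, on the assumption that whitespace and line breaks are the part most likely to differ between poppler versions. I have not verified that assumption. A minimal sketch follows; normalizeWhitespace is a hypothetical helper of ours, not a poppler function, and it only handles ASCII whitespace (a non-breaking space encoded in UTF-8 passes through unchanged).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: collapse every run of ASCII whitespace into a single
// space and drop leading and trailing whitespace. Bytes of multi-byte UTF-8
// sequences are copied as-is (std::isspace is false for them in the "C" locale).
std::string normalizeWhitespace(const std::string &text) {
    std::string out;
    out.reserve(text.size());
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // Remember the gap, but never emit one at the very start.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) {
                out.push_back(' ');
                pendingSpace = false;
            }
            out.push_back(static_cast<char>(c));
        }
    }
    // A trailing gap is never followed by a character, so it is dropped.
    return out;
}
```

In hashContents this would be applied to each page's UTF-8 bytes (converted to a std::string) before calling SHA256_Update. The trade-off is that two documents differing only in spacing would then hash the same.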