Issue #725

Created Feb 19, 2019 by Nathanael Noblet (@gnat)

Syntax errors are output to the console

I have a PDF (a few of them, actually) that causes pdftotext (or a program we wrote using the cpp interface) to output a series of errors to the console. Other programs don't, so I'm wondering whether this is a fatal error or what's going on. For example, I get the following output with CAV-2-FORMB.pdf:

Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string
Syntax Error: Unknown font in field's DA string
Syntax Error: Missing 'Tf' operator in field's DA string

The reason I found this is that we were hashing the contents of PDFs, at first by just hashing the whole document. However, some documents our system sees are the same but are produced dynamically, so the timestamp changes but nothing else does. So we wrote the following to hash only the text contents. When we ran our tests we started seeing the output above and checked against pdftotext and friends.

I figured I would include our hashing code as well, as I'm wondering:

a) Is this a reliable way to detect identical PDF documents, or could different versions of poppler conceivably produce slightly different text output, causing the hashes to change with poppler updates?

b) Are there better ways of hashing a document's content while ignoring the creation timestamps etc.?

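#include <cstdio>
#include <memory>
#include <string>

#include <openssl/sha.h>
#include <phpcpp.h>
#include <poppler-document.h>
#include <poppler-page.h>

// Hash only the extracted text of every page, so that metadata such as the
// creation timestamp does not affect the result.
std::string hashContents(poppler::document *doc) {
    // SHA-256 writes 32 bytes; SHA_DIGEST_LENGTH (20, the SHA-1 size) would
    // overflow the buffer and truncate the hex string.
    unsigned char md[SHA256_DIGEST_LENGTH];
    char mdString[SHA256_DIGEST_LENGTH * 2 + 1];

    SHA256_CTX context;
    if (!SHA256_Init(&context)) {
        throw Php::Exception("Unable to initialize openssl context");
    }

    const int pageCount = doc->pages();
    for (int x = 0; x < pageCount; x++) {
        // unique_ptr frees the page even if an exception is thrown below.
        std::unique_ptr<poppler::page> page(doc->create_page(x));
        if (!page) {
            throw Php::Exception("Unable to load page");
        }
        poppler::ustring pageData = page->text(page->page_rect(poppler::media_box));
        poppler::byte_array arr = pageData.to_utf8();

        // A page without text yields an empty array; data() is safe there,
        // whereas &arr[0] is not.
        if (!SHA256_Update(&context, arr.data(), arr.size())) {
            throw Php::Exception("Unable to update hash");
        }
    }

    if (!SHA256_Final(md, &context)) {
        throw Php::Exception("Unable to finalize hash");
    }

    // Convert the digest to a lowercase hex string.
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) {
        snprintf(&mdString[i * 2], 3, "%02x", (unsigned int)md[i]);
    }

    return mdString;
}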
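
Regarding question a), one idea that might make the hash less sensitive to small layout differences in the extracted text is to normalize whitespace before feeding each page to SHA256_Update. This is only a sketch under that assumption: normalizeWhitespace is a hypothetical helper, not part of poppler, and nothing here confirms that whitespace is what actually varies between poppler versions.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of poppler): collapse every run of
// whitespace into a single space and drop leading/trailing whitespace.
// In the default "C" locale only ASCII whitespace is affected, so the
// bytes of multi-byte UTF-8 characters pass through unchanged.
std::string normalizeWhitespace(const std::string &text) {
    std::string out;
    out.reserve(text.size());
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // Remember the gap, but only if some text was already emitted.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) {
                out.push_back(' ');
                pendingSpace = false;
            }
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

In hashContents this could be applied to the UTF-8 bytes of each page, for example by building a std::string from arr.begin() and arr.end(), normalizing it, and hashing the result instead of arr. The trade-off is that documents differing only in whitespace would then hash identically.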
Edited Feb 19, 2019 by Nathanael Noblet