Text search is slow in general: find_text_with_options's get_text_page calls lots of Gfx functions spamming LCMS2's CreateTransform
This is a repost/migration of a comment I originally posted on @gpoo's issue #104, as I realize now that it might be somewhat different (or not), because issue 104 was about a specific document, whereas my bug report here is about slow search in general with any big document I've tried.
It seems as if, somehow, Poppler is spamming LCMS2 when doing text search operations.
Methodology
My test case today was to run a simple word search through the infamously big MS Office Open XML spec's main PDF, which you can publicly download (get the "4th edition", unzip, then unzip again the "Part 1" to find the biggest (35 MB) PDF file, which is 5000 pages). In that document, you can search for "unicode" (or "unicode character", or whatever words you fancy) in there. On my computer, for any of those queries, it takes:
- 2 minutes 48 seconds using Evince;
-
3 minutes 50 secondsusing "Papers" git as of 2024-03-06; - 2 minutes 35 seconds using "Papers" git as of 2024-04-22 (after we optimized the searchentry widget).
This kind of search slowness can also be seen with simpler documents such as the ThinkPad X220's hardware maintenance manual (150 pages). There, it takes 15 to 25 seconds to search for a word throughout the document.
I grabbed the latest git version of "Papers" (the GTK4 fork/continuation of Evince) and Evince 45, and profiled the whole system while searching through the contents of the Office Open XML spec's PDF (the ECMA-376 document mentioned above).
Before profiling, I installed as many relevant debuginfo symbols I could find on Fedora 39, with this command:
dnf debuginfo-install evince evince-libs poppler poppler-glib lcms2 lcms2-utils glib2 gtk4 gio
With Sysprof 46, I recorded a few seconds of searching for the expression "unicode characters" in that 5000 pages document.
Findings
Below are what we can see with these Sysprof 46 captures:
- Papers__22faster-search_22_branch_-_standalone_Sysprof_capture.tar.xz
- Evince_45_-_standalone_sysprof_capture.tar.xz
CPU vs IO usage
For both "Papers" and Evince, on my 8-threads CPU (an Intel Xeon W3520), only one of the logical CPUs/threads gets used (as known in #338 (closed)):
CPU usage: we can see it is single-threaded | Disk I/O: not the culprit |
---|---|
Function calls analysis
For the function calls analysis, I'm starting with Evince, because I have complete debug symbolization visibility for it (all the way through LCMS2), unlike "Papers":
Most expensive / frequently called functions (callgraph) with Evince | Flamegraph (also non-chronological, represents totals) with Evince |
---|---|
Most expensive / frequently called functions (callgraph) with "Papers" (incomplete symbols) | Flamegraph with "Papers" (incomplete symbols) |
---|---|
Even though it installed the debuginfo and debugsource packages for lcms2, with Papers' flatpaked environment I get mystery unreadable symbols for lcms2's function calls in the output, so I can't tell which exact LCMS2 functions are being called there so often, but we can presume they are the exact same functions as the ones we actually can see in Evince's poppler profiling output.
Compositor (GNOME Shell & Mutter 45.4, under Wayland) marks
Evince | Papers |
---|---|
From a layman's perspective, I don't understand why LCMS would be called by a text search function in poppler. Is that really normal?
I hope this info helps somehow!