[poppler] Multi-threading rendering on Raspberry Pi
pqt at LEFerguson.com
Wed Feb 22 03:23:46 UTC 2017
I have a PDF rendering program for sheet music running on a Raspberry Pi 3. It uses Poppler 0.51.0 built from source, with Qt 5.8, through the Qt5 API.
I am seeing some weird threading performance behavior. I call page->renderToImage from a separate thread, or more precisely from several of them.
I am not getting any errors, and the results are correct.
For example, rendering the same two (and only two) pages in a single thread takes 5.7 and 5.6 seconds, a total of 11.3 seconds. When they are rendered in parallel, the first completes after 8.6 seconds and the second roughly 50 ms later, i.e. basically 8.6 seconds total.
That's faster, but not nearly as much faster as I anticipated.
There is no I/O that I can see going on at the time, there is no swap file (so no swap usage), plenty of memory, and nothing else running except the desktop services to display the images.
Three at a time gives about 8 seconds for the first, about 1.5 seconds for the second, and 0.6 for the third (I say "about" as my 3 page render was different content).
Even though no I/O occurs, increasing to 4 threads still does not keep the processor busy (e.g. as seen by "top"), which seems to imply some constraint other than the number of cores.
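As a cross-check on what "top" shows, one option is to compare process CPU time with wall-clock time around the threaded section. The sketch below is not Poppler code: busy_cores and spin are hypothetical names, and spin is only a stand-in for one page render. It relies on std::clock() reporting CPU time summed over all threads of the process, which holds on Linux/glibc but not on Windows.

```cpp
#include <chrono>
#include <ctime>
#include <functional>
#include <thread>
#include <vector>

// Runs `work` on n_threads threads and returns CPU seconds / wall seconds.
// A value near n_threads means every thread kept a core busy; a value
// near 1 suggests the threads mostly took turns.
double busy_cores(int n_threads, const std::function<void()>& work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (int i = 0; i < n_threads; ++i) pool.emplace_back(work);
    for (auto& t : pool) t.join();

    const double cpu =
        static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    const double wall = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall_start).count();
    return wall > 0.0 ? cpu / wall : 0.0;
}

// Hypothetical stand-in for one page render: pure CPU work, no locks.
void spin() {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 20000000UL; ++i) x = x + i;
}
```

Calling busy_cores(4, spin) on a four-core machine should give a ratio well above 1. Replacing spin with the real render call would show how far the renderer falls short of that.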
Here's what is stranger. If I submit 3 pages in a row, in order 1, 2, 3, to three separate threads (the Pi 3 has 4 cores), they always finish in order 3, 2, 1. I've instrumented this in as many ways as I can to confirm the sequence (and yes, that they really are running in separate threads). That's not a big deal program-logic wise, but it is an odd symptom. That aspect is reproducible on a fast Hyper-V box I use for testing (it processes them fast enough that the rendering speed is not terribly meaningful there): the pages always finish in reverse order. And the finish times are not all that close together (i.e. it's not a stream I/O issue with the debug output).
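For anyone wanting to reproduce the ordering check, a minimal harness along these lines records the finish order in memory under a mutex, so buffering of debug output cannot reorder anything. This is a sketch and not my actual instrumentation. completion_order is a hypothetical name, and each task would wrap one renderToImage call in the real program.

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Starts one thread per task, in index order, and returns the task
// indices in the order in which the tasks finished.
std::vector<int> completion_order(
        const std::vector<std::function<void()>>& tasks) {
    std::vector<int> order;
    std::mutex order_mutex;
    std::vector<std::thread> pool;

    for (int i = 0; i < static_cast<int>(tasks.size()); ++i) {
        pool.emplace_back([&tasks, &order, &order_mutex, i] {
            tasks[i]();  // the work being timed
            std::lock_guard<std::mutex> lock(order_mutex);
            order.push_back(i);  // record the finish position
        });
    }
    for (auto& t : pool) t.join();
    return order;
}
```

With independent tasks of equal cost the returned order should vary from run to run. A stable 2, 1, 0 result (zero-based) would correspond to the reverse order described above.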
Makes me wonder if something is blocking/serialized, forcing the LIFO behavior and so perhaps keeping me from getting the most performance.
Are there any special considerations for using Poppler with multi-threaded rendering? Different cmake options for example? Different calling sequences?
I realize that the Pi 3 architecture might be causing this, e.g. limited memory speed making multi-threading less efficient. I really did not think much about it until I realized the renders (started within a millisecond of each other) always finish in reverse order of initiation.
Incidentally, I have tried compiling both Poppler and my application with -O2 and with -O3, with negligible difference in performance.
Any suggestions or insights would be welcomed.