[poppler] [RFC] Extend regtest framework to track performance

Ihar Filipau thephilips at gmail.com
Wed Dec 30 11:00:17 PST 2015


On 12/30/15, Albert Astals Cid <aacid at kde.org> wrote:
> On Wednesday 30 December 2015, at 17:04:42, Adam Reichold wrote:
>> Hello again,
>>
>> as discussed in the code modernization thread, if we are going to make
>> performance-oriented changes, we need a simple way to track functional and
>> performance regressions.
>>
>> The attached patch tries to extend the existing Python-based regtest
>> framework to measure run time and memory usage to spot significant
>> performance changes in the sense of relative deviations w.r.t. these
>> two parameters. It also collects the sums of both, which might be used as
>> "ball park" numbers to compare the performance effect of changes over
>> document collections.
>
> Have you tried it? How stable are the numbers? For example, here I get, for
> rendering the same file (discarding the first run, which loads the file
> into memory), numbers that range from 620ms to 676ms, i.e. ~10% variation
> with no change at all.
>

To make the timing numbers stable, the benchmark framework should
repeat the test a few times, in my experience at least three. (I often
do as many as five runs.)

The final result is a pair: the average of the timings over all runs,
and (for example) the standard deviation (or simply the distance to the
min/max value) computed over the same timings.
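
For illustration, a minimal sketch of that computation in Python (the
language of the regtest framework); the timing values are made up:

import statistics

# Timings collected from the measured runs, in seconds (illustrative values).
timings = [0.640, 0.628, 0.655, 0.633, 0.647]

mean = statistics.mean(timings)
stdev = statistics.stdev(timings)        # sample standard deviation
spread = max(timings) - min(timings)     # or simply the min/max distance

print("%.3fs +/- %.3fs (spread %.3fs)" % (mean, stdev, spread))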

{I occasionally test performance on an embedded system running off
flash (no spinning disks, no network, nothing to screw up the timing),
yet I still get variations as high as 5%. Performance testing on a PC
is an even trickier business: some go as far as rebooting the system
into single-user mode and shutting down all unnecessary services.
Pretty much everything running in the background - and in the
foreground, e.g. the GUI - can contribute to the unreliability of the
numbers.}

For a benchmark on a normal Linux system (or similar), I would advise
performing the test once to "warm up" the caches, and only then
starting the measured test runs.

Summary:
1. A performance test framework should do a "warm-up" run whose timing
is discarded.
2. A performance test framework should repeat the test 3/5/etc. times,
collecting the timing information.
3. The collected timings are averaged and the deviation (or distance to
the min/max) is computed. The average is the official benchmark result;
the deviation is an indication of the reliability of the benchmark. A
sketch of the whole procedure follows below.
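
A rough sketch of the whole procedure in Python, with run_test standing
in for whatever a single regtest invocation actually does (the name and
signature are made up, not taken from the attached patch):

import statistics
import time

def benchmark(run_test, runs=5):
    # 1. Warm-up run: fills the caches, timing is discarded.
    run_test()

    # 2. Repeated measured runs.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_test()
        timings.append(time.perf_counter() - start)

    # 3. The average is the official result, the deviation indicates
    #    how reliable it is.
    return statistics.mean(timings), statistics.stdev(timings)

Memory usage could be tracked the same way, by recording e.g. peak RSS
per run instead of wall-clock time.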

fyi.

P.S. Note that 600ms is an OK-ish duration for a benchmark: not too
short, not too long. But generally, the shorter the benchmark, the less
reliable the timing numbers are (the higher the deviation); the longer
it runs, the more reliable the numbers are (the lower the deviation).

