[poppler] [RFC] Extend regtest framework to track performance

Adam Reichold adam.reichold at t-online.de
Wed Dec 30 15:58:10 PST 2015


Hello again,

On 31.12.2015 at 00:21, Albert Astals Cid wrote:
> On Wednesday 30 December 2015 at 20:00:17, Ihar Filipau wrote:
>> On 12/30/15, Albert Astals Cid <aacid at kde.org> wrote:
>>> On Wednesday 30 December 2015 at 17:04:42, Adam Reichold wrote:
>>>> Hello again,
>>>>
>>>> as discussed in the code modernization thread, if we are going to make
>>>> performance-oriented changes, we need a simple way to track functional and
>>>> performance regressions.
>>>>
>>>> The attached patch tries to extend the existing Python-based regtest
>>>> framework to measure run time and memory usage to spot significant
>>>> performance changes in the sense of relative deviations w.r.t. these
>>>> two parameters. It also collects the sums of both, which might be used as
>>>> "ball park" numbers to compare the performance effect of changes over
>>>> document collections.
>>>
>>> Have you tried it? How stable are the numbers? For example, here I get,
>>> for rendering the same file (discarding the first run, which loads the
>>> file into memory), numbers that range from 620ms to 676ms, i.e. ~10%
>>> variation with no change at all.

Do you refer to the numbers provided by the patch or to manually running
e.g. pdftoppm? If you refer to the patch, which iteration counts did you
use?

>> To make the timing numbers stable, the benchmark framework should
>> repeat the test a few times. IME at least three times. (I often do as
>> many as five runs.)
>>
>> The final result is a pair: the average of the timings among all runs,
>> and (for example) the standard deviation (or simply the distance to the
>> min/max value) computed over all the timing numbers.
>>
>> {I occasionally test performance on an embedded system running off
>> flash (no spinning disks, no network, nothing to screw up the timing),
>> yet I still get variations as high as 5%. Performance testing on a PC
>> is an even trickier business: some go as far as rebooting the system in
>> single-user mode and shutting down all unnecessary services. Pretty
>> much everything running in the background - and foreground, e.g. the
>> GUI - can contribute to the unreliability of the numbers.}
>>
>> For a benchmark on normal Linux/etc., I would advise performing the
>> test once to "warm up" the caches, and only then starting with the
>> measured test runs.
>>
>> Summary:
>> 1. A performance test framework should do a "warm-up" phase whose
>> timing is discarded.
>> 2. A performance test framework should repeat the test 3/5/etc. times,
>> collecting the timing information.
>> 3. The collected timings are averaged and the deviation (or distance to
>> min/max) is computed. The average is the official benchmark result;
>> the deviation etc. is an indication of the reliability of the
>> benchmark.

The previously attached patch performs a configurable number of warm-up
iterations and measurement iterations and collects the mean and sample
standard deviation from the measurement runs. The default is currently 5
warm-up iterations and 10 measurement iterations, but as this is an RFC,
these numbers are very much open to discussion.
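
For illustration, here is roughly what such a measurement loop boils
down to (only a sketch for this mail, with made-up names and a made-up
command line, not the code from the patch):

    import math
    import os
    import subprocess
    import time

    def measure(command, warmup=5, iterations=10):
        with open(os.devnull, 'wb') as devnull:
            # Warm-up runs populate the caches; their timings are discarded.
            for _ in range(warmup):
                subprocess.call(command, stdout=devnull, stderr=devnull)

            # Measurement runs: collect the wall-clock time of each run.
            times = []
            for _ in range(iterations):
                start = time.time()
                subprocess.call(command, stdout=devnull, stderr=devnull)
                times.append(time.time() - start)

        # Mean and sample standard deviation (Bessel's correction, n - 1).
        mean = sum(times) / len(times)
        stddev = math.sqrt(sum((t - mean) ** 2 for t in times) /
                           (len(times) - 1))
        return mean, stddev

    mean, stddev = measure(['pdftotext', 'some-document.pdf', '/dev/null'])
    print('%.3f s +/- %.3f s' % (mean, stddev))

Memory usage can be collected in a similar fashion, e.g. via
resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss, with the caveat
that this reports the maximum over all waited-for child processes.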

>> fyi.
>>
>> P.S. Note that 600ms is an OK-ish duration for a benchmark: not too
>> short, not too long.
> 
> 600 ms is the rendering time of one page of one of the 1600 files in one of
> the 3 or 4 backends.
> 
> ;)

Yes, I did try it with a small selection of documents from the part of
the regression test suite that is available to me. I am currently
running it on all 1300 documents, but that will take some more time to
complete.

I did see variations as high as five percent as well, at least with the
5 warm-up and 10 measurement iterations mentioned above. For me, memory
usage shows a higher variation than run time: for example, on a dozen
mathematical texts, I always get less than one percent change in run
time and less than five percent change in memory usage with these
iteration counts.

Of course, measuring the gross properties of running commands like
pdftotext is a rather coarse experiment: it shows a lot of variation
and is influenced by system-wide effects, especially on desktop systems
that often do other things besides running benchmarks.

But the upside is that the code under test can be run and measured as
is, without modifications that might affect the measurement itself. And
as with most experimental data, random statistical variation can be
reduced by increasing the number of iterations. Also, as Ihar indicated,
the system-wide effects can be minimized by more careful preparation.
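
To make "significant" a bit more concrete: one way the comparison could
work is to report a change only if the relative deviation of the means
exceeds both a fixed threshold and the observed noise (again only a
sketch with made-up names, not literally what the patch does):

    def is_significant(old_mean, old_stddev, new_mean, new_stddev,
                       threshold=0.05):
        # Relative deviation of the new mean w.r.t. the old mean.
        relative = (new_mean - old_mean) / old_mean
        # Measurement noise of both runs, also relative to the old mean.
        noise = (old_stddev + new_stddev) / old_mean
        # Report a change only if it exceeds the fixed threshold (5% here)
        # and also stands out from the noise of the two measurements.
        return abs(relative) > max(threshold, noise)

The same check would apply to run time and memory usage separately, and
summing the per-document means would give the "ball park" numbers
mentioned in my original mail.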

Therefore I do think that this approach is viable. To get more
fine-grained results, I would suggest using microbenchmarks, e.g. the
facilities provided by QTest. This is not the Java Microbenchmark
Harness, but I do think it can be of benefit for continuously tracking
Poppler's performance.

Regards, Adam.

>> But, generally, the shorter the duration of the
>> benchmark, the less reliable the timing numbers are (the higher the
>> deviation); the longer the duration, the more reliable the numbers
>> are (the lower the deviation).
