[poppler] [RFC] Extend regtest framework to track performance

Adam Reichold adam.reichold at t-online.de
Fri Jan 1 06:47:13 PST 2016

Hello again,

as discussed below, using separate backends to implement testing for
performance regressions is not seen as a viable option. Since QTest was
also ruled out for implementing the actual measurement, attached is another
prototype implementation of a 'perftest' tool using Python.

Also as discussed below, I prefer not to do a large extension of the
regtest framework and hence created a separate Python-3-based
implementation. The redundancy w.r.t. the existing regtest code base
is minimal IMHO and the reboot is a rather simple implementation, e.g. using
multiprocessing.Pool to handle task-based parallelism.
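For illustration, the task-based parallelism can be sketched like this (the function names are made up for the example; the real measurement of course runs the poppler utilities instead of this placeholder):

```python
from multiprocessing import Pool

def measure_document(path):
    # Hypothetical stand-in for the real per-document measurement;
    # here it just returns the length of the path string.
    return path, len(path)

def measure_all(paths, jobs=4):
    # One pool task per document gives simple task-based parallelism.
    with Pool(processes=jobs) as pool:
        return dict(pool.map(measure_document, paths))
```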

The test driver supports two sub-commands:
* 'measure' together with a mode of 'splash', 'cairo' or 'text' which
will measure all documents in the given directory tree and store the
results in a compressed pickle file.
* 'compare' which takes two such result files and reports any
significant deviations and cumulative statistics to standard output.
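To give an idea of what 'compare' does, here is a rough sketch (the function names and the 5% threshold are illustrative assumptions, not the actual tool code):

```python
import gzip
import pickle

def load_results(path):
    # Measurement runs are stored as compressed pickle files.
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def significant_deviations(old, new, threshold=0.05):
    # Report relative deviations above the threshold for documents
    # present in both result sets.
    report = []
    for key in sorted(old.keys() & new.keys()):
        if old[key] == 0:
            continue
        deviation = (new[key] - old[key]) / old[key]
        if abs(deviation) >= threshold:
            report.append((key, deviation))
    return report
```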

The measurement mechanism is basically the same with only two differences:
* The driver can do per-document-and-page measurements in addition to
per-document measurements (controlled via a command-line switch).
* The driver will continuously check the accuracy of the result and stop
doing repetitions once the target accuracy is reached.
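The second point could be sketched roughly as follows, assuming the accuracy criterion is the relative standard error of the mean (a simplified illustration, not the driver's actual code):

```python
import statistics

def measure_until_accurate(run_once, target=0.01, min_runs=5, max_runs=100):
    # Repeat the measurement until the relative standard error of the
    # mean falls below the target accuracy, or max_runs is reached.
    samples = []
    while len(samples) < max_runs:
        samples.append(run_once())
        if len(samples) >= min_runs:
            mean = statistics.mean(samples)
            sem = statistics.stdev(samples) / len(samples) ** 0.5
            if mean > 0 and sem / mean <= target:
                break
    return statistics.mean(samples), len(samples)
```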

While the results are pretty much the same as with the regtest-backend
variant, i.e. deviations of less than 1% between runs of the same
code base at a 1% target accuracy for the measurements, per-page
measurements look more complicated for now:

* The accuracy relationship between run time and memory usage is reversed,
i.e. due to the very short run time of a single page, the run time
measurements are much less accurate than the memory usage measurements.

* At least using 25, 50 and 100 iterations, the accuracy for single page
run time measurements seems to plateau at 2% on my system. (Full
document measurements are usually at less than 1%.)
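For reference, the Unix-only measurement primitive underlying these numbers can be sketched with fork/exec plus wait4, similar to the wait3-based accounting mentioned earlier in the thread (a simplified stand-in, not the driver's actual code):

```python
import os
import time

def measure_once(argv):
    # Run argv once; return (wall_seconds, peak_rss) taken from the
    # rusage record that wait4 reports for the reaped child.
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        # Child: silence stdout, then exec the command under test.
        try:
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, 1)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    _, status, rusage = os.wait4(pid, 0)
    return time.monotonic() - start, rusage.ru_maxrss
```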

Sensible next steps are IMHO to improve the accuracy of the per-page
measurements and to add a sub-command for reporting (and plotting).

Best regards, Adam.

On 31.12.2015 at 14:33, Carlos Garcia Campos wrote:
> Adam Reichold <adam.reichold at t-online.de> writes:
>> Hello,
>> On 31.12.2015 at 12:55, Carlos Garcia Campos wrote:
>>> Adam Reichold <adam.reichold at t-online.de> writes:
>>>> Hello,
>>>> On 30.12.2015 at 19:17, Carlos Garcia Campos wrote:
>>>>> Albert Astals Cid <aacid at kde.org> writes:
>>>>>> On Wednesday 30 December 2015 at 17:04:42, Adam Reichold wrote:
>>>>>>> Hello again,
>>>>>>> as discussed in the code modernization thread, if we are going to make
>>>>>>> performance-oriented changes, we need a simple way to track functional
>>>>>>> and performance regressions.
>>>>>>> The attached patch tries to extend the existing Python-based regtest
>>>>>>> framework to measure run time and memory usage to spot significant
>>>>>>> performance changes in the sense of relative deviations w.r.t. these
>>>>>>> two parameters. It also collects the sums of both, which might be used as
>>>>>>> "ball park" numbers to compare the performance effect of changes over
>>>>>>> document collections.
>>>>>> Have you tried it? How stable are the numbers? For example, here I get,
>>>>>> for rendering the same file (discarding the first run, which loads the
>>>>>> file into memory), numbers that range from 620 ms to 676 ms, i.e. ~10%
>>>>>> variation without any change at all.
>>>>> I haven't looked at the patches in detail yet, but I don't think the
>>>>> regtest framework should be the one measuring. I would add a tool for
>>>>> that, pdfperf or something like that, that could measure the internals
>>>>> (parsing, output devs, etc.) in a more detailed way. If we
>>>>> need to get more information from the poppler core we could just add a
>>>>> compile option to provide that info. And the regtest framework should
>>>>> just run the perf command and collect the results. A report command
>>>>> could compare results with refs or previous executions. I also think
>>>>> performance tests should be run with a different command, since we don't
>>>>> want to measure the performance of every single document we have in our
>>>>> test suite (it would take too long). Instead, we could select a set of
>>>>> documents with different kind of pdf features to run perf tests on those
>>>>> only.
>>>>> So, in summary, I would use a dedicated tool (depending on a build
>>>>> option if we need to get more info from the core), and maybe even a
>>>>> dedicated perf framework on top of that tool, since I consider perf tests
>>>>> different from rendering regression tests, the same way unit tests are
>>>>> handled by a different framework too.
>>>> I agree that a dedicated tool might provide more detailed information.
>>>> But with the limited resources we have, even some information might be
>>>> useful. Of course we should make it reliable, e.g. by improving upon the
>>>> measurement procedure.
>>> Yes, you are right. We can probably start with something like this. I
>>> think there's an intermediate solution, though. I think we could at
>>> least measure times per page, since there are documents with lots of
>>> pages, and sometimes it's one page containing a complex pattern or image
>>> the one causing the regression on the whole document. Measuring every
>>> page makes it easier to track those regressions but also provides more
>>> accurate measurements than whole document times. For that we would
>>> probably need to change the tools (pdftoppm/cairo/text) to provide
>>> rendering times per page (using a command line switch, for example).
>> This does sound like a useful addition, but I think it should also be
>> possible without adding benchmark options to the utilities: We can just
>> time several invocations that each render a single page after querying
>> the number of pages in the benchmark driver. The only downside to this
>> seems to be that this will make the individual benchmark durations
>> shorter and hence we will need more iterations to battle random
>> fluctuations.
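Something like the following sketch is what I have in mind (the helper names are made up; pdftoppm's -f/-l switches select the page range, and pdfinfo reports the page count):

```python
import re
import subprocess

def parse_page_count(pdfinfo_output):
    # Extract the "Pages:" line from pdfinfo's standard output.
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    return int(match.group(1))

def page_count(pdf):
    out = subprocess.run(["pdfinfo", pdf], capture_output=True,
                         text=True, check=True).stdout
    return parse_page_count(out)

def render_single_page(pdf, page):
    # Render exactly one page; pdftoppm writes the image to stdout
    # when no output file root is given, so discard it.
    subprocess.run(["pdftoppm", "-f", str(page), "-l", str(page), pdf],
                   stdout=subprocess.DEVNULL, check=True)
```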
>>> I'm fine with extending the regtest framework, but what we don't want
>>> for sure is mixing performance and rendering tests. Our current test
>>> suite takes a long time, and also I don't think we should test
>>> performance in the same way. So, I would add a different subcommand
>>> run-perf-test, for example, to be able to run a different subset of
>>> tests and store the results differently (using a JSON file or
>>> whatever, without affecting the rendering checksums). I'm not sure
>>> having references like for regtests is the best approach. In this case I
>>> would just keep the results of every revision. And a different
>>> subcommand could be implemented to compare results, producing the report
>>> with the improvements/regressions.
>>> What do you think?
>> The main motivation for the current approach - that is adding additional
>> backends perftext and perfsplash to do measurements - was to reuse as
>> much of the existing regtest code base as possible. And I think that the
>> patch set shows how little code is necessary to implement it in this
>> particular way. Of course, it is somewhat of a hack as the abuse of the
>> _match_checksums backend method to compute the relative deviations shows.
> Ah, they are new backends. This is indeed a hack because they are not
> actually backends. I'm sorry that I haven't had time to look at the
> patches in detail yet. I don't think adding new subcommands would be
> much harder, I can help with that if needed.
>> I also think that it does fulfil two of the above requirements: Since
>> these are separate backends, they will not affect the rendering
>> regression tests in any way. And since we can invoke the regtest
>> framework with different backends and test directories, we can already
>> use separate document selections for checking for functional and
>> performance regressions.
> I see. It's useful for playing with it, but not a solution I would
> upstream.
>> I also agree that having a reference point is not really suitable for
>> performance measurements and that we should just create outputs for each
>> revision and generate reports about the changes over time. (But then
>> again, the regtest framework could generally be made to work that way.
>> After all, the reference revision is special only to the user as the
>> current implementation already shows.)
>> In any case, if I drop this patch and significantly diverge from the
>> regtest framework, I would probably just start over to have something
>> simple and self-contained, quite possibly implementing the separate
>> pdfperf application to be able to hook directly into the library as well
>> and hopefully reusing the IMHO rather nice benchmarking capabilities of
>> the QTest framework. Either way, it will probably take several days
>> instead of a single day like this patch.
> I don't mind either way, but if something can be reused from the
> regtests framework, we could probably simply add a couple of subcommands
> as I suggested. I don't think we should depend on Qt for this, though.
>> As to which way is taken: I will abide by the preferences of the
>> maintainers here: If you want the proper benchmark driver, I will gladly
>> try to implement that. If you can live with this patch, I will be happy
>> to maintain the perf backends of the regtest framework in the future. (It
>> is just that a significant extension of the already quite complicated
>> regtest framework is something I would like to avoid.)
> I don't think it would be that much code to extend the current
> framework, but if you are going to do it, it's up to you. I can help to
> extend the framework if needed, though.
>>>> Also, users probably won't care which part of the library produced
>>>> a performance regression, so the overall numbers are indeed
>>>> interesting IMHO.
>>> This is not for users, this is for us :-) and we need to know which part
>>> of the code introduced the regression. I agree we can start with a
>>> simpler approach, but at least knowing if the problem is in a particular
>>> page or all pages regressed would help a lot to identify the problem.
>> What I meant was: if the user doesn't care, we shouldn't care either, in
>> the sense that it does not matter if a regression was introduced by the
>> parsing or the rendering code. Fixing it will usually involve additional
>> profiling using proper instrumentation in any case.
>> Best regards, Adam.
>>>> Especially since a developer can always do proper
>>>> profiling when looking into a specific regression. Microbenchmarks, e.g.
>>>> using QTest, also provide a certain balance w.r.t. these issues, as they
>>>> can be used to continuously observe the performance of specific portions
>>>> of the code base with more or less the same overhead as a unit test.
>>>> Best regards, Adam.
>>>>>> Cheers,
>>>>>>   Albert
>>>>>>> The patch runs the measured commands repeatedly including warm-up
>>>>>>> iterations and collects statistics from these runs. The measurement
>>>>>>> results are stored as JSON documents with the actual program output of
>>>>>>> e.g. pdftotext or pdftoppm being discarded.
>>>>>>> To implement the check for relative deviations, it abuses the checksum
>>>>>>> comparison method and hence checksums are still computed for the JSON
>>>>>>> documents even though they are actually unnecessary. It is also limited
>>>>>>> to Unix-like operating systems (due to the use of the wait3 syscall to
>>>>>>> determine resource usage similar to the time command).
>>>>>> _______________________________________________
>>>>>> poppler mailing list
>>>>>> poppler at lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add-perftest.patch
Type: text/x-patch
Size: 12094 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20160101/b5f70f00/attachment.bin>
