[poppler] [RFC] Extend regtest framework to track performance

Carlos Garcia Campos carlosgc at gnome.org
Thu Dec 31 05:33:11 PST 2015


Adam Reichold <adam.reichold at t-online.de> writes:

> Hello,
>
> On 31.12.2015 at 12:55, Carlos Garcia Campos wrote:
>> Adam Reichold <adam.reichold at t-online.de> writes:
>> 
>>> Hello,
>>>
>>> On 30.12.2015 at 19:17, Carlos Garcia Campos wrote:
>>>> Albert Astals Cid <aacid at kde.org> writes:
>>>>
>>>>> On Wednesday 30 December 2015 at 17:04:42, Adam Reichold wrote:
>>>>>> Hello again,
>>>>>>
>>>>>> as discussed in the code modernization thread, if we are going to make
>>>>>> performance-oriented changes, we need a simple way to track functional and
>>>>>> performance regressions.
>>>>>>
>>>>>> The attached patch tries to extend the existing Python-based regtest
>>>>>> framework to measure run time and memory usage to spot significant
>>>>>> performance changes, in the sense of relative deviations w.r.t. these
>>>>>> two parameters. It also collects the sums of both, which might be used as
>>>>>> "ballpark" numbers to compare the performance effect of changes over
>>>>>> document collections.
>>>>>
>>>>> Have you tried it? How stable are the numbers? For example, here I get, for
>>>>> rendering the same file (discarding the first run, which loads the file
>>>>> into memory), numbers that range from 620ms to 676ms, i.e. ~10% variation
>>>>> without any change at all.
>>>>
>>>> I haven't looked at the patches in detail yet, but I don't think the
>>>> regtest framework should be the one measuring. I would add a tool for
>>>> that, pdfperf or something like that, that could measure the internals
>>>> (parsing, output devs, etc.) in a more detailed way. If we
>>>> need to get more information from the poppler core we could just add a
>>>> compile option to provide that info. And the regtest framework should
>>>> just run the perf command and collect the results. A report command
>>>> could compare results with refs or previous executions. I also think
>>>> performance tests should be run with a different command, since we don't
>>>> want to measure the performance of every single document we have in our
>>>> test suite (it would take too long). Instead, we could select a set of
>>>> documents with different kinds of PDF features to run perf tests on those
>>>> only.
>>>>
>>>> So, in summary, I would use a dedicated tool (depending on a build
>>>> option if we need to get more info from the core), and maybe even a
>>>> dedicated perf framework on top of that tool, since I consider perf tests
>>>> different from rendering regression tests, just as unit tests are
>>>> handled by a separate framework.
>>>
>>> I agree that a dedicated tool might provide more detailed information.
>>> But with the limited resources we have, even some information might be
>>> useful. Of course we should make it reliable, e.g. by improving upon the
>>> measurement procedure.
>> 
>> Yes, you are right. We can probably start with something like this. I
>> think there's an intermediate solution, though. I think we could at
>> least measure times per page, since there are documents with lots of
>> pages, and sometimes it's a single page containing a complex pattern or
>> image that causes the regression for the whole document. Measuring every
>> page makes it easier to track down those regressions and also provides more
>> accurate measurements than whole-document times. For that we would
>> probably need to change the tools (pdftoppm/cairo/text) to provide
>> rendering times per page (using a command line switch, for example).
>
> This does sound like a useful addition, but I think it should also be
> possible without adding benchmark options to the utilities: We can just
> time several invocations that each render a single page after querying
> the number of pages in the benchmark driver. The only downside to this
> seems to be that this will make the individual benchmark durations
> shorter and hence we will need more iterations to battle random
> fluctuations.
>
>> I'm fine with extending the regtest framework, but what we don't want
>> for sure is mixing performance and rendering tests. Our current test
>> suite takes a long time, and also I don't think we should test
>> performance in the same way. So, I would add a different subcommand
>> run-perf-test, for example, to be able to run a different subset of
>> tests and store the results differently (using a JSON file or
>> whatever, without affecting the rendering checksums). I'm not sure
>> having references as we do for regtests is the best approach. In this case I
>> would just keep the results of every revision. And a different
>> subcommand could be implemented to compare results, producing the report
>> with the improvements/regressions.
>> 
>> What do you think?
>
> The main motivation for the current approach - that is adding additional
> backends perftext and perfsplash to do measurements - was to reuse as
> much of the existing regtest code base as possible. And I think that the
> patch set shows how little code is necessary to implement it in this
> particular way. Of course, it is somewhat of a hack as the abuse of the
> _match_checksums backend method to compute the relative deviations shows.

Ah, they are new backends. This is indeed a hack because they are not
actually backends. I'm sorry that I haven't had time to look at the
patches in detail yet. I don't think adding new subcommands would be
much harder; I can help with that if needed.
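
Just to illustrate what I mean (only a sketch; the file layout and key
names are made up): run-perf-test would dump one JSON file per revision
mapping each document to its measurements, and the compare subcommand
would then diff two of those files:

    import json

    def load_results(path):
        # One JSON file per revision, e.g.
        # {"doc.pdf": {"time": 1.23, "memory": 45678}, ...}
        with open(path) as f:
            return json.load(f)

    def compare(old_path, new_path, threshold=0.1):
        # Print relative changes beyond the threshold as regressions or
        # improvements; this would back the report subcommand.
        old, new = load_results(old_path), load_results(new_path)
        for doc in sorted(old.keys() & new.keys()):
            for key in ("time", "memory"):
                before, after = old[doc][key], new[doc][key]
                if before and abs(after - before) / before > threshold:
                    print("%s %s: %+.1f%%"
                          % (doc, key, (after - before) / before * 100))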

> I also think that it does fulfil two of the above requirements: Since
> these are separate backends, they will not affect the rendering
> regression tests in any way. And since we can invoke the regtest
> framework with different backends and test directories, we can already
> use separate document selections for checking for functional and
> performance regressions.

I see. It's useful for playing with it, but not a solution I would
upstream.

> I also agree that having a reference point is not really suitable for
> performance measurements and that we should just create outputs for each
> revision and generate reports about the changes over time. (But then
> again, the regtest framework could generally be made to work that way.
> After all, the reference revision is special only to the user as the
> current implementation already shows.)
>
> In any case, if I drop this patch and significantly diverge from the
> regtest framework, I would probably just start over to have something
> simple and self-contained, quite possibly implementing the separate
> pdfperf application to be able to hook directly into the library as well
> and hopefully reusing the IMHO rather nice benchmarking capabilities of
> the QTest framework. Either way, it will probably take several days
> instead of the single day this patch took.

I don't mind either way, but if something can be reused from the
regtest framework, we could probably simply add a couple of subcommands
as I suggested. I don't think we should depend on Qt for this, though.

> As to which way is taken: I will abide by the preferences of the
> maintainers here: If you want the proper benchmark driver, I will gladly
> try to implement that. If you can live with this patch, I will be happy
> to maintain the perf backends of the regtest framework in future. (Just
> a significant extension of the already quite complicated regtest
> framework is something I would like to avoid.)

I don't think it would take that much code to extend the current
framework, but since you are going to do the work, it's up to you. I can
help to extend the framework if needed, though.

>>> Also, users probably won't care about which part of the library
>>> produced a performance regression, so the overall numbers are indeed
>>> interesting IMHO.
>> 
>> This is not for users, this is for us :-) and we need to know which part
>> of the code introduced the regression. I agree we can start with a
>> simpler approach, but at least knowing whether the problem is in a
>> particular page or whether all pages regressed would help a lot to identify it.
>
> What I meant was that if the user doesn't care, we shouldn't care either,
> in the sense that it does not matter whether a regression was introduced by the
> parsing or the rendering code. Fixing it will usually involve additional
> profiling using proper instrumentation in any case.
>
> Best regards, Adam.
>
>>> Especially since a developer can always do proper
>>> profiling when looking into a specific regression. Microbenchmarks, e.g.
>>> using QTest, also provide a certain balance w.r.t. these issues, as they
>>> can be used to continuously observe the performance of specific portions
>>> of the code base with more or less the same overhead as a unit test.
>>>
>>> Best regards, Adam.
>>>
>>>>> Cheers,
>>>>>   Albert
>>>>>
>>>>>>
>>>>>> The patch runs the measured commands repeatedly, including warm-up
>>>>>> iterations, and collects statistics from these runs. The measurement
>>>>>> results are stored as JSON documents with the actual program output of
>>>>>> e.g. pdftotext or pdftoppm being discarded.
>>>>>>
>>>>>> To implement the check for relative deviations, it abuses the checksum
>>>>>> comparison method and hence checksums are still computed for the JSON
>>>>>> documents even though they are actually unnecessary. It is also limited
>>>>>> to Unix-like operating systems (due to the use of the wait3 syscall to
>>>>>> determine resource usage, similar to what the time command does).
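
For reference, my reading is that the measurement itself boils down to
something like the following (only a rough sketch of the idea, not the
patch's actual code; the helper names are made up and it is Unix-only,
as noted above):

    import os
    import statistics
    import time

    def measure_once(argv):
        # Unix-only: run the command once and return (wall-clock seconds,
        # peak RSS); the child's resource usage comes from wait3, much like
        # the time command. ru_maxrss is reported in kilobytes on Linux.
        start = time.monotonic()
        pid = os.fork()
        if pid == 0:  # child: discard the program output, then exec
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, 1)
            os.dup2(devnull, 2)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)
        _, status, rusage = os.wait3(0)
        elapsed = time.monotonic() - start
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            raise RuntimeError("command failed: %r" % (argv,))
        return elapsed, rusage.ru_maxrss

    def measure(argv, warmup=2, runs=5):
        # Discard warm-up iterations (cold caches), then collect statistics
        # over the remaining runs.
        for _ in range(warmup):
            measure_once(argv)
        samples = [measure_once(argv) for _ in range(runs)]
        times = [t for t, _ in samples]
        return {"time_mean": statistics.mean(times),
                "time_stdev": statistics.stdev(times),
                "max_memory": max(m for _, m in samples)}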

-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462