[poppler] [RFC] Extend regtest framework to track performance

Adam Reichold adam.reichold at t-online.de
Thu Dec 31 04:47:26 PST 2015


Hello,

Am 31.12.2015 um 12:55 schrieb Carlos Garcia Campos:
> Adam Reichold <adam.reichold at t-online.de> writes:
> 
>> Hello,
>>
>> Am 30.12.2015 um 19:17 schrieb Carlos Garcia Campos:
>>> Albert Astals Cid <aacid at kde.org> writes:
>>>
>>>> El Wednesday 30 December 2015, a les 17:04:42, Adam Reichold va escriure:
>>>>> Hello again,
>>>>>
>>>>> as discussed in the code modernization thread, if we are going to make
>>>>> performance-oriented changes, we need a simple way to track functional
>>>>> and performance regressions.
>>>>>
>>>>> The attached patch tries to extend the existing Python-based regtest
>>>>> framework to measure run time and memory usage in order to spot
>>>>> significant performance changes in the sense of relative deviations
>>>>> w.r.t. these two parameters. It also collects the sums of both, which
>>>>> might be used as "ballpark" numbers to compare the performance effect
>>>>> of changes over document collections.
>>>>
>>>> Have you tried it? How stable are the numbers? For example, here I get,
>>>> for rendering the same file (discarding the first run, which loads the
>>>> file into memory), numbers that range from 620 ms to 676 ms, i.e. ~10%
>>>> variation without any change at all.
>>>
>>> I haven't looked at the patches in detail yet, but I don't think the
>>> regtest framework should be the one measuring. I would add a tool for
>>> that, pdfperf or something like that, that could measure the internals
>>> (parsing, output devs, etc.) in a more detailed way. If we
>>> need to get more information from the poppler core we could just add a
>>> compile option to provide that info. And the regtest framework should
>>> just run the perf command and collect the results. A report command
>>> could compare results with refs or previous executions. I also think
>>> performance tests should be run with a different command, since we don't
>>> want to measure the performance of every single document we have in our
>>> test suite (it would take too long). Instead, we could select a set of
>>> documents with different kinds of PDF features and run perf tests on
>>> those only.
>>>
>>> So, in summary, I would use a dedicated tool (depending on a build
>>> option if we need to get more info from the core), and maybe even a
>>> dedicated perf framework on top of that tool, since I consider perf tests
>>> different from rendering regression tests, the same way unit tests are
>>> handled by a different framework too.
>>
>> I agree that a dedicated tool might provide more detailed information.
>> But with the limited resources we have, even some information might be
>> useful. Of course we should make it reliable, e.g. by improving upon the
>> measurement procedure.
> 
> Yes, you are right. We can probably start with something like this. I
> think there's an intermediate solution, though. I think we could at
> least measure times per page, since there are documents with lots of
> pages, and sometimes it's a single page containing a complex pattern or
> image that causes the regression for the whole document. Measuring every
> page makes it easier to track those regressions but also provides more
> accurate measurements than whole-document times. For that we would
> probably need to change the tools (pdftoppm/cairo/text) to provide
> rendering times per page (using a command line switch, for example).

This does sound like a useful addition, but I think it should also be
possible without adding benchmark options to the utilities: after querying
the number of pages, the benchmark driver can simply time separate
invocations that each render a single page. The only downside seems to be
that this makes the individual benchmark durations shorter, so we will need
more iterations to counter random fluctuations.
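Something along these lines in the driver should do (just a rough sketch
with made-up helper names, assuming pdfinfo and pdftoppm are on the PATH):

    import os
    import re
    import subprocess
    import tempfile
    import time

    def page_count(pdf):
        # Parse the "Pages:" line printed by pdfinfo.
        info = subprocess.check_output(["pdfinfo", pdf]).decode()
        return int(re.search(r"^Pages:\s+(\d+)", info, re.MULTILINE).group(1))

    def time_page(pdf, page, iterations=5):
        # Render a single page several times and keep the best wall-clock
        # time; the rendered images go to a throw-away directory.
        best = float("inf")
        with tempfile.TemporaryDirectory() as scratch:
            for _ in range(iterations):
                start = time.monotonic()
                subprocess.check_call(
                    ["pdftoppm", "-f", str(page), "-l", str(page), pdf,
                     os.path.join(scratch, "page")],
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                best = min(best, time.monotonic() - start)
        return best

    def per_page_times(pdf):
        return {page: time_page(pdf, page)
                for page in range(1, page_count(pdf) + 1)}

Keeping the best of a few repetitions per page should also help with the
~10% run-to-run variation Albert observed.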

> I'm fine with extending the regtest framework, but what we don't want
> for sure is mixing performance and rendering tests. Our current test
> suite takes a long time, and also I don't think we should test
> performance in the same way. So, I would add a different subcommand
> run-perf-test, for example, to be able to run a different subset of
> tests and store the results differently (using a JSON file or
> whatever, without affecting the rendering checksums). I'm not sure
> having references like we do for regtests is the best approach. In this case I
> would just keep the results of every revision. And a different
> subcommand could be implemented to compare results, producing the report
> with the improvements/regressions.
> 
> What do you think?

The main motivation for the current approach - that is, adding the
additional backends perftext and perfsplash to do the measurements - was
to reuse as much of the existing regtest code base as possible, and I
think the patch set shows how little code is necessary to implement it in
this particular way. Of course, it is somewhat of a hack, as the abuse of
the _match_checksums backend method to compute the relative deviations shows.
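To make the "relative deviation" check concrete: stripped of the backend
plumbing, the comparison essentially boils down to something like the
following standalone sketch (the metric names are only illustrative, not
necessarily the keys the patch actually records):

    import json

    def relative_deviation(reference, measured):
        # Relative change of the measured value w.r.t. the reference.
        return (measured - reference) / reference

    def check_measurement(ref_path, new_path, threshold=0.1):
        # Compare two measurement records and report the metrics whose
        # relative deviation exceeds the threshold (10% by default).
        with open(ref_path) as f:
            ref = json.load(f)
        with open(new_path) as f:
            new = json.load(f)

        regressions = {}
        for metric in ("time", "memory"):  # illustrative keys only
            deviation = relative_deviation(ref[metric], new[metric])
            if abs(deviation) > threshold:
                regressions[metric] = deviation
        return regressions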

I also think that it does fulfil two of the above requirements: since
these are separate backends, they will not affect the rendering
regression tests in any way. And since we can invoke the regtest
framework with different backends and test directories, we can already
use separate document selections for checking for functional and
performance regressions.

I also agree that having a reference point is not really suitable for
performance measurements and that we should just create outputs for each
revision and generate reports about the changes over time. (But then
again, the regtest framework could generally be made to work that way.
After all, the reference revision is special only to the user, as the
current implementation already shows.)
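In such a scheme, storing and reporting would amount to little more than
the following sketch (the directory layout and key names are made up; the
revision identifier would come from git rev-parse HEAD):

    import json
    import subprocess
    from pathlib import Path

    RESULTS_DIR = Path("perf-results")  # made-up location

    def current_revision():
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True,
                              check=True).stdout.strip()

    def store_results(results):
        # One JSON document per revision; rerunning a revision overwrites it.
        RESULTS_DIR.mkdir(parents=True, exist_ok=True)
        path = RESULTS_DIR / (current_revision() + ".json")
        path.write_text(json.dumps(results, indent=2))

    def report(old_rev, new_rev):
        # Print the relative run-time change per document between two
        # stored revisions.
        old = json.loads((RESULTS_DIR / (old_rev + ".json")).read_text())
        new = json.loads((RESULTS_DIR / (new_rev + ".json")).read_text())
        for document in sorted(old.keys() & new.keys()):
            change = (new[document]["time"] - old[document]["time"]) \
                / old[document]["time"]
            print(f"{document}: {change:+.1%}")

A report subcommand could then simply be pointed at any two stored
revisions instead of a fixed reference.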

If I drop this patch and diverge significantly from the regtest framework,
I would probably just start over to have something simple and
self-contained, quite possibly implementing the separate pdfperf
application so that it can hook directly into the library as well, and
hopefully reusing the IMHO rather nice benchmarking capabilities of the
QTest framework. That route will probably take several days instead of
the single day this patch took.

As to which way we take, I will abide by the preferences of the
maintainers here: if you want the proper benchmark driver, I will gladly
try to implement that; if you can live with this patch, I will be happy
to maintain the perf backends of the regtest framework in the future.
(What I would like to avoid is just a significant extension of the
already quite complicated regtest framework.)

>> Also, users probably won't care about which part of the library
>> produced a performance regression, so the overall numbers are indeed
>> interesting IMHO.
> 
> This is not for users, this is for us :-) and we need to know which part
> of the code introduced the regression. I agree we can start with a
> simpler approach, but at least knowing whether the problem is in a
> particular page or whether all pages regressed would help a lot in
> identifying the problem.

What I meant was that if the user doesn't care, we shouldn't care either,
in the sense that it does not matter whether a regression was introduced
by the parsing or the rendering code. Fixing it will usually involve
additional profiling with proper instrumentation in any case.

Best regards, Adam.

>> Especially since a developer can always do proper
>> profiling when looking into a specific regression. Microbenchmarks, e.g.
>> using QTest, also provide a certain balance w.r.t. these issues, as they
>> can be used to continuously observe the performance of specific portions
>> of the code base with more or less the same overhead as a unit test.
>>
>> Best regards, Adam.
>>
>>>> Cheers,
>>>>   Albert
>>>>
>>>>>
>>>>> The patch runs the measured commands repeatedly, including warm-up
>>>>> iterations, and collects statistics from these runs. The measurement
>>>>> results are stored as JSON documents, with the actual program output
>>>>> of e.g. pdftotext or pdftoppm being discarded.
>>>>>
>>>>> To implement the check for relative deviations, it abuses the checksum
>>>>> comparison method, and hence checksums are still computed for the JSON
>>>>> documents even though they are actually unnecessary. It is also limited
>>>>> to Unix-like operating systems (due to the use of the wait3 syscall to
>>>>> determine resource usage, similar to the time command).
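
(For reference, the wait3-based measurement quoted above boils down to
waiting on the child process and reading the rusage record it leaves
behind; a minimal Python sketch, Unix-only and using os.wait4 in place of
the raw wait3 call, could look like this:)

    import os

    def measure(argv):
        # Fork, exec the command with its output discarded, and collect the
        # child's resource usage via wait4, much like the time(1) command.
        pid = os.fork()
        if pid == 0:
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, 1)
            os.dup2(devnull, 2)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec fails
        _, status, rusage = os.wait4(pid, 0)
        if status != 0:
            raise RuntimeError("command failed: %r" % (argv,))
        # User plus system CPU time in seconds; peak RSS in KiB on Linux.
        return rusage.ru_utime + rusage.ru_stime, rusage.ru_maxrss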
