[Mesa-dev] Mesa CI with trace regression testing

Eric Anholt eric at anholt.net
Mon Sep 30 16:56:46 UTC 2019


Eero Tamminen <eero.t.tamminen at intel.com> writes:

> Hi,
>
> On 27.9.2019 4.32, Eric Anholt wrote:
>> Alexandros Frantzis <alexandros.frantzis at collabora.com> writes:
>>> Over the last couple of months we (at Collabora) have been working on a
>>> prototype for a Mesa testing system based on trace replays that supports
>>> correctness regression testing and, in the future, performance
>>> regression testing.
>>>
>>> We are aware that large-scale CI systems that perform extensive checks
>>> on Mesa already exist. However, our goal is not to reach that kind of
>>> scale or exhaustiveness, but to produce a system that will be simple and
>>> robust enough to be maintained by the community, while being useful
>>> enough so that the community will want to use and maintain it. We also
>>> want to make it fast enough that it can eventually be run on a regular
>>> basis, ideally in a pre-commit fashion.
>>>
>>> The current prototype focuses on the correctness aspect, replaying
>>> traces and comparing images against a set of reference images on
>>> multiple devices. At the moment, we run on softpipe and
>>> intel/chromebook, but it's straightforward to add other devices through
>>> gitlab runners.
>>>
>>> For the prototype we have used a simple approach for image comparison,
>>> storing a separate set of reference images per device and using exact
>>> image comparison, but we are also investigating alternative ways to deal
>>> with this. First results indicate that the frequency of reference image
>>> mismatches due to non-bug changes in Mesa is acceptable, but we will get
>>> a more complete picture once we have a richer set of traces and a longer
>>> CI run history.
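>>>
>>> For illustration, here is a minimal Python sketch of that
>>> exact-comparison scheme (the directory layout, file names and the
>>> checksum choice are assumptions for the example, not necessarily what
>>> the tracie scripts actually do):
>>>
>>>     import hashlib
>>>     from pathlib import Path
>>>
>>>     def checksum(path):
>>>         """Return the SHA-256 hex digest of an image file."""
>>>         return hashlib.sha256(Path(path).read_bytes()).hexdigest()
>>>
>>>     def compare_trace_image(device, trace_name, replayed_image):
>>>         """Exact comparison against the per-device reference image.
>>>
>>>         Assumes one reference set per device, e.g.
>>>         references/softpipe/<trace>.png, references/intel/<trace>.png.
>>>         """
>>>         reference = Path("references") / device / (trace_name + ".png")
>>>         return checksum(reference) == checksum(replayed_image)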
>
> For CI, I think discarding/ignoring traces that are too unstable or too
> slow to run would be perfectly acceptable. [1]
>
>
>> Some missing context: I was told that across 2400 commits, testing
>> glmark2 plus a couple of other open source traces on intel, there was one
>> spurious failure due to this diff method.  That is a lower rate than I
>> remember from when I did this in piglit on vc4, but then I was very
>> actively changing compiler optimizations while I was using that tool.
>
> A few years ago, when I was looking at the results from ezBench (which
> was at the same time bisecting Mesa commit ranges for build, run-time,
> performance and rendering issues), it was very useful to have
> rendering diff results in addition to performance numbers.
>
> Rendering didn't change too often, but one needs to look at every changed
> screenshot directly; error metrics about them aren't enough.  An innocent
> accuracy difference due to a change in calculation order can cause e.g. a
> marginally different color over a huge area of the rendered result [1],
> whereas a real rendering error can be just some tiny reflection missing
> from the render, which one would never notice in the running benchmark
> (even with a correct one running beside it); one sees it only in the
> static screenshots.
>
> [1] Whether screenshots are affected by calculation changes depends
> a lot on the benchmark, i.e. on how stable its calculations are with
> regard to accuracy variations.  Some benchmarks even use randomness in
> their shaders...
>
> (If I remember correctly, a good example of unstable results was some
> of the GpuTest benchmarks.)
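>
> To make that concrete, a small sketch of the kind of check I mean
> (Python with numpy and Pillow; the helper names are just illustrative,
> not from any existing tool): a single global error number can't tell a
> harmless frame-wide color shift from a small missing reflection, so a
> diff image for manual inspection is still needed.
>
>     import numpy as np
>     from PIL import Image
>
>     def rms_error(a_path, b_path):
>         """Global root-mean-square difference between two screenshots."""
>         a = np.asarray(Image.open(a_path).convert("RGB"), dtype=np.float32)
>         b = np.asarray(Image.open(b_path).convert("RGB"), dtype=np.float32)
>         return float(np.sqrt(np.mean((a - b) ** 2)))
>
>     def save_diff_image(a_path, b_path, out_path, scale=8):
>         """Write an amplified per-pixel diff image for manual inspection."""
>         a = np.asarray(Image.open(a_path).convert("RGB"), dtype=np.int16)
>         b = np.asarray(Image.open(b_path).convert("RGB"), dtype=np.int16)
>         diff = np.clip(np.abs(a - b) * scale, 0, 255).astype(np.uint8)
>         Image.fromarray(diff).save(out_path)
>
>     # A +1 shift on every channel of a 1920x1080 frame gives an RMS of
>     # 1.0, while a faint 20x20 reflection disappearing entirely can score
>     # well below that, even though only the latter is a real bug.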
>
>
>>> The current design is based on an out-of-tree approach, where the tracie
>>> CI works independently of Mesa CI, fetching and building the latest
>>> Mesa on its own. We did this for maximum flexibility in the prototyping
>>> phase, but this has a complexity cost, and although we could continue to
>>> work this way, we would like to hear people's thoughts about eventually
>>> integrating with Mesa more closely, by becoming part of the upstream
>>> Mesa testing pipelines.
>>>
>>> It's worth noting that the last few months other people, most notably
>>> Eric Anholt, have made proposals to extend the scope of testing in CI.
>>> We believe there is much common ground here (multiple devices,
>>> deployment with gitlab runners) and room for cooperation and eventual
>>> integration into upstream Mesa. In the end, the main difference between
>>> all these efforts is the kind of tests (deqp, traces, performance) that
>>> are being run, all of which have their place and offer different
>>> trade-offs.
>>>
>>> We have also implemented a prototype dashboard to display the results,
>>> which we have deployed at:
>>>
>>> https://tracie.freedesktop.org
>>>
>>> We are working to improve the dashboard and provide more value by
>>> extracting and displaying additional information, e.g., "softpipe broken
>>> since commit NNN".
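>>>
>>> As a rough sketch of how that could work (the result layout below is
>>> hypothetical, just to show the idea): given one device's job results in
>>> commit order, the dashboard would report the first commit of the
>>> current run of failures.
>>>
>>>     def broken_since(results):
>>>         """results: list of (commit, passed) tuples, oldest first.
>>>
>>>         Return the first commit of the trailing run of failures,
>>>         or None if the latest result passed.
>>>         """
>>>         if not results or results[-1][1]:
>>>             return None
>>>         first_bad = results[-1][0]
>>>         for commit, passed in reversed(results):
>>>             if passed:
>>>                 break
>>>             first_bad = commit
>>>         return first_bad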
>>>
>>> The dashboard is currently specific to the trace playback results, but
>>> it would be nice to eventually converge to a single MesaCI dashboard
>>> covering all kinds of Mesa CI test results. We would be happy to help
>>> develop in this direction if there is interest.
>>>
>>> You can find the CI scripts for tracie at:
>>>
>>> https://gitlab.freedesktop.org/gfx-ci/tracie/tracie
>>>
>>> Code for the dashboard is at:
>>>
>>> https://gitlab.freedesktop.org/gfx-ci/tracie/tracie_dashboard
>>>
>>> Here is an example of a failed CI job (for a purposefully broken Mesa
>>> commit) and the report of the failed trace (click on the red X to
>>> see the image diffs):
>>>
>>> https://tracie.freedesktop.org/dashboard/job/642369/
>>>
>>> Looking forward to your thoughts and comments.
>> 
>> A couple of thoughts on this:
>> 
>> A separate dashboard is useful if we have traces that are too slow to
>> run pre-merge or are not redistributable.  For traces that are
>> redistributable and cheap to run, we should run them in our CI and block
>> the merge, instead of having someone watch an external dashboard and
>> report issues to be patched up after regressions have already landed.
>> 
>> I'm reluctant to add "maintain a web service codebase" as one of the
>> things that the Mesa project does, if there are alternatives that don't
>> involve that.  I've been thinking about a perf dashboard, and for that
>> I'd like to reuse existing open source projects like grafana.  If we
>> start our own dashboard project, are we going to end up reimplementing
>> that one?
>
> FYI: We tried Grafana, and while it's otherwise nice and fast, we
> didn't find a way to make each data point in a graph a link to
> additional data (logs, screenshots etc.), which IMHO makes it much less
> useful for trend tracking.
>
> (If there actually *is* a way to add links to each data point, I would
> be very much interested.)

It looks like one can do so at the graph level now:

https://grafana.com/docs/features/panels/graph/

I thought there was also a way to attach arbitrary extra data to a
metric, but I'm not finding it in a quick search right now.