[Mesa-dev] [PATCH] i965: Don't check for draw-time errors that cannot occur in core profile

Mon Aug 31 22:48:31 PDT 2015

Ian Romanick <idr at freedesktop.org> writes:

> ping. :)
>
> On 08/10/2015 11:48 AM, Matt Turner wrote:
>> On Mon, Aug 10, 2015 at 10:12 AM, Ian Romanick <idr at freedesktop.org> wrote:
>>> From: Ian Romanick <ian.d.romanick at intel.com>
>>>
>>> On many CPU-limited applications, this is *the* hot path.  The idea is
>>> to generate per-API versions of brw_draw_prims that elide some checks.
>>> This patch removes render-mode and "is everything in VBOs" checks from
>>> core-profile contexts.
>>>
>>> On my IVB laptop (which may have experienced thermal throttling):
>>>
>>> Gl32Batch7:     3.70955% +/- 1.11344%
>> 
>> I'm getting 3.18414% +/- 0.587956% (n=113) on my IVB, which probably
>> matches your numbers depending on your value of n.
>> 
>>> OglBatch7:      1.04398% +/- 0.772788%
>> 
>> I'm getting 1.15377% +/- 1.05898% (n=34) on my IVB, which probably
>> matches your numbers depending on your value of n.
>
> This is another thing that makes me feel a little uncomfortable with the
> way we've done performance measurements in the past.  If I run my test
> before and after this patch for 121 iterations, which I have done, I can
> cut the data at any point and oscillate between "no difference" or X%
> +/- some-large-fraction-of-X%.  Since the before and after code for the
> compatibility profile path should be identical, "no difference" is the
> only believable result.

That's pretty much expected, I believe. In essence, you are running 121
tests, each at a 95% confidence level, so you should expect around 6
(5% of 121) spurious "significant difference" results even when nothing
changed. That's not entirely true of course, since these are not 121
*independent* tests, but the basic problem remains.

You need to decide up front how many iterations you will run before
looking at the result. And you will still sometimes get the "wrong"
result. That's statistics for you.
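
The effect of cutting the data at an arbitrary point can be sketched
with a small simulation. This is only an illustration under stated
assumptions: both "before" and "after" samples are drawn from the same
normal distribution with known variance (which, as noted elsewhere in
this thread, real benchmark data is not), and the function name
simulate and its parameters are hypothetical, not part of any Mesa
tooling.

```python
import math
import random

def simulate(trials=2000, max_n=121, z_crit=1.96, seed=0):
    """A/A test: 'before' and 'after' come from the same distribution.

    Returns (fixed_rate, peek_rate):
      fixed_rate: fraction of trials flagged when looking only at n=max_n
      peek_rate:  fraction of trials flagged at *any* n in 1..max_n
    Assumes normal data with known sigma=1 (an idealisation).
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        flagged_anywhere = False
        z = 0.0
        for n in range(1, max_n + 1):
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            # z statistic for the difference of two means, sigma known
            z = (sum_a - sum_b) / math.sqrt(2.0 * n)
            if abs(z) > z_crit:
                flagged_anywhere = True
        peek_hits += flagged_anywhere
        fixed_hits += abs(z) > z_crit
    return fixed_hits / trials, peek_hits / trials

fixed_rate, peek_rate = simulate()
print("flagged at n=121 only:   ", fixed_rate)
print("flagged at some cut point:", peek_rate)
```

Under these assumptions the fixed-n rate should land near the nominal
5%, while the "cut anywhere" rate comes out several times higher, even
though there is no real difference between the two samples.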

Now, as Ilia said, this depends on the distribution being normal, which
it is emphatically not. But figuring out what effect that has on your
results will take a proper statistician, which I am not :)

Of course, an obvious improvement would be to always run the test twice
and, if either of those gives you a close-to-no-difference result, run
it a third time. If both (or all three) runs give you very similar
results, you would have much better confidence in your result.

In this case you would want to run both/all tests with the same number
of iterations to get comparable numbers. But if those tests give a
close-to-no-difference result, it would also be possible to run another
test with more iterations, which should tighten up the confidence
interval (and thus give a more certain result).
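
As a rough sketch of why more iterations tighten the interval: under a
normal approximation the half-width of the confidence interval for a
mean scales as 1/sqrt(n), so four times the iterations roughly halves
it. The data below is synthetic and ci_half_width is a hypothetical
helper, not the calculation any existing tool performs.

```python
import math
import random
import statistics

def ci_half_width(samples, z=1.96):
    """Approximate 95% CI half-width for the mean (normal approximation)."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))

# Synthetic "benchmark scores"; mean and spread are made up.
rng = random.Random(1)
scores = [rng.gauss(100.0, 5.0) for _ in range(484)]

for n in (121, 484):
    print(n, "iterations: +/-", round(ci_half_width(scores[:n]), 3))
```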

> Using a higher confidence threshold (e.g., -c 98) results in "no
> difference" throughout, as expected.  I feel like 90% isn't a tight
> enough confidence interval for a lot of what we do, but I'm unsure how
> to determine what confidence level we should use.  We could
> experimentally determine it by running a test some number of times and
> finding the interval that detects no change in some random partitioning
> of the test results.  Ugh.

But that only attacks half the problem. You also want differences to be
flagged when there really is one. So you would also have to work out
how large a real improvement you are willing to see discarded as "no
difference".
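
The trade-off can be sketched numerically. Assuming (unrealistically)
normal data with known, equal variance in both samples, the probability
that a two-sided test flags a true difference has a closed form. The
helper detection_probability and the effect size used below are
hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist

def detection_probability(effect, n, confidence, sigma=1.0):
    """Chance a two-sided z-test flags a true mean difference `effect`.

    n is the number of iterations in each of the two samples; sigma is
    the (assumed known) per-iteration standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2.0)
    # True difference measured in standard errors of the difference
    shift = effect / (sigma * math.sqrt(2.0 / n))
    upper = 1.0 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower

for confidence in (0.90, 0.95, 0.98):
    p = detection_probability(effect=0.3, n=121, confidence=confidence)
    print(f"confidence {confidence:.2f}: detects a 0.3-sigma change "
          f"with probability {p:.2f}")
```

Raising the confidence level lowers the false-alarm rate, but under
these assumptions it also lowers the chance of flagging a real change
of a given size, unless the number of iterations goes up as well.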

Another "obvious" way to improve the results would be to use a better
analysis of the data than what is currently done. But that brings me
back to the "I'm not a statistician" bit.

eirik