[Mesa-dev] [PATCH] i965: Don't check for draw-time errors that cannot occur in core profile
Ilia Mirkin
imirkin at alum.mit.edu
Mon Aug 31 23:25:43 PDT 2015
On Tue, Sep 1, 2015 at 1:48 AM, Eirik Byrkjeflot Anonsen
<eirik at eirikba.org> wrote:
> Ian Romanick <idr at freedesktop.org> writes:
>
>> ping. :)
>>
>> On 08/10/2015 11:48 AM, Matt Turner wrote:
>>> On Mon, Aug 10, 2015 at 10:12 AM, Ian Romanick <idr at freedesktop.org> wrote:
>>>> From: Ian Romanick <ian.d.romanick at intel.com>
>>>>
>>>> On many CPU-limited applications, this is *the* hot path. The idea is
>>>> to generate per-API versions of brw_draw_prims that elide some checks.
>>>> This patch removes render-mode and "is everything in VBOs" checks from
>>>> core-profile contexts.
>>>>
>>>> On my IVB laptop (which may have experienced thermal throttling):
>>>>
>>>> Gl32Batch7: 3.70955% +/- 1.11344%
>>>
>>> I'm getting 3.18414% +/- 0.587956% (n=113) on my IVB, which probably
>>> matches your numbers depending on your value of n.
>>>
>>>> OglBatch7: 1.04398% +/- 0.772788%
>>>
>>> I'm getting 1.15377% +/- 1.05898% (n=34) on my IVB, which probably
>>> matches your numbers depending on your value of n.
>>
>> This is another thing that makes me feel a little uncomfortable with the
>> way we've done performance measurements in the past. If I run my test
>> before and after this patch for 121 iterations, which I have done, I can
>> cut the data at any point and oscillate between "no difference" and X%
>> +/- some-large-fraction-of-X%. Since the before and after code for the
>> compatibility profile path should be identical, "no difference" is the
>> only believable result.
>
> That's pretty much expected, I believe. In essence, you are running 121
> tests, each with a 95% confidence interval and so should expect
> somewhere around 5 "significant difference" results. That's not entirely
> true of course, since these are not 121 *independent* tests, but the
> basic problem remains.
(more stats rants follow)
While my job title has never been 'statistician', I've been around a
bunch of them. Just want to correct this... let's forget about these
tests and instead think about coin flips (of a potentially unfair
coin). What you're doing is flipping the coin 100 times and then
looking at the number of times it came up heads and tails. From that
you're inferring the mean of the distribution. Obviously, the more
times you flip, the surer you can be of your result. That "sureness"
is expressed as a confidence interval. A 95% CI means that for 95% of
such experiments (i.e. "flip a coin 100 times to determine its true
heads:tails ratio"), the *true* mean of the distribution will lie
within the confidence interval (and conversely, for 5% of such
experiments, the true mean will be outside of the interval). Note how
this is _not_ "the mean has a 95% chance of lying in the interval" or
anything like that. One of these runs of 121 iterations is a single
"experiment".
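
To make the "coverage" idea concrete, here's a rough Python sketch
(not part of the actual benchmarking setup; the 0.6 bias and the
counts are made-up numbers): repeat the 100-flip experiment many
times, build a 95% CI each time, and count how often the true bias
lands inside. It should come out at roughly 95%.

import math
import random

TRUE_P = 0.6        # hypothetical bias of the coin
FLIPS = 100
EXPERIMENTS = 10000

covered = 0
for _ in range(EXPERIMENTS):
    heads = sum(random.random() < TRUE_P for _ in range(FLIPS))
    p_hat = heads / FLIPS
    # normal-approximation 95% CI for a binomial proportion
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / FLIPS)
    if p_hat - half_width <= TRUE_P <= p_hat + half_width:
        covered += 1

print("coverage: %.1f%%" % (100.0 * covered / EXPERIMENTS))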
Bringing this back to what you guys are doing: you're measuring some
metric (say, time), which is hardly binomial, but one might hope that
the amount of time a particular run takes on a particular machine at
a particular commit is normally distributed. Given that, after 100
runs you can estimate that the "true" mean runtime lies within a CI.
You're then comparing two CIs to determine the % change between the
two distributions, and trying to ascertain whether they differ and by
how much.
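
As a concrete illustration (just a sketch with invented timings, not
SynMark numbers), the per-build estimate looks roughly like this,
assuming the per-run times are approximately normal:

import math
import statistics

def mean_ci_95(samples):
    m = statistics.mean(samples)
    # standard error of the mean; 1.96 is the normal 95% quantile
    # (a t quantile would be more careful for small n)
    se = statistics.stdev(samples) / math.sqrt(len(samples))
    return m, m - 1.96 * se, m + 1.96 * se

before = [10.12, 10.08, 10.21, 10.15, 10.09, 10.18]  # seconds, made up
after  = [10.02, 10.06, 10.11,  9.98, 10.05, 10.07]

for name, data in (("before", before), ("after", after)):
    m, lo, hi = mean_ci_95(data)
    print("%s: mean %.3f s, 95%% CI [%.3f, %.3f]" % (name, m, lo, hi))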
Now, no (finite) amount of experimentation will bring you a CI of 0.
So setting out to *measure* the impact of a change is meaningless
unless you have some precise form of measurement (e.g. lines of code).
All you can do is ask the question "is the change > X?". And for any
such X, you can compute the number of runs you'd need in order to get
a CI bound that tight. You could work this out mathematically, and it
depends on some of the absolute values in question, but empirically it
seems like for 50 runs you get a CI width of ~1%. If you're trying to
demonstrate a change of more than 1%, or to demonstrate that a change
is no more than 1%, then this is fine. If you want to demonstrate that
the change is no more than some smaller amount, the number of runs
needed grows quadratically as the target width shrinks: if it's 50
runs for 1%, it's 200 runs for 0.5%, and so on.
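
For completeness, the math behind that scaling: the 95% half-width is
roughly 1.96*sigma/sqrt(n), so solving for n gives
n = (1.96*sigma/half_width)^2. A back-of-the-envelope sketch (the
~2.5% noise level is an assumed number, not a measurement):

import math

def runs_needed(rel_sigma, target_half_width):
    # both arguments are fractions of the mean, e.g. 0.01 for 1%
    return math.ceil((1.96 * rel_sigma / target_half_width) ** 2)

rel_sigma = 0.025   # assumed run-to-run noise of ~2.5% of the mean
for width in (0.01, 0.005, 0.0025):
    print("half-width %.2f%% -> ~%d runs"
          % (100 * width, runs_needed(rel_sigma, width)))

Halving the target width quadruples the number of runs, which is the
50 -> 200 jump above.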
This is all still subject to the normal distribution assumption I
mentioned earlier. You could do some empirical tests and figure out
what the "not-a-normal-distribution" factor is for you; it might be as
high as 1.5.
-ilia