[Mesa-dev] [PATCH] i965: Don't check for draw-time errors that cannot occur in core profile

Ian Romanick idr at freedesktop.org
Tue Sep 1 09:15:10 PDT 2015


On 08/31/2015 11:25 PM, Ilia Mirkin wrote:
> On Tue, Sep 1, 2015 at 1:48 AM, Eirik Byrkjeflot Anonsen
> <eirik at eirikba.org> wrote:
>> Ian Romanick <idr at freedesktop.org> writes:
>>
>>> ping. :)
>>>
>>> On 08/10/2015 11:48 AM, Matt Turner wrote:
>>>> On Mon, Aug 10, 2015 at 10:12 AM, Ian Romanick <idr at freedesktop.org> wrote:
>>>>> From: Ian Romanick <ian.d.romanick at intel.com>
>>>>>
>>>>> On many CPU-limited applications, this is *the* hot path.  The idea is
>>>>> to generate per-API versions of brw_draw_prims that elide some checks.
>>>>> This patch removes render-mode and "is everything in VBOs" checks from
>>>>> core-profile contexts.
>>>>>
>>>>> On my IVB laptop (which may have experienced thermal throttling):
>>>>>
>>>>> Gl32Batch7:     3.70955% +/- 1.11344%
>>>>
>>>> I'm getting 3.18414% +/- 0.587956% (n=113) on my IVB, which probably
>>>> matches your numbers depending on your value of n.
>>>>
>>>>> OglBatch7:      1.04398% +/- 0.772788%
>>>>
>>>> I'm getting 1.15377% +/- 1.05898% (n=34) on my IVB, which probably
>>>> matches your numbers depending on your value of n.
>>>
>>> This is another thing that makes me feel a little uncomfortable with the
>>> way we've done performance measurements in the past.  If I run my test
>>> before and after this patch for 121 iterations, which I have done, I can
>>> cut the data at any point and oscillate between "no difference" and X%
>>> +/- some-large-fraction-of-X%.  Since the before and after code for the
>>> compatibility profile path should be identical, "no difference" is the
>>> only believable result.
>>
>> That's pretty much expected, I believe. In essence, you are running 121
>> tests, each with a 95% confidence interval and so should expect
>> somewhere around 5 "significant difference" results. That's not entirely
>> true of course, since these are not 121 *independent* tests, but the
>> basic problem remains.
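
(A toy Monte Carlo of the effect Eirik describes -- in Python, with
made-up sample sizes and FPS numbers rather than the real benchmark
data.  Both "before" and "after" are drawn from the same distribution,
so any "significant" result is a false positive:)

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 121     # one pretend comparison per cut of the data
runs_per_side = 30      # runs per before/after sample (assumed)
false_positives = 0

for _ in range(n_experiments):
    # Same true mean on both sides: the real difference is zero.
    before = rng.normal(100.0, 2.0, runs_per_side)
    after = rng.normal(100.0, 2.0, runs_per_side)
    _, p = stats.ttest_ind(before, after, equal_var=False)
    if p < 0.05:
        false_positives += 1

print(false_positives, "of", n_experiments, "comparisons look significant")

(With independent comparisons that count comes out near the expected
~5%.  Cuts of a single 121-iteration run are not independent, so the
exact count differs, but the basic effect is the same.)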
> 
> (more stats rants follow)
> 
> While my job title has never been 'statistician', I've been around a
> bunch of them. Just want to correct this... let's forget about these
> tests, but instead think about coin flips (of a potentially unfair
> coin). What you're doing is flipping the coin 100 times, and then
> looking at the number of times it came up heads and tails. From that
> you're inferring the mean of the distribution. Obviously the more
> times you do the flip, the more sure you can be of your result. The
> "suredness", is expressed as a confidence interval. A 95% CI means
> that for 95% such experiments (i.e. "flip a coin 100 times to
> determine its true heads:tails ratio"), the *true* mean of the
> distribution will lie within the confidence interval (and conversely,
> for 5% of such experiments, the true mean will be outside of the
> interval). Note how this is _not_ "the mean has a 95% chance of lying
> in the interval" or anything like that. One of these runs of 121
> iterations is a single "experiment".
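
(The "for 95% of such experiments" reading is easy to check
numerically.  A throwaway Python sketch -- the coin's bias and the
flip count are made up:)

import numpy as np

rng = np.random.default_rng(0)
true_p = 0.6            # the coin's actual heads probability (assumed)
flips = 100
experiments = 10000
covered = 0

for _ in range(experiments):
    heads = rng.binomial(flips, true_p)
    p_hat = heads / flips
    # Normal-approximation 95% interval around the observed ratio.
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / flips)
    if p_hat - half <= true_p <= p_hat + half:
        covered += 1

print(covered / experiments)    # comes out close to 0.95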
> 
> Bringing this back to what you guys are doing, which is measuring some
> metric (say, time), which is hardly binomial, but one might hope that

For the particular test I'm looking at here, I think it should be
reasonably close.  The test itself runs a small set of frames a few
times (3 or 4) and logs the average FPS for the whole run.  It seems
like the "distribution of means is Gaussian" should apply, yeah?

> the amount of time that a particular run takes on a particular machine
> at a particular commit is normal. Given that, after 100 runs, you can
> estimate that the "true" mean runtime is within a CI. You're then
> comparing 2 CI's to determine the % change between the two
> distributions, and trying to ascertain whether they are different and
> by how much.
> 
> Now, no (finite) amount of experimentation will bring you a CI of 0.
> So setting out to *measure* the impact of a change is meaningless
> unless you have some precise form of measurement (e.g. lines of code).
> All you can do is ask the question "is the change > X". And for any
> such X, you can compute the number of runs that you'd need in order to
> get a CI bound that is "that tight". You could work this out
> mathematically, and it depends on some of the absolute values in
> question, but empirically it seems like for 50 runs, you get a CI
> width of ~1%. If you're trying to demonstrate changes that are less
> than 1%, or demonstrate that the change is no more than 1%, then this
> is fine. If you want to demonstrate that the change is no more than
> some smaller change, well, the number of runs goes like the square of
> the precision, i.e. if it's 50 runs for 1%, it's 200 runs for 0.5%, etc.

That sounds familiar... that the amount of expected difference
determines the lower bound on the number of required experiments.  I did
take "Statistics for Engineers" not that long ago.  Lol.  I think I
still have my textbook.  I'll dig around in it.
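
The back-of-the-envelope version of that bound looks something like
this (a Python sketch; the per-run standard deviation and target
widths are assumptions, not measured values):

import math

def runs_needed(sigma_pct, half_width_pct, z=1.96):
    """Runs needed for a CI of +/- half_width_pct on the mean, given a
    per-run standard deviation of sigma_pct (both in % of the mean)."""
    return math.ceil((z * sigma_pct / half_width_pct) ** 2)

sigma = 1.8   # per-run noise, assumed -- measure it from real runs
for hw in (1.0, 0.5, 0.25):
    print("+/- %.2f%%: about %d runs" % (hw, runs_needed(sigma, hw)))

Halving the target half-width quadruples the run count, which is the
quadratic scaling Ilia describes above (50 runs for 1%, 200 for 0.5%).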

For a bunch of the small changes, I don't care too much what the
difference is.  I just want to know whether after is better than before.
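
A minimal sketch of that yes/no question as a one-sided Welch t-test
on per-run FPS means (Python; the arrays are placeholders, and the
alternative= keyword needs SciPy >= 1.6):

import numpy as np
from scipy import stats

# Per-run average FPS for each build (placeholder numbers).
before = np.array([100.1, 99.7, 100.4, 99.9, 100.2])
after = np.array([101.0, 100.6, 100.9, 101.3, 100.8])

# One-sided Welch test: H1 is "after has a higher mean than before".
_, p = stats.ttest_ind(before, after, equal_var=False, alternative='less')
if p < 0.05:
    print("after is faster (p = %.3f)" % p)
else:
    print("can't tell at the 95%% level (p = %.3f)" % p)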

> This is all still subject to the normal distribution assumption as I
> mentioned earlier. You could do some empirical tests and figure out
> what the "not-a-normal-distribution" factor is for you; it might be as
> high as 1.5.
> 
>   -ilia


