[Mesa-dev] [PATCH] i965: Don't check for draw-time errors that cannot occur in core profile

Eirik Byrkjeflot Anonsen eirik at eirikba.org
Tue Sep 1 12:45:15 PDT 2015


Ilia Mirkin <imirkin at alum.mit.edu> writes:

> On Tue, Sep 1, 2015 at 12:15 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> For a bunch of the small changes, I don't care too much what the
>> difference is.  I just want to know whether after is better than before.
>
> And that gets back to my comment that you can't *measure* the impact
> of a change. Not with something where the outcome is a random
> variable. It can't be done.
>
> All you can do is answer the question "is X's mean more than N higher
> than Y's mean". And you change the number of trials in an experiment
> depending on N. (There's also more advanced concepts like 'power' and
> whatnot, I've done just fine without fully understanding them, I
> suspect you can too.)

Power is (IIRC) roughly the flip side of the significance level. That
is, the significance level (the p-value threshold, typically 0.05)
gives you the probability of a false positive, while the power gives
you the probability of detecting a real effect (so 1-power is the
probability of a false negative). The complication is that you usually
just choose the significance level, while the power has to be
calculated from the effect size, the variance and the number of
trials.
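
To make that concrete, here is a rough simulation sketch (mine, not
anything from the benchmark setup); the effect size, noise level and
trial counts are made-up numbers:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05         # chosen significance level (false-positive rate)
true_effect = 0.002  # assumed real difference: 0.2%
noise = 0.01         # assumed run-to-run standard deviation: 1%
n = 200              # trials per side
reps = 2000          # simulated experiments

hits = 0
for _ in range(reps):
    before = rng.normal(1.0, noise, n)
    after = rng.normal(1.0 + true_effect, noise, n)
    _, p = stats.ttest_ind(after, before)
    hits += int(p < alpha)

print("estimated power:", hits / reps)  # 1 - power = false negatives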

> As an aside, increasing the number of trials until you get a
> significant result is a great way to arrive at incorrect decisions,
> due to the multi-look problem (95% CI means 1/20 gives you bad
> results). The proper way is to decide beforehand "I care about
> changes >0.1%, which means I need to run 5000 trial runs"

One trick could be to run a sequence of tests to find out how many
trials are needed to reach significance, and then check whether you
get repeatable results with that many trials. If you do, you should be
safe.

The key word is of course "repeatable". If a correctly executed test
gives a repeatable "significant difference", it usually doesn't matter
too much how you figured out which parameters were needed. ("usually",
because you could run into a choice of parameters that invalidates the
whole test. But merely increasing the number of trials shouldn't do
that.)
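
And to illustrate what I mean by repeatable (again just a sketch with
invented numbers, not real benchmark data): re-run the whole n-trial
experiment a few times and see whether the significant difference
shows up every time.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000           # trials per experiment, found by the search above
replications = 5   # independent repeats of the whole experiment

def one_experiment():
    before = rng.normal(1.000, 0.01, n)  # stand-in for benchmark runs
    after = rng.normal(1.002, 0.01, n)
    _, p = stats.ttest_ind(after, before)
    return p < 0.05

results = [one_experiment() for _ in range(replications)]
print("significant in", sum(results), "of", replications, "repeats")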

Which brings us to the clear value of multiple people running similar
tests and getting similar results. That strengthens the conclusion
significantly.

> (based on the
> assumption that 50 runs gets you 1%). After doing the 5k runs, your CI
> width should be ~0.1% and you should then be able to see if the delta
> in means is higher or lower than that. If it's higher, then you've
> detected a significant change. If it's not, that btw doesn't mean "no
> change", just not statistically significant. There's also a procedure
> for the null hypothesis (i.e. is a change's impact <1%) which is
> basically the same thing but involves doing a few more runs (like 50%
> more? I forget the details).
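
(Just to spell out the 50 -> 5000 arithmetic above, since the CI width
shrinks with the square root of the number of runs; the 1% and 0.1%
figures are Ilia's, the rest is a stand-in sketch:)

runs_small, width_small = 50, 0.01   # 50 runs gives a ~1% wide CI
target_width = 0.001                 # we care about changes >0.1%
runs_needed = runs_small * (width_small / target_width) ** 2
print(runs_needed)                   # -> 5000.0

# Then compare the observed delta in means against that CI width:
#   delta > width  -> statistically significant change
#   delta <= width -> not significant (not the same as "no change")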

Hmm, you could just formulate your null hypothesis as "the change is
greater than 1%" and then test that normally; rejecting it then tells
you the change is smaller than 1%.

> Anyways, I'm sure I've bored everyone to death with these pedantic
> explanations, but IME statistics is one of the most misunderstood
> areas of math, especially among us engineers.
>
>   -ilia

What, statistics boring? No way! :)

eirik

