[Intel-gfx] [PATCH 8/8] drm/i915: NULL check of unpin_work

Tue Oct 13 04:51:47 PDT 2015

On Fri, Oct 09, 2015 at 01:06:24PM +0100, Tomas Elf wrote:
> On 09/10/2015 11:44, Chris Wilson wrote:
> >On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
> >>On 09/10/2015 08:46, Chris Wilson wrote:
> >>>On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> >>>>Avoid NULL pointer exceptions in the display driver for certain critical cases
> >>>>when unpin_work has turned out to be NULL.
> >>>
> >>>Nope, the machine reached a point where it cannot possibly reach, we
> >>>want the OOPS.
> >>>-Chris
> >>>
> >>
> >>Really? Because if I have this in there I can actually run the
> >>long-duration stability tests for 12+ hours rather than just a few
> >>hours (it doesn't happen all the time but I've seen it at least
> >>once). But, hey, if you want to investigate the reason why we have a
> >>NULL there then go ahead. I guess I'll just have to carry this patch
> >>for as long as my machine crashes during my TDR testing.
> >
> >You've sat on a critical bug for how long? There's been one recent
> >report that appears to be fallout from Tvrtko's context cleanup, but
> >nothing older, in public at least.
> >-Chris
> >
> 
> Do people typically try to actively hang their machines several times a
> minute for 12+ hours at a time? If not then maybe that's why they haven't
> seen this.

The problem is that the world loves to conspire against races like these,
and somewhere out there is a machine which hits this ridiculously
reliably. And if this is a machine sitting on an OEM's product engineer
bench we've just lost a sale ;-)

Just because you can't repro it fast doesn't mean that your reproduction
rate is the lower limit.

> I haven't sat on anything, this has been part of my error state capture
> improvement patches which I've intended to upstream for several months now.
> I consider this part of the overall stability issue and grouped it together
> with all the other patches necessary to make the machine not crash for the
> duration of my test. I would've upstreamed this series a long time ago if I
> had actually been given time to work on it but that's another discussion.

I guess Chris' reaction is because we've had tons of cases where bugfixes
like this were stuck in Android trees because they were tied up with
some feature work. But I also know it's really hard to spot them, so in
case of doubt please upstream small fixes aggressively.

Since this one here is outside of the error capture itself, and so in code
where our existing assumption that error capture only runs when the driver
has already been dead for seconds doesn't hold, it's a prime candidate.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch