[Intel-gfx] force wake reference counting (another try)

Tue Apr 12 18:30:22 CEST 2011

On Tue, Apr 12, 2011 at 09:02:21AM +0100, Chris Wilson wrote:
> On Mon, 11 Apr 2011 18:01:15 -0700, Ben Widawsky <ben at bwidawsk.net> wrote:
> > So once again, I expect this patch to potentially generate a lot of
> > warnings, but I consider all of those warnings to be serious bugs for
> > SNB.
> > 
> > If anyone has clever ideas on how to handle this outside of what I've
> > already mentioned, please let me know.
> > 
> > I expect ongoing patches to fix these issues as they come up.
> 
> Continuing the general discussion from IRC, we really do need to clarify
> which locks we expect to be holding when reading different sets of
> registers.  Along with similar documentation over which parts of our
> structures are covered by struct_mutex, mode_config.lock and the ensemble
> of spinlocks.  This task is not limited to just our driver, but we need to
> review the core as well as looking at how to improve the locking overall.
> (The clear example that something sucks is that probing outputs causes a
> rendering stall (fortunately less noticeable on recent hardware where
> hotplug is prevalent and the polling itself is much quicker, but still
> present.)

No doubt about it. It seems quite difficult to do with the amount of
change from generation to generation. It looks like we even have
register changes between steppings.

> 
> The only concern I have with the extra warnings are if we leave false
> positives in the code we submit upstream. Those warnings will quickly
> become regression reports and angry users. Alas, no clever ideas.
> -Chris

I am going to spend at least a day tracking down, and hopefully fixing
warnings if you agree with my next statement that it is in fact a
problem. My hope is there aren't too many cases.

The assertion I'm making is these are not false positives. I don't have
data yet to suggest this is happening, and I don't know the code well
enough to have a hunch one way or the other. However, here are two
possible examples of why it's broken with both the current, and
refcounted code.

Current:
Thread A acquires struct_mutex
Thread A goes to wake up the GT (takes a while)
Thread B acquires congif.mutex
Thread A finishes wake up, does the read
Thread B goes to wake up the GT (gets out fast because it's awake)
Thread A finishes, puts the GT to sleep
Thread B reads are potentially corrupted from here on

Refcounted:
Thread A acquires struct_mutex
Thread A increments refcounts, starts waking GT
Thread B acquires config.mutex
Thread B goes to wake GT, refcount is 1, does the reads before GT is
	actually awake

Ben