[Intel-gfx] [PATCH] drm/i915: Always wake the device to flush the GTT

Wed Aug 30 12:56:40 UTC 2017

Quoting Daniel Vetter (2017-08-30 13:23:56)
> On Tue, Aug 29, 2017 at 03:59:36PM +0100, Chris Wilson wrote:
> > Quoting Joonas Lahtinen (2017-08-29 15:54:06)
> > > On Tue, 2017-08-29 at 11:33 +0100, Chris Wilson wrote:
> > > > Since we hold the device wakeref when writing through the GTT (otherwise
> > > > the writes would fail), we presumed that before the device sleeps those
> > > > writes would naturally be flushed and that we wouldn't need our mmio
> > > > read trick. However, that presumption seems false and a sleepy bxt seems
> > > > to require us to always manually flush the GTT writes prior to direct
> > > > access.
> > > > 
> > > > Fixes: e2a2aa36a509 ("drm/i915: Check we have an wake device before flushing GTT writes")
> > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > > 
> > > Got any Bugzilla, Testcase, Tested-by?
> > 
> > Original bugzilla hasn't been reopened, so it looks like they were
> > happy enough with the original patches that fixed the problem on my bxt.
> > The testcase seems to be very system dependent, my suspicion is that it
> > has to do with the wacky runtime pm exhibited by CI bxt.
> 
> CI bxt doesn't have displays, which means we shut down a lot more when
> it's running. Does this indicate a huge gem test gap where we should run
> plenty of gem testcases with all the outputs shut down?

This one is hard to tell since we are guessing at how the hw actually
works. Strong PCI ordering it is not.

If we take this example at face value, the key point of failure was
rpm_get_if_in_use, so we could simply say that we need to ensure that
all such branches are exercised, with varying amounts of stress since we
are looking for a random hw delay.
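
To make that branch concrete, here is a toy userspace model of the
failure mode. It is only a sketch: every toy_* name is hypothetical and
none of this is the i915 API. It assumes, as the patch does, that GTT
writes can still be pending after the last wakeref is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names here are hypothetical, not the i915 API. */
struct toy_device {
	int wakerefs;        /* outstanding wake references */
	bool awake;          /* true while at least one wakeref is held */
	int pending_writes;  /* GTT writes not yet flushed (assumed behaviour) */
};

/* Unconditionally wake the device and take a reference. */
static void toy_rpm_get(struct toy_device *dev)
{
	dev->wakerefs++;
	dev->awake = true;
}

/* Take a reference only if somebody else already holds one. */
static bool toy_rpm_get_if_in_use(struct toy_device *dev)
{
	if (dev->wakerefs == 0)
		return false;
	dev->wakerefs++;
	return true;
}

/* Drop a reference; the device sleeps when the last one goes away. */
static void toy_rpm_put(struct toy_device *dev)
{
	assert(dev->wakerefs > 0);
	if (--dev->wakerefs == 0)
		dev->awake = false;
}

/* A write through the GTT: done under a wakeref, but not flushed by the put. */
static void toy_gtt_write(struct toy_device *dev)
{
	toy_rpm_get(dev);
	dev->pending_writes++;
	toy_rpm_put(dev);
}

/* Old behaviour: skip the flush if the device is already asleep. */
static void toy_flush_if_in_use(struct toy_device *dev)
{
	if (!toy_rpm_get_if_in_use(dev))
		return;
	dev->pending_writes = 0;
	toy_rpm_put(dev);
}

/* Patched behaviour: always wake the device to flush. */
static void toy_flush_always(struct toy_device *dev)
{
	toy_rpm_get(dev);
	dev->pending_writes = 0;
	toy_rpm_put(dev);
}
```

In this model, calling toy_gtt_write() and then toy_flush_if_in_use()
on an otherwise idle device leaves the write pending, while
toy_flush_always() clears it. The real hardware adds a random delay on
top, which the model does not attempt to capture.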

At the moment that boils down to the shrinker avoiding unbinding
anything whilst the device is idle, pushing us closer to oom (with
kswapd hopefully riding to the rescue, and we all know how unreliable
kswapd is).

The wacky part of CI suspend seems to be that there's no reason for the
device to wake at times; quite a few of the tests are just burning
cycles without touching the hw and we still have the constant stream of
suspend/resume. I'm very suspicious that we are waking up too often (and
that it takes too long, about 28ms including the hpd of a headless
machine).

> Or just the need to add a pile more tests to pm_rpm?

No. It's just your regular combinatorial explosion. The approach I would
take here would be to register a sysenter callback that attempted to do an
rpm suspend (i.e. so ~every ioctl would start from idle, and controlled
via the faultinjection framework) and then run the minimal test set that
exercises all ioctl paths, and then expand to all driver branches.
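
The shape of that idea can be sketched as a toy userspace model. This
is not kernel code: the toy_* names are hypothetical, the global flag
merely stands in for a fault-injection control, and the dispatch
wrapper stands in for the per-syscall callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names here are hypothetical, not kernel API. */
struct toy_device {
	int wakerefs;          /* outstanding wake references */
	bool suspended;        /* true once a suspend attempt succeeded */
	unsigned int resumes;  /* how often an ioctl had to wake the device */
};

/* Stand-in for a fault-injection knob controlling the forced suspend. */
static bool toy_inject_suspend = true;

/* Try to suspend; refuse while anyone holds a wakeref. */
static bool toy_try_suspend(struct toy_device *dev)
{
	if (dev->wakerefs > 0)
		return false;
	dev->suspended = true;
	return true;
}

/* Take a wakeref, resuming the device first if it was suspended. */
static void toy_wake(struct toy_device *dev)
{
	if (dev->suspended) {
		dev->suspended = false;
		dev->resumes++;
	}
	dev->wakerefs++;
}

/* Drop a wakeref taken by toy_wake(). */
static void toy_release(struct toy_device *dev)
{
	assert(dev->wakerefs > 0);
	dev->wakerefs--;
}

typedef int (*toy_ioctl_fn)(struct toy_device *dev);

/* An ioctl path that touches the hw and therefore needs the device awake. */
static int toy_ioctl_touch_hw(struct toy_device *dev)
{
	toy_wake(dev);
	toy_release(dev);
	return 0;
}

/* An ioctl path that never touches the hw. */
static int toy_ioctl_no_hw(struct toy_device *dev)
{
	(void)dev;
	return 0;
}

/* Entry hook: when injection is on, every ioctl starts from idle. */
static int toy_dispatch(struct toy_device *dev, toy_ioctl_fn fn)
{
	if (toy_inject_suspend)
		toy_try_suspend(dev);
	return fn(dev);
}
```

With injection enabled, every call routed through toy_dispatch() that
touches the hw has to resume the device first, so the resume path gets
exercised once per ioctl instead of only when the timing happens to
line up.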

First we need coverage feedback.
-Chris