[Intel-gfx] [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno

Thu Jun 11 12:14:04 PDT 2015

On Thu, 2015-06-11 at 09:02 +0100, Dave Gordon wrote:
> On 10/06/15 16:52, Chris Wilson wrote:
> > On Wed, Jun 10, 2015 at 06:26:58PM +0300, Imre Deak wrote:
> >> On Wed, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
> >>> On 06/10/2015 03:59 AM, Imre Deak wrote:
> >>>> I think the discussion here is about two separate things:
> >>>> 1. Possible ordering issue between the seqno store and the completion
> >>>> interrupt
> >>>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
> >>>> indefinitely, which this patch works around
> >>>>
> >>>> I'm confident that in my case the problem is not due to ordering. If it
> >>>> was "only" ordering then the value would show up eventually. This is not
> >>>> the case though, __wait_for_request will see the stale value
> >>>> indefinitely even though it gets woken up periodically afterwards by the
> >>>> lost IRQ logic (with hangcheck disabled).
> >>>
> >>> Yeah, based on your workaround it sounds like the write from the CS is
> >>> landing in memory but failing to invalidate the associated CPU
> >>> cacheline.  I assume mapping the HWSP as uncached also works around this
> >>> issue?
> >>
> >> I assume it would, but it would of course have a bigger overhead.
> 
> Would it though? We shouldn't be repeatedly reading from the HWSP unless
> we're waiting for something to change, in which case we absolutely want
> the latest value without any caching whatsoever.

Normally on GEN6+ a cached read always returns the very latest value;
that's the whole point of coherency between GPU and CPU, and it also
holds for BXT. We use the lazy_coherency flag to work around the
relatively rare occasions when this coherency is violated for a reason
that is yet to be found (for example some missing GT workaround on
BXT). I still prefer this form of the workaround, where we do an
uncached read only when it's absolutely necessary, to one where we do
uncached reads all the time. I expect that doing uncached reads in a
loop in __i915_spin_request() would be worse than doing cached reads
there (and again, in general these cached reads give you the same
response time as uncached reads).
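
To make the distinction concrete, here is a toy user-space model of the
behaviour described above. It is only a sketch, not driver code: the
names toy_hwsp, toy_gpu_write_seqno() and toy_get_seqno() are made up
for illustration, and copying "memory" into "cacheline" merely stands in
for whatever forced read the real workaround performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical names): "memory" is what the GPU stored,
 * "cacheline" is what a cached CPU read would return. */
struct toy_hwsp {
	uint32_t memory;    /* value written by the command streamer */
	uint32_t cacheline; /* CPU's cached copy, possibly stale */
};

/* GPU store. 'snooped' models whether the CPU's cached copy was
 * updated; false models the rare coherency failure. */
static void toy_gpu_write_seqno(struct toy_hwsp *hwsp, uint32_t seqno,
				bool snooped)
{
	hwsp->memory = seqno;
	if (snooped)
		hwsp->cacheline = seqno;
}

/* Cached read by default; with lazy_coherency == false the cached
 * copy is refreshed from memory first (assumed stand-in for the
 * forced, uncached read). */
static uint32_t toy_get_seqno(struct toy_hwsp *hwsp, bool lazy_coherency)
{
	if (!lazy_coherency)
		hwsp->cacheline = hwsp->memory;
	return hwsp->cacheline;
}
```

In this model a lazy read is already up to date whenever the store was
snooped, and only the non-snooped case needs the forced read.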

> Since the only thing we want to read from the HWSP is the seqno (via
> engine->get_seqno()), and that has a 'lazy_coherency' flag, we could
> cache the last value read explicitly in the intel_engine_cs and return
> that if lazy_coherency is set, and get it directly from the uncached
> HWSP only when lazy_coherency is false (i.e. it would become
> very-lazy-coherency).

engine->get_seqno() will in general give you the very latest value even
with lazy_coherency=true. The exception is when the coherency problem is
hit, which should be relatively rare.
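
For comparison, the "very-lazy-coherency" variant proposed in the quoted
text could be modelled roughly as below. Again this is only a sketch
under assumptions: toy_engine, its fields and
toy_very_lazy_get_seqno() are hypothetical, and a plain volatile
pointer stands in for an uncached HWSP mapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the engine state in the proposal. */
struct toy_engine {
	volatile uint32_t *hwsp_seqno; /* models an uncached HWSP mapping */
	uint32_t last_seqno;           /* value kept from the last uncached read */
	unsigned int uncached_reads;   /* counts the expensive reads */
};

/* Lazy callers get the remembered value; only non-lazy callers
 * touch the (modelled) uncached mapping. */
static uint32_t toy_very_lazy_get_seqno(struct toy_engine *engine,
					bool lazy_coherency)
{
	if (!lazy_coherency) {
		engine->last_seqno = *engine->hwsp_seqno;
		engine->uncached_reads++;
	}
	return engine->last_seqno;
}
```

Note that in this variant a lazy read can never observe a new seqno on
its own, whereas a coherent cached read normally does.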

> >> Based
> >> on my testing the coherency problem happens only occasionally, so for
> >> the rest of the times we still would want to benefit from cached reads.
> >> See especially __i915_spin_request().
>
> I would have expected __i915_spin_request() NOT to allow lazy coherency,
> at least inside the loop. It could do one pre-check with lazy coherency
> on, for the presumably-common case where the request has already
> completed, but then use fully uncached reads while looping?

Based on the above, I think coherent cached reads give us the most
benefit precisely inside such loops.
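
As a rough illustration of that shape, the sketch below spins on cached
reads and does a single forced read only before giving up. It is a toy
model with hypothetical names (toy_page, toy_page_read(),
toy_spin_request()), and the placement of the forced read is one
possible choice, not necessarily where the patch puts it. The wrap-safe
comparison follows the same idea as the driver's i915_seqno_passed().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy status page (hypothetical): memory vs. the CPU's cached view. */
struct toy_page {
	uint32_t memory;
	uint32_t cacheline;
	unsigned int forced_reads; /* how often the slow path was taken */
};

static uint32_t toy_page_read(struct toy_page *p, bool lazy_coherency)
{
	if (!lazy_coherency) {
		p->cacheline = p->memory; /* models the forced read */
		p->forced_reads++;
	}
	return p->cacheline;
}

/* True if seq1 is at or after seq2, tolerating 32-bit wraparound. */
static bool toy_seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Spin on cheap cached reads; fall back to one forced read only
 * when the spin budget is exhausted. */
static bool toy_spin_request(struct toy_page *p, uint32_t target,
			     unsigned int max_spins)
{
	for (unsigned int i = 0; i < max_spins; i++) {
		if (toy_seqno_passed(toy_page_read(p, true), target))
			return true;
	}
	return toy_seqno_passed(toy_page_read(p, false), target);
}
```

In the common case the loop returns from a cached read and the forced
read is never issued.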

--Imre