[Intel-gfx] [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno

Thu Jun 11 01:02:44 PDT 2015

On 10/06/15 16:52, Chris Wilson wrote:
> On Wed, Jun 10, 2015 at 06:26:58PM +0300, Imre Deak wrote:
>> On ke, 2015-06-10 at 08:10 -0700, Jesse Barnes wrote:
>>> On 06/10/2015 03:59 AM, Imre Deak wrote:
>>>> I think the discussion here is about two separate things:
>>>> 1. Possible ordering issue between the seqno store and the completion
>>>> interrupt
>>>> 2. Coherency issue that leaves the CPU with a stale view of the seqno
>>>> indefinitely, which this patch works around
>>>>
>>>> I'm confident that in my case the problem is not due to ordering. If it
>>>> was "only" ordering then the value would show up eventually. This is not
>>>> the case though, __wait_for_request will see the stale value
>>>> indefinitely even though it gets woken up periodically afterwards by the
>>>> lost IRQ logic (with hangcheck disabled).
>>>
>>> Yeah, based on your workaround it sounds like the write from the CS is
>>> landing in memory but failing to invalidate the associated CPU
>>> cacheline.  I assume mapping the HWSP as uncached also works around this
>>> issue?
>>
>> I assume it would, but it would of course have a bigger overhead.

Would it though? We shouldn't be repeatedly reading from the HWSP unless
we're waiting for something to change, in which case we absolutely want
the latest value without any caching whatsoever.

Since the only thing we want to read from the HWSP is the seqno (via
engine->get_seqno()), and that has a 'lazy_coherency' flag, we could
cache the last value read explicitly in the intel_engine_cs and return
that if lazy_coherency is set, and get it directly from the uncached
HWSP only when lazy_coherency is false (i.e. it would become
very-lazy-coherency).

>> Based
>> on my testing the coherency problem happens only occasionally, so for
>> the rest of the times we still would want to benefit from cached reads.
>> See especially __i915_spin_request().

I would have expected __i915_spin_request() NOT to allow lazy coherency,
at least inside the loop. It could do one pre-check with lazy coherency
on, for the presumably-common case where the request has already
completed, but then use fully uncached reads while looping?

> Yeah, I am not quite sure about that myself. The reason we don't do
> forced coherent reads there is because of the impact that has with fw
> and the snb/ivb/hsw workaround.  If we have a viable strategy that gives us
> accurate seqno at marginal cost, then it will be beneficial to do so from within
> the busy-spin - in order to minimise the amount of cycles we spin.

If we mapped the HWSP uncached, and used a GPU instruction that
guaranteed making the seqno update globally-visible ASAP, we might not
need to read a register at all, and then we wouldn't have the
complications with forcewake :)

> We always spend a jiffie of CPU time, we might as spend it getting
> accurate results (so the coherent read is quick enough and doesn't slow
> down the normal case).
> 
> I seem to recall that we tried clflushing for gen6+, but we didn't
> record any details, so it may be worth retesting ivb with that w/a.
> -Chris

Is that clflush on the CPU, or MI_CLFLUSH on the GPU? Does the latter
help at all?

.Dave.