[Intel-gfx] [PATCH v2 2/2] drm/i915/bxt: work around HW context corruption due to coherency problem

Wed Sep 23 06:57:48 PDT 2015

On ke, 2015-09-23 at 14:39 +0100, Chris Wilson wrote:
> On Wed, Sep 23, 2015 at 03:35:59PM +0200, Daniel Vetter wrote:
> > On Fri, Sep 18, 2015 at 10:02:24AM +0100, Chris Wilson wrote:
> > > On Thu, Sep 17, 2015 at 07:17:44PM +0300, Imre Deak wrote:
> > > > The execlist context object is mapped with a CPU/GPU coherent mapping
> > > > everywhere, but on BXT A stepping due to a HW issue the coherency is not
> > > > guaranteed. To work around this flush the context object after pinning
> > > > it (to flush cache lines left by the context initialization/read-back
> > > > from backing storage) and mark it as uncached so later updates during
> > > > context switching will be coherent.
> > > > 
> > > > I noticed this problem via a GPU hang, where IPEHR pointed to an invalid
> > > > opcode value. I couldn't find this value on the ring but looking at the
> > > > contents of the active context object it turned out to be a parameter
> > > > dword of a bigger command there. The original command opcode itself
> > > > was zeroed out, based on the above I assume due to a CPU writeback of
> > > > the corresponding cacheline. When restoring the context the GPU would
> > > > jump over the zeroed out opcode and hang when trying to execute the
> > > > above parameter dword.
> > > > 
> > > > I could easily reproduce this by running igt/gem_render_copy_redux and
> > > > gem_tiled_blits/basic in parallel, but I guess it could be triggered by
> > > > anything involving frequent switches between two separate contexts. With
> > > > this workaround I couldn't reproduce the problem.
> > > > 
> > > > v2:
> > > > - instead of clflushing after updating the tail and PDP values during
> > > >   context switching, map the corresponding page as uncached to avoid a
> > > >   race between CPU and GPU, both updating the same cacheline at the same
> > > >   time (Ville)
> > > 
> > > No. Changing PAT involves a stop_machine() and is severely detrimental
> > > to performance (context creation overhead does impact userspace).
> > > Mapping it as uncached doesn't remove the race anyway.
> > 
> > Yeah it's not pretty, but otoh it's for A stepping and we'll kill it again
> > once bxt is shipping. I think with a big "IXME: dont ever dare to copy
> > this" comment this is acceptable. It's not really the worst "my gpu
> > crawls" workaround we've seend for early hw ...
> 
> Thinking about this, an incoherent TAIL write cannot cause IPEHR !=
> *ACTHD. The flush is just papering over the absence of a flush elsewhere
> and the root cause remains unfixed.

It's not the TAIL write that causing IPEHR != *ACTHD. I tried to explain
this in the commit message, I guess it was not clear. The lack of flush
atm means that after initializing the context by CPU those initialized
values will be not necessarily visible by the GPU (due to the coherency
problem). These initialized values also include the zeroed out values
returned by shmem/GUP. Eventually this context will get scheduled in/out
resulting in a context save by the GPU. After this context save we still
have the contents of the context in the CPU cache that are possibly not
coherent with the GPU. If now one of these cache lines is evicted (due
to some unrelated cache thrashing) the values stored by the GPU context
save will get overwritten by the stale CPU cache values. This is exactly
what I saw, zeroed out values in the context where there should've been
valid command dwords.

--Imre

> -Chris
>