[Intel-gfx] [PATCH v2 2/2] drm/i915/bxt: Fix inadvertent CPU snooping due to incorrect MOCS config

Thu Apr 28 08:38:55 UTC 2016

On to, 2016-04-28 at 10:17 +0200, Daniel Vetter wrote:
> On Tue, Apr 26, 2016 at 07:01:06PM +0300, Imre Deak wrote:
> > On ti, 2016-04-26 at 15:42 +0100, Chris Wilson wrote:
> > > On Tue, Apr 26, 2016 at 05:26:43PM +0300, Eero Tamminen wrote:
> > > > Hi,
> > > > 
> > > > On 26.04.2016 16:23, Chris Wilson wrote:
> > > > > On Tue, Apr 26, 2016 at 04:17:55PM +0300, Imre Deak wrote:
> > > > > > On ti, 2016-04-26 at 13:57 +0100, Chris Wilson wrote:
> > > > > > > On Tue, Apr 26, 2016 at 03:44:22PM +0300, Imre Deak wrote:
> > > > > > > > Setting a write-back cache policy in the MOCS entry
> > > > > > > > definition also
> > > > > > > > implies snooping, which has a considerable overhead. This
> > > > > > > > is
> > > > > > > > unexpected for a few reasons:
> > > > > > > 
> > > > > > > If it is snooping, then I don't see why it is undesirable to
> > > > > > > have it
> > > > > > > available in a mocs setting. If it is bogus and the bit is
> > > > > > > undefined,
> > > > > > > then by all means remove it.
> > > > > > 
> > > > > > None of these entries are used alone for coherent surfaces. For
> > > > > > that
> > > > > > the application would have to use entry index#1 or #2 _and_
> > > > > > call the
> > > > > > set caching IOCTL to set the corresponding buffer to be cached.
> > > > > 
> > > > > No, the application doesn't. There are sufficent interfaces
> > > > > exposed that
> > > > > userspace can bypass the kernel if it so desired.
> > > > > 
> > > > > > The
> > > > > > problem is that without setting the buffer to be cacheable the
> > > > > > expectation is that we won't be snooping and incur the
> > > > > > corresponding
> > > > > > overhead. This is what this patch addresses.
> > > > > 
> > > > > Not true.
> > > > > 
> > > > > > The bit is also bogus, if we wanted snooping via MOCS we'd use
> > > > > > the
> > > > > > dedicated HW flag for that.
> > > > > 
> > > > > But you keep saying this bit *enables* snooping. So either it
> > > > > does or it
> > > > > doesn't.
> > > > > 
> > > > > > If we wanted to have a snooping MOCS entry we should add that
> > > > > > separately (as a forth entry), but we'd need this change as a
> > > > > > fix for
> > > > > > current users.
> > > > > 
> > > > > The current users who are getting what they request but don't
> > > > > know what
> > > > > they were requesting?
> > > > 
> > > > What this kernel ABI (index entry #2) has been agreed & documented
> > > > to provide?
> > > 
> > > The ABI is what we agree makes sense between hardware / kernel /
> > > userspace, and then we stick to it.
> > > 
> > > > I thought this entry is supposed to replace the writeback LLC/eLLC
> > > > cache MOCS setting Mesa is using on (e.g. BDW) to speed up accesses
> > > > to a memory area which it knows always to be accessed so that it
> > > > can
> > > > be cached.
> > > 
> > > Which it does... Only it speeds us writeback from the CPU, not the
> > > GPU. ;)
> > > 
> > > The confusion seems to be in mistaking !llc for llc. We have to come
> > > to
> > > some agreement on whether it makes sense having multiple entries for
> > > the
> > > same follows-PTE mocs or whether it makes more sense to expose the hw
> > > capabilties.
> > 
> > Note that we can't just say to Mesa to use index #0 on BXT instead of
> > index #2, since index #0 turns off caching in GPU L3, so we'd have to
> > also redefine that to be L3 cached. And I don't know what turning on L3
> > caching for index #0 would break, I would rather avoid doing that. Imo
> > defining the entries the following way would allow us to use the same
> > indexes on GEN9 regardless of it being SKL or BXT:
> > #0: uncached (neither in L3 nor in (e)LLC)
> > #1: PTE passthrough
> 
> So our rendercpy in igt does set MOCS entry 1. Or how exactly do all the
> set_caching vs. rendercpy tests we currently have pass? Just not?

We don't have tests for coherent surfaces. The current rendercpy test
just uses uncached buffers, so they are flushed before checking the
result. I could add a new subtest to rendercpy to test with a
cached/coherent surface.

> Also, you're guaranateeing that opencl/libva don't screw this up either?

If they don't set the given buffer to be cached via the set_caching
IOCTL (as a consequence making them coherent) they are already screwed
on CHV. If they call the IOCTL they are fine on BXT too.

> It's UABI, and we've botched it. Smells like we need to have a lot more
> solid story and make sure we get this right this time around. There's also
> the "how does dma-buf mmap support fit in" question.
> 
> Oh and nope, none of the relevant testcases are in BAT at all.
> 
> > #2: cached everywhere (L3 + (e)LLC if it exists), non-coherent
> > #3: cached everywhere, coherent
> > 
> > I'm not sure if there is even a user for coherent surfaces atm, so then
> > we could delay adding #3 until it's really needed.
> 
> The problem with the above is that the various access paths on SoC
> (without LLC) and big core (with llc) are massively different. You need
> lots of different cases in upload/download paths for max speed anyway.

The current assumption everywhere in user space is the above three
entries with their definition above as far as I know. There are no
users that would use a MOCS entry to get a coherent surface, we could
add that as a 4th entry once a user arises.

--Imre