[Intel-gfx] [PATCH v2 2/2] drm/i915/bxt: Fix inadvertent CPU snooping due to incorrect MOCS config

Mon May 2 13:50:11 UTC 2016

On ma, 2016-05-02 at 10:28 +0200, Daniel Vetter wrote:
> On Thu, Apr 28, 2016 at 08:15:24PM +0300, Imre Deak wrote:
> > On to, 2016-04-28 at 16:48 +0200, Daniel Vetter wrote:
> > > On Thu, Apr 28, 2016 at 11:38:55AM +0300, Imre Deak wrote:
> > > > On to, 2016-04-28 at 10:17 +0200, Daniel Vetter wrote:
> > > > > Also, you're guaranateeing that opencl/libva don't screw this up
> > > > > either?
> > > > 
> > > > If they don't set the given buffer to be cached via the set_caching
> > > > IOCTL (as a consequence making them coherent) they are already
> > > > screwed
> > > > on CHV. If they call the IOCTL they are fine on BXT too.
> > > 
> > > We do implicit set_caching when displaying something to something
> > > coherent. To make that work userspace should use the "use PTE" mode by
> > > default, except when they really know what they're doing.
> > > That's also the mode that's supposed to give you the most reasonable
> > > performance. But somehow that mode ended up in MOCS entry 1, so pretty much
> > > guaranteed userspace will get it wrong. Mesa just hit a perf snag, but
> > > might as well have been visual corruption. I think it'd be a lot safe to
> > > make "use PTE" entry 0.
> > 
> > Mesa uses entries 1 and 2. If something else like opencl or libva (or
> > even Mesa for that matter) uses index 0 for PTE pass-through that's a
> > bug on its own. I don't know if this is the case, but it's a separate
> > issue from what I'm trying to fix here.
> > 
> > This isn't about a case where a PTE pass-through entry needs to be
> > provided, but about the case where a cached but non-coherent one is
> > needed. Mesa assumes this to be entry 2 and I don't see why we couldn't
> > make sure that this assumption holds. Note that this entry on BXT could
> > be both a PTE pass-through one as in this patch or LLC-UC.
> 
> Yeah my comment about entry #0 was is a different track of discussion.
> Should still fix it up while we clarify what entries 1&2 really mean.
> 
> When defining entries as "cached" please make triple sure what exactly you
> mean by that. Since eLLC, LLC and on-gpu L3$ are all different caches, in
> different parts of the coherency fabric. And L3$ has functional relevance
> since if that's not set compute features fall apart.

Based on the current MOCS table entry definitions:

index#0 - Not cached in either of GPU-L3, LLC or eLLC. Not coherent with CPU.

index#1 - Cached in GPU-L3. LLC/eLLC cacheability controlled by the kernel (via PTE). Coherency with CPU controlled by the kernel (via PTE).

index#2 - Cached in GPU-L3. Cached in both LLC and eLLC on platforms where these caches exist. Not coherent with CPU.

> So maybe a better definition would be "L3$ cached (useful for
> compute)+performance optimized otherwise for general use+might be
> incoherent depending upon platform" for entry 2.

Do you mean not to cache it in LLC? That would be a change on SKL/KBL
where entry 2 has LLC/eLLC cacheing enabled atm. I think we should keep
that enabled.

> That would make sense and
> covers it all, but imo yours a bit too simple (assuming my understanding
> of cache architecture on gen9 is accurate, they change it all the bloody
> time).

Well, my definition was "cached in all caches available to the GPU in
the given platform and not coherent with the CPU". This also covers it
all, but we can make it more explicit as in the description of index#2
above. We can also come up with a higher level description like
"optimized for speed" as you suggest, although I think performance
would depend anyway on the mapped object type/size. In any case we
should make it clear in the definition what's the significance of
coherency vs. non-coherency (i.e. GPU access overhead).

> Note that e.g. on older platforms we could enable L3$ either in the
> PTE, or in MOCS settings in the batch itself, which is why the "useful for
> compute" only started to become relevant for gen9. And why we needed the
> kernel MOCS patch really.
> 
> Simililarly we'd need to come up with something accurate for #1 and #0
> (and it sounds like they both need to mean PTE default per kernel). Hooray
> for wasting one entry ;-)

Yes I guess entry 0 could and should be redefined (but as a follow-up).

--Imre