[Intel-gfx] [libdrm PATCH] intel: Make unsynchronized GTT mappings work on systems with snooping.

Sun Mar 12 17:19:17 UTC 2017

On Sun, Mar 12, 2017 at 01:21:12PM +0000, Chris Wilson wrote:
> On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > had the surprising behavior of doing a synchronized GTT mapping.
> > This is obviously not what the user of the API wanted.
> > 
> > Eric left a comment indicating a valid concern: if the CPU and GPU
> > caches are incoherent, we don't keep track of where the user last
> > mapped the buffer, and what caches might contain relevant data.
> 
> Note this is an issue in libdrm_intel not tracking the cache domain
> transitions. Even just a switch between cpu and coherent would solve the
> majority of that - the caveat being shared bo where the tracking is
> incomplete.
>  
> > Modern Atom systems still don't have LLC, but they do offer snooping,
> > which effectively makes the caches coherent.  The kernel appears to
> > set up the PTE/PPAT to enable snooping for everything where the cache
> > level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> > are marked as uncached.
> 
> Byt, bsw beg to differ. I don't have a bxt to know the results of the
> igt/kernel tests.

Just give me a list of the tests to run (and, if any, what patches
to apply and the debugging level you want enabled) and I'll provide
the necessary results.

> > Any buffers used by scanout should be flagged as non-reusable with
> > drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> > assume that any reusable buffer should be snooped.
> 
> Not really, there is no reason why scanout buffers can't be reused.
>  
> > This patch enables unsynchronized mappings for reusable buffers
> > on all Gen6+ hardware (which have either LLC or snooping).
> > 
> > On Broxton, this improves the performance of Unigine Valley 1.0
> > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > (same settings) by about 53%.
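
[Illustration] The policy proposed in the patch description above can be
sketched as a small predicate. This is only a rough sketch, not the code
from the patch: the struct, its fields, and the function name are
hypothetical and are not libdrm API. It also encodes the patch's assumption
that every reusable buffer on Gen6+ is snooped, which the replies in this
thread dispute.

```c
#include <stdbool.h>

/* Hypothetical summary of device and buffer state; NOT libdrm's real structs. */
struct map_policy_input {
    int gen;        /* hardware generation, e.g. 6 for the first Gen6 parts */
    bool has_llc;   /* CPU and GPU share a last-level cache */
    bool reusable;  /* never passed to disable_reuse, prime export, or flink */
};

/*
 * Returns true when an unsynchronized GTT mapping would be handed out
 * under the proposed policy, false when the mapping would still fall
 * back to a synchronized one.
 */
static bool may_map_unsynchronized(const struct map_policy_input *in)
{
    /* LLC systems: caches are coherent, so no synchronization is needed. */
    if (in->has_llc)
        return true;

    /* Assumption from the patch text: Gen6+ reusable buffers are snooped. */
    return in->gen >= 6 && in->reusable;
}
```

Under this sketch a non-LLC Gen9 part with a reusable buffer gets the
unsynchronized path, while a buffer that was exported or flinked does not.
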
> 
> Does anyone have figures for gtt performance on bxt - does it cover over
> the same performance penalty from earlier atoms? Basically why bother to
> enable this over wc mapping (no stalls for a contended, limited
> resource) + detiling. (Just note that for detiling Y to WC you need to
> use a temporary cacheable page, or rearrange the code to make sure the
> reads/writes are in 64 byte chunks.) 
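
[Illustration] One reading of the 64-byte remark above, for the upload
direction (linear source written into a Y-tiled WC mapping): assemble each
destination cacheline in a small cacheable staging buffer, then write it out
in one piece. This is a minimal sketch under stated assumptions: a single
4 KiB Y tile of 128 bytes x 32 rows built from 16-byte-wide columns, no
bit-6 swizzling, and hypothetical function names. memcpy does not guarantee
a single 64-byte store; the sketch only shows the access ordering.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    YTILE_WIDTH  = 128,  /* bytes per tile row */
    YTILE_HEIGHT = 32,   /* rows per tile */
    YTILE_SPAN   = 16,   /* bytes per row within one column */
    YTILE_COLUMN = YTILE_SPAN * YTILE_HEIGHT,   /* 512 bytes per column */
    YTILE_SIZE   = YTILE_WIDTH * YTILE_HEIGHT,  /* 4096 bytes per tile */
    CACHELINE    = 64
};

/* Byte offset inside a Y tile for linear byte position (x, y); no swizzle. */
static size_t ytile_offset(size_t x, size_t y)
{
    return (x / YTILE_SPAN) * YTILE_COLUMN + y * YTILE_SPAN + (x % YTILE_SPAN);
}

/*
 * Copy one tile's worth of linear data (src_pitch bytes per row) into a
 * tiled destination, visiting the destination in 64-byte order. Each
 * cacheline of a Y tile holds four consecutive rows of one column.
 */
static void tile_y_by_cacheline(uint8_t *wc_dst, const uint8_t *linear_src,
                                size_t src_pitch)
{
    uint8_t line[CACHELINE];  /* cacheable staging buffer */

    for (size_t off = 0; off < YTILE_SIZE; off += CACHELINE) {
        size_t col = off / YTILE_COLUMN;
        size_t row = (off % YTILE_COLUMN) / YTILE_SPAN;

        /* Gather four 16-byte spans from the linear source. */
        for (size_t r = 0; r < CACHELINE / YTILE_SPAN; r++)
            memcpy(line + r * YTILE_SPAN,
                   linear_src + (row + r) * src_pitch + col * YTILE_SPAN,
                   YTILE_SPAN);

        /* Write the whole cacheline to the destination in one go. */
        memcpy(wc_dst + off, line, CACHELINE);
    }
}
```
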

Again, I can run any tests you'd like to get numbers from,
just give me a list.

Kind regards, David