[Intel-gfx] [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu

Tue Aug 6 20:03:01 CEST 2013

On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote:
> On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> > On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> > > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > > +{
> > > +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> > > +		return true;
> > > +
> > > +	return atomic_read(&obj->fb_count) > 0;
> > > +}
> > 
> > I was thinking a bit about this fb_count thing. The other option would
> > be to track actual scanout activity, which wouldn't be hard to do, but
> > I guess that could shift a bunch of the clflush cost to page flip time.
> > So maybe we don't want that?
> 
> Different series, see the frontbuffer write tracking. As for page flip,
> that is currently the only time we clflush...
>  
> > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> > indication whether scanout is happening. We should add an fb_count
> > check there as well. Maybe we should leave the pin_count check there,
> > so that framebuffers that aren't being scanned out currently can be
> > massaged a bit more freely before a clflush needs to be issued?
> 
> It does do a fb_count check, the pin_count + fb_count is a really strong
> indicator of active scanout.

The fb_count checks is done only through the eventual check in
i915_gem_clflush_object() which is why I didn't think of it.

On non-LLC platforms it also ends up doing the flush for all
non-snooped non-scanout buffers, which would get optimized away if the
fb_count check was where the pin_count check is.

> 
> But sw_finish is truly legacy, so it just has to work and be as simple
> as possible.

I see.

> 
> > > -	if (obj->cache_level == I915_CACHE_NONE &&
> > > -	    obj->tiling_mode == I915_TILING_NONE &&
> > > -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> > > +	if (obj->tiling_mode == I915_TILING_NONE &&
> > > +	    !cpu_write_needs_clflush(obj)) {
> > 
> > This would break non-LLC platforms, I think. Before, GTT writes were
> > used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
> > which makes sense since access via GTT doesn't snoop ever according
> > to the docs. Now I think it'll do GTT writes for snooped non-scanout
> > buffers.
> 
> D'oh. What I was aiming for was to always use shmem writes on snooped
> bo, irrespective of whether it is GPU hot.
> 
> s/!cpu_write_needs_clflush/cpu_write_needs_clflush/

So that means a GTT write for anything non-snooped or scanout. If it's
snooped scanout (I guess we may not really care about such a weird
combination), or non-snooped but in CPU write domain, that would
still have issues with possibly dirty data in the CPU caches.

Might be intersting to try on real hardware what the GTT access
snooping behaviour really is, especially for some modern non-LLC
system like VLV.

> 
> > 
> > >  		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
> > >  		/* Note that the gtt paths might fail with non-page-backed user
> > >  		 * pointers (e.g. gtt mappings when moving data between
> > > @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
> > >  	 * snooping behaviour occurs naturally as the result of our domain
> > >  	 * tracking.
> > >  	 */
> > > -	if (obj->cache_level != I915_CACHE_NONE)
> > > +	if (!cpu_write_needs_clflush(obj))
> > >  		return;
> > 
> > I would actually prefer to kill this check completely, and move the
> > decision whether to do a clflush to the callers.
> 
> I saw, I differ in opinion.

Opinion noted.

>  
> > There are three places that I spotted which lack such a check;
> > i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain()
> > need to flush only when !cpu_cache_is_coherent(), whereas
> > i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as
> > well.
> 
> I like the approach of trying to keep as much of the complexity inside
> i915_gem.c and exporting simple rules to the callers.

OK. Looks like you already took care of i915_gem_object_set_to_cpu_domain()
in patch 2/4, and i915_gem_object_flush_cpu_write_domain() works OK with
the check in i915_gem_clflush_object().

So i915_gem_execbuffer_move_to_gpu() is the only one that may end up
doing some needless flushes with your code.

> 
> > > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> > > index 0584480..cc5eaba 100644
> > > --- a/drivers/gpu/drm/i915/intel_display.c
> > > +++ b/drivers/gpu/drm/i915/intel_display.c
> > > @@ -9443,6 +9443,8 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
> > >  	struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
> > >  
> > >  	drm_framebuffer_cleanup(fb);
> > > +
> > > +	atomic_dec(&intel_fb->obj->fb_count);
> > 
> > Where did the the other atomic_dec() go (was in intel_fbdev_destroy())?
> 
> Missed. I blame the lack of intel_framebuffer_fini(), that would be a
> better patch first.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Ville Syrjälä
Intel OTC