[Intel-gfx] [passive aggressive RESEND 1/2] drm/i915: Mark CPU cache as dirty on every transition for CPU writes

Fri Jun 16 11:26:42 UTC 2017

Quoting Tvrtko Ursulin (2017-06-15 15:28:10)
> 
> On 15/06/2017 13:38, Chris Wilson wrote:
> > Currently, we only mark the CPU cache as dirty if we skip a clflush.
> > This leads to some confusion where we have to ask if the object is in
> > the write domain or missed a clflush. If we always mark the cache as
> > dirty, this becomes a much simpler question to answer.
> > 
> > The goal remains to do as few clflushes as required and to do them as
> > late as possible, in the hope of deferring the work to a kthread and not
> > block the caller (e.g. execbuf, flips).
> > 
> > v2: Always call clflush before GPU execution when the cache_dirty flag
> > is set. This may cause some extra work on llc systems that migrate dirty
> > buffers back and forth - but we do try to limit that by only settin
> 
> settin*g*
> 
> Subject sounded like reviewer is wanted so I thought to give it a try.
> 
> > cache_dirty at the end of the gpu sequence.
> > 
> > v3: Always mark the cache as dirty upon a level change, as we need to
> > invalidate any stale cachelines due to external writes.
> > 
> > Reported-by: Dongwon Kim <dongwon.kim at intel.com>
> > Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Dongwon Kim <dongwon.kim at intel.com>
> > Cc: Matt Roper <matthew.d.roper at intel.com>
> > Tested-by: Dongwon Kim <dongwon.kim at intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c                  | 76 ++++++++++++++----------
> >   drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
> >   drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
> >   drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
> >   drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
> >   drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
> >   6 files changed, 67 insertions(+), 56 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 31cbe78171a9..b1504a829c6a 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
> >   
> >   static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> >   {
> > -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
> > +     if (obj->cache_dirty)
> >               return false;
> >   
> >       if (!i915_gem_object_is_coherent(obj))
> > @@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
> >       return st;
> >   }
> >   
> > +static void __start_cpu_write(struct drm_i915_gem_object *obj)
> > +{
> > +     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> > +     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> > +     if (cpu_write_needs_clflush(obj))
> > +             obj->cache_dirty = true;
> 
> I find the logic here quite hard to understand because 
> cpu_write_needs_clflush is actually consulting obj->cache_dirty so it's 
> all a bit recursive.
> 
> If we start a cpu write on an object which has been clflushed, but 
> otherwise needs clflushing like pin_display - what is the correct thing 
> to do? Not set obj->cache_dirty to true?

No, we want cache_dirty to be true as we want a clflush after the write.
Really that obj->pin_display only exists there for the transition inside
set-cache-level for i915_gem_object_pin_to_display_plane. Outside of
set-cache-level obj->cache_coherent will be accurate. I can pencil in
making yet another change to limit use of obj->pin_display again.
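
As a rough userspace sketch of that answer (not driver code): the toy_*
names below are made up, and the tail of the helper (return true for a
non-coherent object, otherwise return pin_display) is an assumption, since
the quoted hunk only shows its first lines. The early return on cache_dirty
only means "already marked, nothing more to record".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DOMAIN_CPU 0x1u

/* Hypothetical stand-in for the object fields discussed in this thread. */
struct toy_obj {
	bool cache_dirty;
	bool coherent;     /* models i915_gem_object_is_coherent() */
	bool pin_display;
	unsigned int read_domains;
	unsigned int write_domain;
};

static bool toy_cpu_write_needs_clflush(const struct toy_obj *obj)
{
	if (obj->cache_dirty)
		return false;  /* already marked dirty, a flush is pending */

	if (!obj->coherent)
		return true;

	return obj->pin_display;  /* assumed tail, not shown in the hunk */
}

static void toy_start_cpu_write(struct toy_obj *obj)
{
	obj->read_domains = TOY_DOMAIN_CPU;
	obj->write_domain = TOY_DOMAIN_CPU;
	if (toy_cpu_write_needs_clflush(obj))
		obj->cache_dirty = true;
}
```

Under these assumptions, an object that was already clflushed
(cache_dirty == false) and is pinned for display ends up with
cache_dirty == true after toy_start_cpu_write(), i.e. a clflush is queued
for after the write.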

> > +}
> > +
> >   static void
> >   __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
> >                               struct sg_table *pages,
> > @@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
> >           !i915_gem_object_is_coherent(obj))
> >               drm_clflush_sg(pages);
> >   
> > -     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> > -     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> > +     __start_cpu_write(obj);
> >   }
> >   
> >   static void
> > @@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
> >                              args->size, &args->handle);
> >   }
> >   
> > +static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > +{
> > +     return !(obj->cache_level == I915_CACHE_NONE ||
> > +              obj->cache_level == I915_CACHE_WT);
> 
> Hm, is this reversed? On LLC I would assume we don't need explicit 
> clflush, but perhaps I am misunderstanding something.

It's accurate. We all shared the same misunderstanding for several
years. The gpu writes via the ring (and the shared llc) so if we
transition to accessing the memory directly from the cpu, we bypass the
shared llc and our data is stale. So after using LLC, be that either
with CPU writes or GPU writes, we need to clflush.
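
A toy model of that rule, for illustration only: the toy_* enum and struct
are hypothetical mirrors of the driver's cache levels, and only the new
I915_GEM_DOMAIN_RENDER case from the hunk below is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the i915 cache levels; toy names, not driver code. */
enum toy_cache_level {
	TOY_CACHE_NONE,    /* uncached */
	TOY_CACHE_LLC,     /* GPU writes land in the shared LLC */
	TOY_CACHE_L3_LLC,
	TOY_CACHE_WT,      /* write-through */
};

struct toy_bo {
	enum toy_cache_level cache_level;
	bool cache_dirty;
};

static bool toy_gpu_write_needs_clflush(const struct toy_bo *bo)
{
	return !(bo->cache_level == TOY_CACHE_NONE ||
		 bo->cache_level == TOY_CACHE_WT);
}

/* Models only the render-domain case: record that a clflush is owed. */
static void toy_flush_render_write(struct toy_bo *bo)
{
	if (toy_gpu_write_needs_clflush(bo))
		bo->cache_dirty = true;
}
```

So a GPU write to an LLC object leaves cache_dirty set, while the same
write to an uncached or write-through object does not.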

> > +}
> > +
> >   /**
> >    * Creates a new mm object and returns a handle to it.
> >    * @dev: drm device pointer
> > @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
> >       case I915_GEM_DOMAIN_CPU:
> >               i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
> >               break;
> > +
> > +     case I915_GEM_DOMAIN_RENDER:
> > +             if (gpu_write_needs_clflush(obj))
> > +                     obj->cache_dirty = true;
> > +             break;
> >       }
> >   
> >       obj->base.write_domain = 0;
> > @@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
> >        * optimizes for the case when the gpu will dirty the data
> >        * anyway again before the next pread happens.
> >        */
> > -     if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
> > +     if (!obj->cache_dirty &&
> > +         !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
> >               *needs_clflush = CLFLUSH_BEFORE;
> 
> Obviously I don't understand something critical here since once more I 
> was expecting the reverse here - if the cache is not dirty no need to 
> flush it.

This is an invalidate. We need to make sure that any stray cachelines the
CPU pulled in are invalidated prior to accessing the physical address
which was written to by the GPU out of sight of the cpu.
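
The condition in the hunk can be read as the small decision function
below. This is a sketch with made-up toy_* names and a hypothetical flag
value; it only encodes the test itself, and how an object with cache_dirty
set gets flushed is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DOMAIN_CPU     0x1u
#define TOY_CLFLUSH_BEFORE 0x1u  /* hypothetical flag value */

struct toy_rd_obj {
	bool cache_dirty;
	unsigned int read_domains;
};

/* Returns the flags the caller should honour before reading via the CPU. */
static unsigned int toy_prepare_read(const struct toy_rd_obj *obj)
{
	unsigned int needs_clflush = 0;

	/* Not dirty and not in the CPU read domain: invalidate stale lines. */
	if (!obj->cache_dirty && !(obj->read_domains & TOY_DOMAIN_CPU))
		needs_clflush = TOY_CLFLUSH_BEFORE;

	return needs_clflush;
}
```

Here the clflush requested before the read serves to drop stale
cachelines, not to write back dirty ones.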

> > +      */
> > +     flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
> > +     if (obj->cache_dirty)
> > +             i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
> >       obj->base.write_domain = 0;
> >   }
> >   
> > @@ -3657,13 +3681,10 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
> >               }
> >       }
> >   
> > -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
> > -         i915_gem_object_is_coherent(obj))
> > -             obj->cache_dirty = true;
> > -
> >       list_for_each_entry(vma, &obj->vma_list, obj_link)
> >               vma->node.color = cache_level;
> >       obj->cache_level = cache_level;
> > +     obj->cache_dirty = true; /* Always invalidate stale cachelines */
> >   
> >       return 0;
> >   }
> > @@ -3885,9 +3906,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
> >       if (ret)
> >               return ret;
> >   
> > -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
> > -             return 0;
> > -
> >       flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
> >   
> >       /* Flush the CPU cache if it's still invalid. */
> > @@ -3899,15 +3917,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
> >       /* It should now be out of any other write domains, and we can update
> >        * the domain values for our changes.
> >        */
> > -     GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
> > +     GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
> >   
> >       /* If we're writing through the CPU, then the GPU read domains will
> >        * need to be invalidated at next use.
> >        */
> > -     if (write) {
> > -             obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> > -             obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> > -     }
> > +     if (write)
> > +             __start_cpu_write(obj);
> >   
> >       return 0;
> >   }
> > @@ -4328,6 +4344,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
> >       } else
> >               obj->cache_level = I915_CACHE_NONE;
> >   
> > +     obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> 
> Could this, here and elsewhere, together with setting of read and write 
> cpu domains, be consolidated to i915_gem_object_init?

Possibly, I did consider it, but the freedom with which callers mixed the
domains reduced the desire to refactor.
-Chris