[Intel-gfx] [passive aggressive RESEND 1/2] drm/i915: Mark CPU cache as dirty on every transition for CPU writes

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Jun 16 12:59:38 UTC 2017


On 16/06/2017 13:13, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-06-16 13:05:21)
>>
>> On 16/06/2017 12:26, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2017-06-15 15:28:10)
>>>>
>>>> On 15/06/2017 13:38, Chris Wilson wrote:
>>>>> Currently, we only mark the CPU cache as dirty if we skip a clflush.
>>>>> This leads to some confusion where we have to ask if the object is in
>>>>> the write domain or missed a clflush. If we always mark the cache as
>>>>> dirty, this becomes a much simpler question to answer.
>>>>>
>>>>> The goal remains to do as few clflushes as required and to do them as
>>>>> late as possible, in the hope of deferring the work to a kthread and not
>>>>> block the caller (e.g. execbuf, flips).
>>>>>
>>>>> v2: Always call clflush before GPU execution when the cache_dirty flag
>>>>> is set. This may cause some extra work on llc systems that migrate dirty
>>>>> buffers back and forth - but we do try to limit that by only settin
>>>>
>>>> settin*g*
>>>>
>>>> Subject sounded like a reviewer is wanted, so I thought to give it a try.
>>>>
>>>>> cache_dirty at the end of the gpu sequence.
>>>>>
>>>>> v3: Always mark the cache as dirty upon a level change, as we need to
>>>>> invalidate any stale cachelines due to external writes.
>>>>>
>>>>> Reported-by: Dongwon Kim <dongwon.kim at intel.com>
>>>>> Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Dongwon Kim <dongwon.kim at intel.com>
>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>> Tested-by: Dongwon Kim <dongwon.kim at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/i915_gem.c                  | 76 ++++++++++++++----------
>>>>>     drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
>>>>>     drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
>>>>>     drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
>>>>>     drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
>>>>>     drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
>>>>>     6 files changed, 67 insertions(+), 56 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 31cbe78171a9..b1504a829c6a 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
>>>>>     
>>>>>     static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>>>>     {
>>>>> -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
>>>>> +     if (obj->cache_dirty)
>>>>>                 return false;
>>>>>
>>>>>         if (!i915_gem_object_is_coherent(obj))
>>>>> @@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
>>>>>         return st;
>>>>>     }
>>>>>     
>>>>> +static void __start_cpu_write(struct drm_i915_gem_object *obj)
>>>>> +{
>>>>> +     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>>>> +     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>>>> +     if (cpu_write_needs_clflush(obj))
>>>>> +             obj->cache_dirty = true;
>>>>
>>>> I find the logic here quite hard to understand because
>>>> cpu_write_needs_clflush is actually consulting obj->cache_dirty so it's
>>>> all a bit recursive.
>>>>
>>>> If we start a cpu write on an object which has been clflushed, but
>>>> otherwise needs clflushing like pin_display - what is the correct thing
>>>> to do? Not set obj->cache_dirty to true?
>>>
>>> No, we want cache_dirty to be true as we want a clflush after the write.
>>
>> Right, but I think it will be false.
>>
>> 0. pin_display=true
>> 1. clflush from somewhere: cache_dirty=false
>> 2. __start_write: cpu_write_needs_clflush returns false ->
>> cache_dirty=false after __start_write
> 
> ? cpu_write_needs_clflush() returns true if either the object is not
> cache coherent or pin_display is set. It's that we return true for
> pin_display being set that is the big hammer.

My bad.. was imagining things. :(
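
For reference, with this patch the predicate ends up looking roughly like
the below, going by the hunk above plus your description (just a sketch
reassembled from the hunks, not a further change):

    static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
    {
            if (obj->cache_dirty)
                    return false;   /* flush already pending, we own the cache */

            if (!i915_gem_object_is_coherent(obj))
                    return true;    /* non-coherent objects always want the flush */

            /* coherent, but possibly scanned out behind the LLC's back */
            return obj->pin_display;
    }

So in my step 2 it returns true and __start_cpu_write does mark the cache
dirty after all.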

>>> Really that obj->pin_display only exists there for the transition inside
>>> set-cache-level for i915_gem_object_pin_to_display_plane. Outside of
>>> set-cache-level obj->cache_coherent will be accurate. I can pencil in
>>> making yet another change to limit use of obj->pin_display again.
>>
>> Like modifying obj->cache_coherent on pin/unpin to display?
> 
> Yes, we just have to consider when to mark the cache as dirty for the
> clflush and for that we take pin_display into consideration (mostly for
> paranoia).
> 
>>>>> +}
>>>>> +
>>>>>     static void
>>>>>     __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>>>>                                 struct sg_table *pages,
>>>>> @@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>>>>             !i915_gem_object_is_coherent(obj))
>>>>>                 drm_clflush_sg(pages);
>>>>>     
>>>>> -     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>>>> -     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>>>> +     __start_cpu_write(obj);
>>>>>     }
>>>>>     
>>>>>     static void
>>>>> @@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
>>>>>                                args->size, &args->handle);
>>>>>     }
>>>>>     
>>>>> +static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>>>> +{
>>>>> +     return !(obj->cache_level == I915_CACHE_NONE ||
>>>>> +              obj->cache_level == I915_CACHE_WT);
>>>>
>>>> Hm, is this reversed? On LLC I would assume we don't need explicit
>>>> clflush, but perhaps I am misunderstanding something.
>>>
>>> It's accurate. We all shared the same misunderstanding for several
>>> years. The gpu writes via the ring (and the shared llc) so if we
>>> transition to accessing the memory directly from the cpu, we bypass the
>>> shared llc and our data is stale. So after using LLC, be that either
>>> with CPU writes or GPU writes, we need to clflush.
>>
>> Why does accessing the memory from the CPU bypass the LLC? Or do you mean
>> if we access it in a way that bypasses the LLC, and if so which ways are those?
> 
> Any uncached access (WC from the CPU or from scanout) bypasses the LLC.

Think I got it now, thanks.
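
To restate the gpu_write_needs_clflush() hunk above with that in mind
(only annotating the same predicate, not proposing a change):

    static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
    {
            /*
             * NONE and WT writes from the GPU go to (or through to) memory,
             * so they are already visible to WC mmaps and to scanout.  WB/LLC
             * writes may live only in the shared LLC, invisible to any
             * uncached access, hence those need a clflush before such use.
             */
            return !(obj->cache_level == I915_CACHE_NONE ||
                     obj->cache_level == I915_CACHE_WT);
    }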

>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * Creates a new mm object and returns a handle to it.
>>>>>      * @dev: drm device pointer
>>>>> @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>>>>>         case I915_GEM_DOMAIN_CPU:
>>>>>                 i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
>>>>>                 break;
>>>>> +
>>>>> +     case I915_GEM_DOMAIN_RENDER:
>>>>> +             if (gpu_write_needs_clflush(obj))
>>>>> +                     obj->cache_dirty = true;
>>>>> +             break;
>>>>>         }
>>>>>     
>>>>>         obj->base.write_domain = 0;
>>>>> @@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
>>>>>          * optimizes for the case when the gpu will dirty the data
>>>>>          * anyway again before the next pread happens.
>>>>>          */
>>>>> -     if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>>>> +     if (!obj->cache_dirty &&
>>>>> +         !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>>>>                 *needs_clflush = CLFLUSH_BEFORE;
>>>>
>>>> Obviously I don't understand something critical here since once more I
>>>> was expecting the reverse here - if the cache is not dirty no need to
>>>> flush it.
>>>
>>> This is an invalidate. We need to make sure that any stray cachelines the
>>> CPU pulled in are invalidated prior to accessing the physical address
>>> which was written to by the GPU out of sight of the CPU.
>>
>> Why don't we need to do that if the CPU cache is dirty?
> 
> If the CPU cache is dirty it already means we have passed the invalidate
> phase and own the cache. We will be doing the clflush at the completion
> of the write phase.

Okay, got this one as well. Btw, there is what looks to be a stale comment
above this block talking about moving to the GTT.
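
In other words the read-prepare hunk now reads, with that ownership rule
spelled out (again just annotating the quoted code, not suggesting a change):

    /*
     * Invalidate before the CPU read only if we neither already own the
     * cachelines (obj->cache_dirty set, with the flush deferred to the end
     * of the write phase) nor are already in the CPU read domain.
     */
    if (!obj->cache_dirty &&
        !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
            *needs_clflush = CLFLUSH_BEFORE;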

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko

