[Intel-gfx] [passive aggressive RESEND 1/2] drm/i915: Mark CPU cache as dirty on every transition for CPU writes

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Jun 16 12:05:21 UTC 2017


On 16/06/2017 12:26, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-06-15 15:28:10)
>>
>> On 15/06/2017 13:38, Chris Wilson wrote:
>>> Currently, we only mark the CPU cache as dirty if we skip a clflush.
>>> This leads to some confusion where we have to ask if the object is in
>>> the write domain or missed a clflush. If we always mark the cache as
>>> dirty, this becomes a much simpler question to answer.
>>>
>>> The goal remains to do as few clflushes as required and to do them as
>>> late as possible, in the hope of deferring the work to a kthread and not
>>> block the caller (e.g. execbuf, flips).
>>>
>>> v2: Always call clflush before GPU execution when the cache_dirty flag
>>> is set. This may cause some extra work on llc systems that migrate dirty
>>> buffers back and forth - but we do try to limit that by only settin
>>
>> settin*g*
>>
>> The subject sounded like a reviewer was wanted, so I thought I'd
>> give it a try.
>>
>>> cache_dirty at the end of the gpu sequence.
>>>
>>> v3: Always mark the cache as dirty upon a level change, as we need to
>>> invalidate any stale cachelines due to external writes.
>>>
>>> Reported-by: Dongwon Kim <dongwon.kim at intel.com>
>>> Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Dongwon Kim <dongwon.kim at intel.com>
>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>> Tested-by: Dongwon Kim <dongwon.kim at intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem.c                  | 76 ++++++++++++++----------
>>>    drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
>>>    drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
>>>    drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
>>>    drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
>>>    drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
>>>    6 files changed, 67 insertions(+), 56 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 31cbe78171a9..b1504a829c6a 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
>>>    
>>>    static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>>    {
>>> -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
>>> +     if (obj->cache_dirty)
>>>                return false;
>>>
>>>        if (!i915_gem_object_is_coherent(obj))
>>> @@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
>>>        return st;
>>>    }
>>>    
>>> +static void __start_cpu_write(struct drm_i915_gem_object *obj)
>>> +{
>>> +     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> +     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> +     if (cpu_write_needs_clflush(obj))
>>> +             obj->cache_dirty = true;
>>
>> I find the logic here quite hard to understand because
>> cpu_write_needs_clflush is actually consulting obj->cache_dirty so it's
>> all a bit recursive.
>>
>> If we start a cpu write on an object which has been clflushed, but
>> otherwise needs clflushing like pin_display - what is the correct thing
>> to do? Not set obj->cache_dirty to true?
> 
> No, we want cache_dirty to be true as we want a clflush after the write.

Right, but I think it will be false.

0. pin_display=true
1. clflush from somewhere: cache_dirty=false
2. __start_cpu_write: cpu_write_needs_clflush returns false ->
cache_dirty=false after __start_cpu_write
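I.e. spelled out as a sketch (assuming the remainder of
cpu_write_needs_clflush() - which isn't quoted above - ends up
returning false for this object):

  /* 0. object is pinned for scanout */
  obj->pin_display = true;

  /* 1. a clflush from somewhere clears the dirty flag */
  i915_gem_clflush_object(obj, flags);    /* obj->cache_dirty == false */

  /* 2. a CPU write begins */
  __start_cpu_write(obj);
  /* cpu_write_needs_clflush() is entered with cache_dirty == false and,
   * per the concern above, returns false - so cache_dirty is still
   * false after the CPU write has started and no flush will follow. */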

> Really that obj->pin_display only exists there for the transition inside
> set-cache-level for i915_gem_object_pin_to_display_plane. Outside of
> set-cache-level obj->cache_coherent will be accurate. I can pencil in
> making yet another change to limit use of obj->pin_display again.

Like modifying obj->cache_coherent on pin/unpin to display?
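Something like this, perhaps (purely illustrative - the placement and
the exact field updates are guesses on my part):

  /* hypothetical: treat a scanout pin as a coherency downgrade */
  i915_gem_object_pin_to_display_plane(...)
  {
          ...
          obj->cache_coherent = false; /* display does not snoop the LLC */
          ...
  }

  i915_gem_object_unpin_from_display_plane(...)
  {
          ...
          obj->cache_coherent = i915_gem_object_is_coherent(obj);
          ...
  }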

>>> +}
>>> +
>>>    static void
>>>    __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>>                                struct sg_table *pages,
>>> @@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>>            !i915_gem_object_is_coherent(obj))
>>>                drm_clflush_sg(pages);
>>>    
>>> -     obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> -     obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> +     __start_cpu_write(obj);
>>>    }
>>>    
>>>    static void
>>> @@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
>>>                               args->size, &args->handle);
>>>    }
>>>    
>>> +static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>> +{
>>> +     return !(obj->cache_level == I915_CACHE_NONE ||
>>> +              obj->cache_level == I915_CACHE_WT);
>>
>> Hm, is this reversed? On LLC I would assume we don't need explicit
>> clflush, but perhaps I am misunderstanding something.
> 
> It's accurate. We all shared the same misunderstanding for several
> years. The gpu writes via the ring (and the shared llc) so if we
> transition to accessing the memory directly from the cpu, we bypass the
> shared llc and our data is stale. So after using the llc, be that with
> either CPU writes or GPU writes, we need to clflush.

Why does accessing the memory from the CPU bypass the LLC? Or do you
mean if we access it in a way that bypasses the LLC - and if so, which
ways are those?
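(For reference, the flush itself is just a clflush loop over the
affected lines - roughly what drm_clflush_virt_range() boils down to on
x86. A userspace-flavoured sketch, assuming a 64-byte cacheline:)

  #include <emmintrin.h> /* _mm_clflush, _mm_mfence */
  #include <stddef.h>

  /* Write back and invalidate every cacheline covering
   * [addr, addr + len), so that a subsequent access which bypasses the
   * LLC (a WC/GTT mmap, or the display engine scanning out of memory)
   * sees the latest data. */
  static void flush_lines(void *addr, size_t len)
  {
          char *p = addr;
          size_t i;

          for (i = 0; i < len; i += 64)
                  _mm_clflush(p + i);
          _mm_mfence(); /* order the flushes before the next access */
  }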

> 
>>> +}
>>> +
>>>    /**
>>>     * Creates a new mm object and returns a handle to it.
>>>     * @dev: drm device pointer
>>> @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>>>        case I915_GEM_DOMAIN_CPU:
>>>                i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
>>>                break;
>>> +
>>> +     case I915_GEM_DOMAIN_RENDER:
>>> +             if (gpu_write_needs_clflush(obj))
>>> +                     obj->cache_dirty = true;
>>> +             break;
>>>        }
>>>    
>>>        obj->base.write_domain = 0;
>>> @@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
>>>         * optimizes for the case when the gpu will dirty the data
>>>         * anyway again before the next pread happens.
>>>         */
>>> -     if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>> +     if (!obj->cache_dirty &&
>>> +         !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>>                *needs_clflush = CLFLUSH_BEFORE;
>>
>> Obviously I don't understand something critical here, since once more
>> I was expecting the reverse - if the cache is not dirty, there is no
>> need to flush it.
> 
> This is an invalidate. We need to make sure that any stray cachelines
> the CPU pulled in are invalidated prior to accessing the physical
> address which was written to by the GPU out of sight of the CPU.

Why don't we need to do that if the CPU cache is dirty?
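(My inference, not stated explicitly here: if cache_dirty is set then a
clflush is already owed before anyone else touches the pages, and since
clflush both writes back and invalidates, the pending flush subsumes
the invalidate - hence the check reads:)

  if (!obj->cache_dirty &&
      !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
          *needs_clflush = CLFLUSH_BEFORE;
  /* cache_dirty set  -> a writeback+invalidate clflush is pending anyway;
   * CPU read domain  -> the cache is already coherent with memory;
   * otherwise        -> invalidate before the CPU reads GPU output. */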

>>> +      */
>>> +     flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>>> +     if (obj->cache_dirty)
>>> +             i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
>>>        obj->base.write_domain = 0;
>>>    }
>>>    
>>> @@ -3657,13 +3681,10 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>>>                }
>>>        }
>>>    
>>> -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
>>> -         i915_gem_object_is_coherent(obj))
>>> -             obj->cache_dirty = true;
>>> -
>>>        list_for_each_entry(vma, &obj->vma_list, obj_link)
>>>                vma->node.color = cache_level;
>>>        obj->cache_level = cache_level;
>>> +     obj->cache_dirty = true; /* Always invalidate stale cachelines */
>>>    
>>>        return 0;
>>>    }
>>> @@ -3885,9 +3906,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>>>        if (ret)
>>>                return ret;
>>>    
>>> -     if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
>>> -             return 0;
>>> -
>>>        flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>>>    
>>>        /* Flush the CPU cache if it's still invalid. */
>>> @@ -3899,15 +3917,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>>>        /* It should now be out of any other write domains, and we can update
>>>         * the domain values for our changes.
>>>         */
>>> -     GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
>>> +     GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
>>>    
>>>        /* If we're writing through the CPU, then the GPU read domains will
>>>         * need to be invalidated at next use.
>>>         */
>>> -     if (write) {
>>> -             obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> -             obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> -     }
>>> +     if (write)
>>> +             __start_cpu_write(obj);
>>>    
>>>        return 0;
>>>    }
>>> @@ -4328,6 +4344,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
>>>        } else
>>>                obj->cache_level = I915_CACHE_NONE;
>>>    
>>> +     obj->cache_dirty = !i915_gem_object_is_coherent(obj);
>>
>> Could this, here and elsewhere, together with setting of read and write
>> cpu domains, be consolidated to i915_gem_object_init?
> 
> Possibly, I did consider it, but the freedom in which callers mixed the
> domains reduced the desire to refactor.

Ok.
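(For the record, the consolidation I was thinking of would look roughly
like this - a sketch only, and it assumes obj->cache_level is already
known at init time, which per the above is not true for every caller
today:)

  void i915_gem_object_init(struct drm_i915_gem_object *obj,
                            const struct drm_i915_gem_object_ops *ops)
  {
          ...
          obj->base.read_domains = I915_GEM_DOMAIN_CPU;
          obj->base.write_domain = I915_GEM_DOMAIN_CPU;
          obj->cache_dirty = !i915_gem_object_is_coherent(obj);
          ...
  }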

Regards,

Tvrtko



