[Intel-gfx] [passive aggressive RESEND 1/2] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Jun 16 12:05:21 UTC 2017
On 16/06/2017 12:26, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-06-15 15:28:10)
>>
>> On 15/06/2017 13:38, Chris Wilson wrote:
>>> Currently, we only mark the CPU cache as dirty if we skip a clflush.
>>> This leads to some confusion where we have to ask if the object is in
>>> the write domain or missed a clflush. If we always mark the cache as
>>> dirty, this becomes a much simpler question to answer.
>>>
>>> The goal remains to do as few clflushes as required and to do them as
>>> late as possible, in the hope of deferring the work to a kthread and not
>>> blocking the caller (e.g. execbuf, flips).
>>>
>>> v2: Always call clflush before GPU execution when the cache_dirty flag
>>> is set. This may cause some extra work on llc systems that migrate dirty
>>> buffers back and forth - but we do try to limit that by only settin
>>
>> settin*g*
>>
>> Subject sounded like a reviewer is wanted so I thought I'd give it a try.
>>
>>> cache_dirty at the end of the gpu sequence.
>>>
>>> v3: Always mark the cache as dirty upon a level change, as we need to
>>> invalidate any stale cachelines due to external writes.
>>>
>>> Reported-by: Dongwon Kim <dongwon.kim at intel.com>
>>> Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Dongwon Kim <dongwon.kim at intel.com>
>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>> Tested-by: Dongwon Kim <dongwon.kim at intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem.c | 76 ++++++++++++++----------
>>> drivers/gpu/drm/i915/i915_gem_clflush.c | 15 +++--
>>> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 21 +++----
>>> drivers/gpu/drm/i915/i915_gem_internal.c | 3 +-
>>> drivers/gpu/drm/i915/i915_gem_userptr.c | 5 +-
>>> drivers/gpu/drm/i915/selftests/huge_gem_object.c | 3 +-
>>> 6 files changed, 67 insertions(+), 56 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 31cbe78171a9..b1504a829c6a 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
>>>
>>> static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>> {
>>> - if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
>>> + if (obj->cache_dirty)
>>> return false;
>>>
>>> if (!i915_gem_object_is_coherent(obj))
>>> @@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
>>> return st;
>>> }
>>>
>>> +static void __start_cpu_write(struct drm_i915_gem_object *obj)
>>> +{
>>> + obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> + obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> + if (cpu_write_needs_clflush(obj))
>>> + obj->cache_dirty = true;
>>
>> I find the logic here quite hard to understand because
>> cpu_write_needs_clflush is actually consulting obj->cache_dirty so it's
>> all a bit recursive.
>>
>> If we start a cpu write on an object which has been clflushed, but
>> otherwise needs clflushing like pin_display - what is the correct thing
>> to do? Not set obj->cache_dirty to true?
>
> No, we want cache_dirty to be true as we want a clflush after the write.
Right, but I think it will be false.
0. pin_display=true
1. clflush from somewhere: cache_dirty=false
2. __start_cpu_write: cpu_write_needs_clflush returns false ->
cache_dirty=false after __start_write
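
For reference, this is how I read the two helpers after this patch - 
reconstructed from the hunks above; the trailing pin_display check is my 
assumption about the unchanged tail of cpu_write_needs_clflush, so treat 
it as a sketch rather than gospel:

static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
{
	if (obj->cache_dirty)
		return false;

	if (!i915_gem_object_is_coherent(obj))
		return true;

	return obj->pin_display; /* assumed unchanged tail of the helper */
}

static void __start_cpu_write(struct drm_i915_gem_object *obj)
{
	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
	if (cpu_write_needs_clflush(obj))
		obj->cache_dirty = true;
}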
> Really that obj->pin_display only exists there for the transition inside
> set-cache-level for i915_gem_object_pin_to_display_plane. Outside of
> set-cache-level obj->cache_coherent will be accurate. I can pencil in
> making yet another change to limit use of obj->pin_display again.
Like modifying obj->cache_coherent on pin/unpin to display?
>>> +}
>>> +
>>> static void
>>> __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>> struct sg_table *pages,
>>> @@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>>> !i915_gem_object_is_coherent(obj))
>>> drm_clflush_sg(pages);
>>>
>>> - obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> - obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> + __start_cpu_write(obj);
>>> }
>>>
>>> static void
>>> @@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
>>> args->size, &args->handle);
>>> }
>>>
>>> +static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>> +{
>>> + return !(obj->cache_level == I915_CACHE_NONE ||
>>> + obj->cache_level == I915_CACHE_WT);
>>
>> Hm, is this reversed? On LLC I would assume we don't need explicit
>> clflush, but perhaps I am misunderstanding something.
>
> It's accurate. We all shared the same misunderstanding for several
> years. The gpu writes via the ring (and the shared llc) so if we
> transition to accessing the memory directly from the cpu, we bypass the
> shared llc and our data is stale. So after using LLC, be that either
> with CPU writes or GPU writes, we need to clflush.
Why does accessing the memory from the CPU bypass the LLC? Or do you mean 
if we access it in a way that bypasses the LLC, and if so, which ways are those?
>
>>> +}
>>> +
>>> /**
>>> * Creates a new mm object and returns a handle to it.
>>> * @dev: drm device pointer
>>> @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>>> case I915_GEM_DOMAIN_CPU:
>>> i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
>>> break;
>>> +
>>> + case I915_GEM_DOMAIN_RENDER:
>>> + if (gpu_write_needs_clflush(obj))
>>> + obj->cache_dirty = true;
>>> + break;
>>> }
>>>
>>> obj->base.write_domain = 0;
>>> @@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
>>> * optimizes for the case when the gpu will dirty the data
>>> * anyway again before the next pread happens.
>>> */
>>> - if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>> + if (!obj->cache_dirty &&
>>> + !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>>> *needs_clflush = CLFLUSH_BEFORE;
>>
>> Obviously I don't understand something critical here since once more I
>> was expecting the reverse here - if the cache is not dirty no need to
>> flush it.
>
> This is an invalidate. We need to make sure that any stray cachelines the
> CPU pulled in are invalidated prior to accessing the physical address
> which was written to by the GPU out of sight of the CPU.
Why don't we need to do that if the CPU cache is dirty?
>>> + */
>>> + flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>>> + if (obj->cache_dirty)
>>> + i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
>>> obj->base.write_domain = 0;
>>> }
>>>
>>> @@ -3657,13 +3681,10 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>>> }
>>> }
>>>
>>> - if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
>>> - i915_gem_object_is_coherent(obj))
>>> - obj->cache_dirty = true;
>>> -
>>> list_for_each_entry(vma, &obj->vma_list, obj_link)
>>> vma->node.color = cache_level;
>>> obj->cache_level = cache_level;
>>> + obj->cache_dirty = true; /* Always invalidate stale cachelines */
>>>
>>> return 0;
>>> }
>>> @@ -3885,9 +3906,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>>> if (ret)
>>> return ret;
>>>
>>> - if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
>>> - return 0;
>>> -
>>> flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>>>
>>> /* Flush the CPU cache if it's still invalid. */
>>> @@ -3899,15 +3917,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>>> /* It should now be out of any other write domains, and we can update
>>> * the domain values for our changes.
>>> */
>>> - GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
>>> + GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
>>>
>>> /* If we're writing through the CPU, then the GPU read domains will
>>> * need to be invalidated at next use.
>>> */
>>> - if (write) {
>>> - obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>>> - obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>>> - }
>>> + if (write)
>>> + __start_cpu_write(obj);
>>>
>>> return 0;
>>> }
>>> @@ -4328,6 +4344,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
>>> } else
>>> obj->cache_level = I915_CACHE_NONE;
>>>
>>> + obj->cache_dirty = !i915_gem_object_is_coherent(obj);
>>
>> Could this, here and elsewhere, together with setting of read and write
>> cpu domains, be consolidated to i915_gem_object_init?
>
> Possibly, I did consider it, but the freedom in which callers mixed the
> domains reduced the desire to refactor.
Ok.
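
(For the record, the consolidation I had in mind was roughly the below - a 
purely hypothetical helper, name and placement invented for illustration, 
not something this patch has to do:)

/* Hypothetical: set up CPU domain tracking and cache_dirty in one place
 * at object init time, instead of repeating it at each creation site.
 */
static void i915_gem_object_init_domains(struct drm_i915_gem_object *obj)
{
	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
}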
Regards,
Tvrtko