[Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

Fri Jul 21 04:28:56 UTC 2023

>>> [snip]
>>>>> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
>>>>> drm_i915_gem_object *obj)
>>>>
>>>> The code change here looks accurate, but while we're here, I have a
>>>> side question about this function in general...it was originally
>>>> introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
>>>> dirty when used for
>>>> rendering") which states that GPU rendering ends up in the CPU cache
>>>> (and thus needs a clflush later to make sure it lands in memory).
>>>> That makes sense to me for LLC platforms, but is it really true for
>>>> non-LLC snooping platforms (like MTL) as the commit states?
>>>
>>> For non-LLC platforms objects can be set to 1-way coherent which
>>> means GPU rendering ending up in CPU cache as well, so for non-LLC
>>> platform the logic here should be checking 1-way coherent flag.
>>
>> That's the part that I'm questioning (and not just for MTL, but for
>> all of our other non-LLC platforms too).  Just because there's
>> coherency doesn't mean that device writes landed in the CPU cache.
>> Coherency is also achieved if device writes invalidate the contents of the CPU cache.
>> I thought our non-LLC snooping platforms were coherent due to
>> write-invalidate rather than write-update, but I can't find it
>> specifically documented anywhere at the moment.  If write-invalidate
>> was used, then there shouldn't be a need for a later clflush either.
>
> [Trying to consolidate by doing a combined reply to the discussion so far.]
>
> On the write-invalidate vs write-update I don't know. If you did not
> find it in bspec then I doubt I would. I can have a browse still.

Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to
CPU cache, it simply snoop CPU cache on its way to RAM."

>>>> My understanding
>>>> was that snooping platforms just invalidated the CPU cache to
>>>> prevent future CPU reads from seeing stale data but didn't actually
>>>> stick any new data in there?  Am I off track or is the original
>>>> logic of this function not quite right?
>>>>
>>>> Anyway, even if the logic of this function is wrong, it's a mistake
>>>> that would only hurt performance
>>>
>>> Yes, this logic will introduce performance impact because it's
>>> missing the checking for obj->pat_set_by_user. For objects with
>>> pat_set_by_user==true, even if the object is snooping or 1-way
>>> coherent, we don't want to enforce a clflush here since the
>>> coherency is supposed to be handled by user space.
>
> What should I add you think to fix it?

I think the simplest would be

        if (obj->pat_set_by_user)
                return false;

because even checking for incoherent WB is unnecessary, simply no
need for the KMD to initiate a flush if PAT is set by user.

> Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning false?
>
> And then if Matt is correct write-invalidate is used also !HAS_LLC should just return false?
>
>>>> (flushing more often than we truly need to) rather than
>>>> functionality, so not something we really need to dig into right now
>>>> as part of this patch.
>>>>
>>>>>       if (IS_DGFX(i915))
>>>>>               return false;
>>>>>
>>>>> -    /*
>>>>> -     * For objects created by userspace through GEM_CREATE with pat_index
>>>>> -     * set by set_pat extension, i915_gem_object_has_cache_level() will
>>>>> -     * always return true, because the coherency of such object is managed
>>>>> -     * by userspace. Othereise the call here would fall back to checking
>>>>> -     * whether the object is un-cached or write-through.
>>>>> -     */
>>>>> -    return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
>>>>> -             i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
>>>>> +    return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
>>>>> +           i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
>>>>>   }
>>>
>>> [snip]
>>>>> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
>>>>>       if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
>>>>>               return false;
>>>>>
>>>>> -    /*
>>>>> -     * For objects created by userspace through GEM_CREATE with pat_index
>>>>> -     * set by set_pat extension, i915_gem_object_has_cache_level() always
>>>>> -     * return true, otherwise the call would fall back to checking whether
>>>>> -     * the object is un-cached.
>>>>> -     */
>>>>>       return (cache->has_llc ||
>>>>>               obj->cache_dirty ||
>>>>> -            !i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
>>>>> +            i915_gem_object_has_cache_mode(obj,
>>>>> + I915_CACHE_MODE_UC) != 1);
>>>>
>>>> Platforms with relocations and platforms with user-specified PAT
>>>> have no overlap, right?  So a -1 return should be impossible here
>>>> and this is one case where we could just treat the return value as
>>>> a boolean, right?
>>>
>
> Hm no, or maybe. My thinking behind tri-state is to allow a safe option
> for "don't know". In case PAT index to cache mode table is not fully
> populated on some future platform.

That would be a problem in the cache mode table. At least max_pat_index
should have guaranteed the PAT index is sane.

-Fei