[Intel-gfx] [PATCH v2 4/4] drm/i915/: Re-work clflush_write32

Wed Feb 2 16:35:06 UTC 2022

As far as I know, we haven't gotten to the arm implementation yet, since 
we are trying to get i915 compile for arm without using random ifdefs 
and dummy functions.

"Noob question - why is i915 the only driver calling it? Do other GPUs 
never need to flush CPU cache?"

Unfortunately I don't have enough expertise to comfortable answer this 
question. Maybe someone else can chime in here? Lucas? Matt?

On 2022-02-01 8:32 a.m., Tvrtko Ursulin wrote:
>
> On 01/02/2022 15:41, Michael Cheng wrote:
>> Ah, thanks for the clarification! While discussion goes on about the 
>> route you suggested, could we land these patches (after addressing 
>> the reviews) to unblock compiling i915 on arm?
>
> I am 60-40 to no, since follow up can be hard. I'd prefer a little bit 
> of discussion before merging.
>
> Also, what will be the Arm implementation of drm_clflush_virt_range? 
> Noob question - why is i915 the only driver calling it? Do other GPUs 
> never need to flush CPU cache?
>
> Regards,
>
> Tvrtko
>
>> On 2022-02-01 1:25 a.m., Tvrtko Ursulin wrote:
>>>
>>> On 31/01/2022 17:02, Michael Cheng wrote:
>>>> Hey Tvrtko,
>>>>
>>>> Are you saying when adding drm_clflush_virt_range(addr, 
>>>> sizeof(addr), this function forces an x86 code path only? If that 
>>>> is the case, drm_clflush_virt_range(addr, sizeof(addr) currently 
>>>> has ifdefs that seperate out x86 and powerpc, so we can add an 
>>>> ifdef for arm in the near future when needed.
>>>
>>> No, I was noticing that the change you are making in this patch, 
>>> while it indeed fixes a build failure, it is a code path which does 
>>> not get executed on Arm at all.
>>>
>>> So what effectively happens is a single assembly instruction gets 
>>> replaced with a function call on all integrated GPUs up to and 
>>> including Tigerlake.
>>>
>>> That was the slightly annoying part I was referring to and asking 
>>> whether it was discussed before.
>>>
>>> Sadly I don't think there is a super nice solution apart from 
>>> duplicating drm_clflush_virt_range as for example i915_clflush_range 
>>> and having it static inline. That would allow the integrated GPU 
>>> code path to remain of the same performance profile, while solving 
>>> the Arm problem. However it would be code duplication so might be 
>>> frowned upon.
>>>
>>> I'd be tempted to go that route but it is something which needs a 
>>> bit of discussion if that hasn't happened already.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> On 2022-01-31 6:55 a.m., Tvrtko Ursulin wrote:
>>>>> On 28/01/2022 22:10, Michael Cheng wrote:
>>>>>> Use drm_clflush_virt_range instead of clflushopt and remove the 
>>>>>> memory
>>>>>> barrier, since drm_clflush_virt_range takes care of that.
>>>>>>
>>>>>> Signed-off-by: Michael Cheng <michael.cheng at intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 8 +++-----
>>>>>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>>>> index 498b458fd784..0854276ff7ba 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>>>> @@ -1332,10 +1332,8 @@ static void *reloc_vaddr(struct i915_vma 
>>>>>> *vma,
>>>>>>   static void clflush_write32(u32 *addr, u32 value, unsigned int 
>>>>>> flushes)
>>>>>>   {
>>>>>>       if (unlikely(flushes & (CLFLUSH_BEFORE | CLFLUSH_AFTER))) {
>>>>>> -        if (flushes & CLFLUSH_BEFORE) {
>>>>>> -            clflushopt(addr);
>>>>>> -            mb();
>>>>>> -        }
>>>>>> +        if (flushes & CLFLUSH_BEFORE)
>>>>>> +            drm_clflush_virt_range(addr, sizeof(addr));
>>>>>>             *addr = value;
>>>>>>   @@ -1347,7 +1345,7 @@ static void clflush_write32(u32 *addr, 
>>>>>> u32 value, unsigned int flushes)
>>>>>>            * to ensure ordering of clflush wrt to the system.
>>>>>>            */
>>>>>>           if (flushes & CLFLUSH_AFTER)
>>>>>> -            clflushopt(addr);
>>>>>> +            drm_clflush_virt_range(addr, sizeof(addr));
>>>>>>       } else
>>>>>>           *addr = value;
>>>>>>   }
>>>>>
>>>>> Slightly annoying thing here (maybe in some other patches from the 
>>>>> series as well) is that the change adds a function call to x86 
>>>>> only code path, because relocations are not supported on discrete 
>>>>> as per:
>>>>>
>>>>> static in
>>>>> eb_validate_vma(...)
>>>>>         /* Relocations are disallowed for all platforms after 
>>>>> TGL-LP. This
>>>>>          * also covers all platforms with local memory.
>>>>>          */
>>>>>
>>>>>         if (entry->relocation_count &&
>>>>>             GRAPHICS_VER(eb->i915) >= 12 && !IS_TIGERLAKE(eb->i915))
>>>>>                 return -EINVAL;
>>>>>
>>>>> How acceptable would be, for the whole series, to introduce a 
>>>>> static inline i915 cluflush wrapper and so be able to avoid 
>>>>> functions calls on x86? Is this something that has been discussed 
>>>>> and discounted already?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>> P.S. Hmm I am now reminded of my really old per platform build 
>>>>> patches. With them you would be able to compile out large portions 
>>>>> of the driver when building for ARM. Probably like a 3rd if my 
>>>>> memory serves me right.