[PATCH] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout

Thu Oct 24 09:15:22 UTC 2024

On 10/24/2024 4:02 AM, Ghimiray, Himal Prasad wrote:
>
>
> On 23-10-2024 20:43, Nirmoy Das wrote:
>> Flush the g2h worker explicitly if a TLB timeout happens. This is
>> observed on LNL and points to a recent scheduling issue with E-cores.
>> This is similar to the recent fix:
>> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
>> response timeout") and should be removed once there is an E-core
>> scheduling fix.
>>
>> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
>
>
> The issue is not limited to LNL; it is also observed on BMG and
> DG2. As far as I know, the other host CPUs do not have E-cores, so the
> reason for the failure on BMG won’t be the same.

BMG and DG2 could be running on Alder Lake or later hosts with E-cores, but I hope we don't have the
same scheduling issue as on LNL.

On DG2 and BMG, the timeout happens after GT suspend, which shouldn't happen as we hold the
pm reference, so that is a different issue that I have to look into.

> In my opinion, we should limit
> this workaround to LNL and continue debugging BMG to find the root
> cause.
>
> It would probably be better to add a platform check to e51527233804 as well.

Yes, a platform check is a good idea. I will add that.
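
A minimal sketch of what such a platform check might look like, written as a standalone userspace mock rather than driver code. Every name in it (mock_platform, mock_gt, needs_g2h_flush_wa, mock_flush_work, mock_tlb_fence_timeout) is a hypothetical stand-in and not the xe driver's API; in the driver, the check would sit in front of the flush_work() call from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical platform ids; not the xe driver's enum. */
enum mock_platform { MOCK_DG2, MOCK_BMG, MOCK_LNL };

/* Hypothetical stand-in for the GT; counts how often the worker was flushed. */
struct mock_gt {
	enum mock_platform platform;
	int flush_count;
};

/* The workaround is wanted only where the E-core scheduling delay was seen. */
static bool needs_g2h_flush_wa(enum mock_platform p)
{
	return p == MOCK_LNL;
}

/* Stand-in for flush_work(&gt->uc.guc.ct.g2h_worker). */
static void mock_flush_work(struct mock_gt *gt)
{
	gt->flush_count++;
}

/* Start of the timeout handler: flush only on the affected platform. */
static void mock_tlb_fence_timeout(struct mock_gt *gt)
{
	assert(gt != NULL);
	if (needs_g2h_flush_wa(gt->platform))
		mock_flush_work(gt);
	/* ... the pending-fence walk would follow here ... */
}
```

With this shape, BMG and DG2 keep reporting their timeouts unchanged, so the separate suspend-related issue stays visible while it is being debugged.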

>
> On BMG and DG2:
> https://patchwork.freedesktop.org/series/140267/ series from Matt might help solve this.

As I remember, the above change didn't help with the E-core scheduling issue.

Thanks,

Nirmoy

>
> BR
> Himal
>
>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Matthew Auld <matthew.auld at intel.com>
>> Cc: John Harrison <John.C.Harrison at Intel.com>
>> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> index 773de1f08db9..2c327dccbd74 100644
>> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> @@ -72,6 +72,15 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
>>       struct xe_device *xe = gt_to_xe(gt);
>>       struct xe_gt_tlb_invalidation_fence *fence, *next;
>> +    /*
>> +     * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h worker
>> +     * in case of g2h response timeout")
>> +     *
>> +     * TODO: Drop this change once workqueue scheduling delay issue is
>> +     * fixed on LNL Hybrid CPU.
>> +     */
>> +    flush_work(&gt->uc.guc.ct.g2h_worker);
>> +
>>       spin_lock_irq(&gt->tlb_invalidation.pending_lock);
>>       list_for_each_entry_safe(fence, next,
>>                    &gt->tlb_invalidation.pending_fences, link) {
>
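
To illustrate the idea behind the hunk above, here is a standalone userspace sketch, not driver code. All names in it (mock_fence, mock_gt, mock_g2h_worker, mock_flush_work, mock_tlb_fence_timeout) are hypothetical, and the model is simplified: it treats every unsignaled fence as timed out, whereas the real handler walks the pending list under a spinlock. It models a g2h worker that has been queued but not yet scheduled. Flushing it first lets it signal the fences whose responses have already arrived, so only the unanswered ones are reported as timed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FENCES 4

/* Hypothetical stand-in for a TLB invalidation fence. */
struct mock_fence {
	int seqno;
	bool signaled;
};

/* Hypothetical stand-in for the GT state relevant to the timeout. */
struct mock_gt {
	struct mock_fence fences[MAX_FENCES];
	size_t nr_fences;
	int seqno_recv;      /* latest seqno already acknowledged by the firmware */
	bool worker_pending; /* worker is queued but has not run yet */
};

/* Worker body: signal every fence up to the acknowledged seqno. */
static void mock_g2h_worker(struct mock_gt *gt)
{
	for (size_t i = 0; i < gt->nr_fences; i++)
		if (gt->fences[i].seqno <= gt->seqno_recv)
			gt->fences[i].signaled = true;
	gt->worker_pending = false;
}

/* Stand-in for flush_work(): run the worker now if it is still queued. */
static void mock_flush_work(struct mock_gt *gt)
{
	if (gt->worker_pending)
		mock_g2h_worker(gt);
}

/* Returns the number of fences that would be reported as timed out. */
static int mock_tlb_fence_timeout(struct mock_gt *gt, bool flush_first)
{
	int timed_out = 0;

	assert(gt != NULL);
	if (flush_first)
		mock_flush_work(gt);

	for (size_t i = 0; i < gt->nr_fences; i++)
		if (!gt->fences[i].signaled)
			timed_out++;

	return timed_out;
}
```

Calling mock_tlb_fence_timeout() with flush_first set to false corresponds to the behavior before the patch, where a delayed worker makes already-answered invalidations look like timeouts.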