[PATCH] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Wang, YuBiao YuBiao.Wang at amd.com
Wed Mar 8 03:52:11 UTC 2023


Hi @Tuikov, Luben,

>> >> we have to force signal the non_sched bad job's fence during 
>> >> pre_asic_reset or the clear is not complete.
>>Do you want to add a function which does just this (signals non-scheduler job fences) in amdgpu_device_pre_asic_reset(), and resubmit your patch? (There will be code redundancy, but may make the point clearer.)

On second thought, we cannot move this part out of amdgpu_fence_driver_clear_job_fences, because it clears pointer from RCU and after that only NULL and vm flush fence are left in fence drv, which is designed for reason I'm not aware of. This is also why amdgpu_fence_driver_force_completion doesn't handle the ib test bad jobs during pre_asic_reset. So even if I add a new function I cannot move it to pre_asic_reset as a single step. Also it may be confusing since the name and description could be very closed to amdgpu_fence_driver_force_completion.
So rather than adding a function, is it ok to just add a comment?

Regards & Thanks,
Yubiao

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Wang, YuBiao
Sent: Wednesday, March 8, 2023 11:09 AM
To: Tuikov, Luben <Luben.Tuikov at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Chen, Horace <Horace.Chen at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: RE: [PATCH] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Hi Luben,

I was comparing the bad jobs of failed ib test and the ones that causes the TDR, and I think the main difference is whether it is submitted via drm_sched or not. In simple test cases it doesn't seem to incorrectly signal the fences that shouldn't be signaled. We indeed may need more heavier tests but so far based on static analyze I think I didn't notice the case you mentioned. There's another case using direct job submission during resete, but it happens in recover_vram which happens after the pre_asic reset so I think it won’t be affected.

I'll move this lines into a new function as you suggested and resent a v2 patch.

Regards,
Yubiao Wang

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov at amd.com> 
Sent: Wednesday, March 8, 2023 7:22 AM
To: Koenig, Christian <Christian.Koenig at amd.com>; Wang, YuBiao <YuBiao.Wang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

On 2023-03-07 15:36, Luben Tuikov wrote:
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Thinking about this more, is !job->base.s_fence condition here enough to mean "non-sched jobs like ib_test"?

I feel that it is a bit overloaded here--could we have this condition satisfied,yet we can't willy-nilly signal the fence here?
--
Regards,
Luben



More information about the amd-gfx mailing list