[PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
Christian König
christian.koenig at amd.com
Fri Sep 16 05:18:42 UTC 2022
On 2022-09-15 22:37, Andrey Grodzovsky wrote:
>
> On 2022-09-15 15:26, Christian König wrote:
>> On 2022-09-15 20:29, Andrey Grodzovsky wrote:
>>>
>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Hi Christian,
>>>>
>>>> The test sequence is triggering a compute engine hang while running
>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode
>>>> and mode2 reset enabled in the driver.
>>>> When the timeout for the hung compute job happens, the 2 jobs on the gfx
>>>> pending list may be signaled after drm_sched_stop. So they will not
>>>> be removed from the pending list but will have
>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>> be rerun and removed from the pending list.
>>>> At the resubmit step, the second job (with the signaled bit set) will be
>>>> resubmitted. Since it still has the signaled bit set, drm_sched_job_done
>>>> will be called directly. This decreases hw_rq_count, which allows
>>>> more jobs to be emitted, but does not clear the fence_drv rcu pointer.
>>>> This results in an overflow in the fence_drv. Since we use
>>>> num_fences_mask in amdgpu_fence_process, when the overflow happens the
>>>> signaling of some jobs will be skipped, which results in an infinite wait
>>>> for the fence_drv rcu pointer.
>>>>
>>>> So closing the irq before sched_stop could avoid jobs being signaled
>>>> after drm_sched_stop. And signaling jobs one by one in fence_process
>>>> instead of using a mask will handle the overflow situation.
>>>>
>>>> Another fix could be to skip submitting jobs which have already signaled
>>>> during the resubmit stage, which may look cleaner.
>>>>
>>>> Please help give some advice.
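
Just to make the wrap-around scenario above easier to follow, here is a
rough standalone toy model of the fence slot handling (4 slots instead of
the real ring size, plain ints instead of the drv->fences[] RCU pointers;
this is explicitly not the actual driver code, only an illustration):

/* Toy model: a slot stays "busy" until its fence was processed, similar
 * to the drv->fences[] RCU pointers. */
#include <stdint.h>
#include <stdio.h>

#define NUM_FENCES 4u
#define MASK (NUM_FENCES - 1u)

static int slot_busy[NUM_FENCES];

/* Masked walk shaped like today's amdgpu_fence_process(): both counters
 * are masked up front, so when seq - last_seq > NUM_FENCES the loop stops
 * early and some slots are never cleared. */
static void process_masked(uint32_t last_seq, uint32_t seq)
{
    last_seq &= MASK;
    seq &= MASK;
    do {
        ++last_seq;
        last_seq &= MASK;
        slot_busy[last_seq] = 0;
    } while (last_seq != seq);
}

int main(void)
{
    uint32_t i;

    for (i = 0; i < NUM_FENCES; ++i)
        slot_busy[i] = 1;      /* overflow: every slot already in use */

    process_masked(0, 6);      /* 6 & MASK == 2, the walk stops at slot 2 */

    for (i = 0; i < NUM_FENCES; ++i)
        printf("slot %u: %s\n", (unsigned int)i,
               slot_busy[i] ? "still busy" : "free");

    /* slots 0 and 3 stay busy; the next emit that maps to one of them
     * waits forever on the stale fence, which is the deadlock above. */
    return 0;
}

Walking every sequence number and only masking when indexing the slot
array clears all four slots in the same toy situation.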
>>>
>>>
>>> How about the code below instead? The real problem is that we
>>> reuse a dma fence twice, which is not according to the dma fence design,
>>> so maybe this can help?
>>>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 8adeb7469f1e..033f0ae16784 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>> if (job && job->job_run_counter) {
>>> /* reinit seq for resubmitted jobs */
>>> fence->seqno = seq;
>>> +
>>> + /* For resubmitted job clear the signaled bit */
>>> + clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>>> +
>>
>> Upstream will pretty much kill you for that.
>>
>> Re-setting a fence from a signaled to an unsignaled state is a
>> massive no-go.
>>
>> Christian.
>
>
> Is it worse than doing fence->seqno = seq; ? This is already a huge
> hack, no?
No, it's equally bad. I don't think we can do either.
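To illustrate why (a rough sketch only, certainly nothing we could ever
merge): once dma_fence_signal() has run, consumers may already have
returned from dma_fence_wait(), had their callbacks fired or, like
drm_sched_job_done(), dropped hw_rq_count. Clearing the bit afterwards
can't take any of that back:

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/dma-fence.h>

static void consumer(struct dma_fence *f, atomic_t *hw_rq_count)
{
    /* e.g. drm_sched_job_done() style accounting that already ran
     * because the fence read back as signaled */
    if (dma_fence_is_signaled(f))
        atomic_dec(hw_rq_count);
}

static void illegal_unsignal(struct dma_fence *f)
{
    /* flips the fence back to "unsignaled"... */
    clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &f->flags);
    /* ...but every consumer that already sampled the signaled state
     * keeps acting on it, so the fence state and the bookkeeping around
     * it (hw_rq_count, pending list, fence slot) silently diverge. */
}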
Christian.
>
> Andrey
>
>
>>
>>>
>>> /* TO be inline with external fence creation and other drivers */
>>> dma_fence_get(fence);
>>> } else {
>>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Victor
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>> To: Zhao, Victor <Victor.Zhao at amd.com>;
>>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky at amd.com>
>>>> Cc: Deng, Emily <Emily.Deng at amd.com>
>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>
>>>>
>>>>
>>>> On 2022-09-15 06:02, Zhao, Victor wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Ping.
>>>>>
>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>
>>>>> We found some reset related issues during a stress test on this
>>>>> sequence. Please help give some comments.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Victor Zhao <Victor.Zhao at amd.com>
>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Cc: Deng, Emily <Emily.Deng at amd.com>; Grodzovsky, Andrey
>>>>> <Andrey.Grodzovsky at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>
>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>
>>>>> [background]
>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>> the drm_sched_stop stage. The signaled bit will not be cleared.
>>>>>
>>>>> At the resubmit stage after recovery, a job with the hw fence
>>>>> signaled bit set will call job done directly instead of going through
>>>>> fence processing.
>>>>> This makes hw_rq_count decrease but leaves the rcu fence pointer
>>>>> uncleared.
>>>>>
>>>>> Then an overflow happens in the fence driver slots and some jobs may
>>>>> be skipped, leaving their rcu pointers uncleared, which causes an
>>>>> infinite wait for the slot on the next fence emitted.
>>>>>
>>>>> This infinite wait causes a job timeout on the emitting job. And the
>>>>> driver will get stuck at its sched stop step because kthread_park
>>>>> cannot be done.
>>>>>
>>>>> [how]
>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
>>>>>    before drm sched stop
>>>>> 2. handle all fences in fence process to avoid skipping fences when an
>>>>>    overflow happens
>>>>>
>>>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>>>> 2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>> }
>>>>> - amdgpu_fence_driver_isr_toggle(adev, true);
>>>>> -
>>>>> /* block all schedulers and reset given job's ring */
>>>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>> struct amdgpu_ring *ring = adev->rings[i];
>>>>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>> amdgpu_ras_suspend(tmp_adev);
>>>>> + amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>> +
>>>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>> struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>> }
>>>>> - if (need_emergency_restart)
>>>>> + if (need_emergency_restart) {
>>>>> + list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>> + amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>> + }
>>>>> goto skip_sched_resume;
>>>>> + }
>>>>> /*
>>>>> * Must check guilty signal here since after this point all old
>>>>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>> job_signaled = true;
>>>>> dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>>>> + list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>> + amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>> + }
>>>>> goto skip_hw_reset;
>>>>> }
>>>>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> if (r && r == -EAGAIN) {
>>>>> set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>>> adev->asic_reset_res = 0;
>>>>> + amdgpu_fence_driver_isr_toggle(adev, true);
>>>>> goto retry;
>>>>> }
>>>>> }
>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>> set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>> set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>> + amdgpu_fence_driver_isr_toggle(adev, true);
>>>>> +
>>>>> adev->no_hw_access = true;
>>>>> r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>> adev->no_hw_access = false;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>>>> if (unlikely(seq == last_seq))
>>>>> return false;
>>>>> - last_seq &= drv->num_fences_mask;
>>>>> - seq &= drv->num_fences_mask;
>>>>> -
>>>>> do {
>>>>> struct dma_fence *fence, **ptr;
>>>>> ++last_seq;
>>>>> - last_seq &= drv->num_fences_mask;
>>>>> - ptr = &drv->fences[last_seq];
>>>>> + ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>> /* There is always exactly one thread signaling this fence slot */
>>>>> fence = rcu_dereference_protected(*ptr, 1);
>>>> Those changes don't seem to make sense to me. Please explain
>>>> further why they are necessary.
>>>>
>>>> Christian.
>>>>
>>>>> --
>>>>> 2.25.1
>>