[PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

Andrey Grodzovsky andrey.grodzovsky at amd.com
Thu Sep 15 20:37:49 UTC 2022


On 2022-09-15 15:26, Christian König wrote:
> On 15.09.22 at 20:29, Andrey Grodzovsky wrote:
>>
>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>
>>> Hi Christian,
>>>
>>> The test sequence is executing a compute engine hang while running a
>>> lot of containers submitting gfx jobs. We have advanced tdr mode and
>>> mode2 reset enabled on the driver.
>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>> pending list may be signaled after drm_sched_stop. So they will not
>>> be removed from the pending list but will have
>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will be
>>> rerun and removed from the pending list.
>>> At the resubmit step, the second job (with the signaled bit set) will
>>> be resubmitted. Since it still has the signaled bit,
>>> drm_sched_job_done will be called directly. This decreases
>>> hw_rq_count, which allows more jobs to be emitted, but does not clean
>>> the fence_drv rcu ptr.
>>> This results in an overflow in the fence_drv. Since we use
>>> num_fences_mask in amdgpu_fence_process, when the overflow happens,
>>> the signal of some jobs will be skipped, which results in an infinite
>>> wait for the fence_drv rcu ptr.
>>>
>>> So closing the irq before sched_stop could avoid jobs signaling after
>>> drm_sched_stop. And signaling jobs one by one in fence_process,
>>> instead of using a mask, will handle the overflow situation (see the
>>> sketch below).
>>>
>>> Another fix could be to skip submitting jobs which have already
>>> signaled during the resubmit stage, which may look cleaner.
>>>
>>> Please help give some advice.
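
A minimal userspace model of the skip described above, next to the proposed one-by-one walk (slot count and helper names are made up for illustration; this is not the real fence_drv code):

#include <stdio.h>
#include <stdint.h>

#define NUM_FENCES      8u
#define NUM_FENCES_MASK (NUM_FENCES - 1u)

/* Current behaviour: mask last_seq and seq up front, as in
 * amdgpu_fence_process(). Returns how many slots get handled. */
static unsigned int process_masked(uint32_t last_seq, uint32_t seq)
{
        unsigned int handled = 0;

        last_seq &= NUM_FENCES_MASK;
        seq &= NUM_FENCES_MASK;
        do {
                ++last_seq;
                last_seq &= NUM_FENCES_MASK;
                ++handled;      /* signal fences[last_seq] */
        } while (last_seq != seq);
        return handled;
}

/* Proposed behaviour: walk the full seqno range and mask only the
 * slot index, so no completed fence is skipped on overflow. */
static unsigned int process_one_by_one(uint32_t last_seq, uint32_t seq)
{
        unsigned int handled = 0;

        do {
                ++last_seq;
                ++handled;      /* signal fences[last_seq & NUM_FENCES_MASK] */
        } while (last_seq != seq);
        return handled;
}

int main(void)
{
        /* 12 fences completed (seqno jumped from 2 to 14) on 8 slots: */
        printf("masked walk handled %u of 12\n", process_masked(2, 14));
        printf("one-by-one handled  %u of 12\n", process_one_by_one(2, 14));
        return 0;
}

With the gap larger than the slot count, the masked walk here stops after only 4 steps, so 8 completed fences are never signaled and their rcu pointers stay set.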
>>
>>
>> How about the code below instead? The real problem is that we
>> reuse a dma fence twice, which is not in line with the dma_fence
>> design, so maybe this can help?
>>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8adeb7469f1e..033f0ae16784 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>         if (job && job->job_run_counter) {
>>                 /* reinit seq for resubmitted jobs */
>>                 fence->seqno = seq;
>> +
>> +               /* For a resubmitted job, clear the signaled bit */
>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>> +
>
> Upstream will pretty much kill you for that.
>
> Re-setting a fence from a signaled to an unsignaled state is a massive 
> no-go.
>
> Christian.
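
For context, the contract being pointed to here: signaling a dma_fence is one-shot, the callback list is consumed when it fires, and waiters may already have acted on the signal. A toy single-threaded model of what clearing the bit amounts to (illustrative only, not the kernel dma_fence API):

#include <stdio.h>
#include <stdbool.h>

/* Toy fence: a single pending callback stands in for the cb list. */
struct toy_fence {
        bool signaled;
        void (*cb)(void);
};

static void toy_signal(struct toy_fence *f)
{
        if (f->signaled)
                return;                 /* signaling is one-shot */
        f->signaled = true;
        if (f->cb) {
                f->cb();                /* callback list consumed here */
                f->cb = NULL;
        }
}

static void wake_waiter(void)
{
        printf("waiter woken\n");
}

int main(void)
{
        struct toy_fence f = { .signaled = false, .cb = wake_waiter };

        toy_signal(&f);         /* fires wake_waiter exactly once */

        f.signaled = false;     /* what the proposed clear_bit() amounts to */
        f.cb = wake_waiter;     /* a new waiter sees "unsignaled" and attaches */

        /* The hardware already passed this seqno, so nothing ever calls
         * toy_signal() again and the new waiter waits forever. */
        return 0;
}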


Is it worse than doing fence->seqno = seq; ? This is already a huge
hack, no?

Andrey


>
>>
>>                 /* TO be inline with external fence creation and other drivers */
>>                 dma_fence_get(fence);
>>         } else {
>>
>>
>> Andrey
>>
>>
>>>
>>>
>>> Thanks,
>>> Victor
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Thursday, September 15, 2022 2:32 PM
>>> To: Zhao, Victor <Victor.Zhao at amd.com>; 
>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey 
>>> <Andrey.Grodzovsky at amd.com>
>>> Cc: Deng, Emily <Emily.Deng at amd.com>
>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>
>>>
>>>
>>> On 15.09.22 at 06:02, Zhao, Victor wrote:
>>>>
>>>> Ping.
>>>>
>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>
>>>> We found some reset-related issues during a stress test of this
>>>> sequence. Please help give some comments.
>>>>
>>>>
>>>> Thanks,
>>>> Victor
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Victor Zhao <Victor.Zhao at amd.com>
>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Cc: Deng, Emily <Emily.Deng at amd.com>; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>
>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>
>>>> [background]
>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>> the drm_sched_stop stage. The signaled bit will not be cleared.
>>>>
>>>> At the resubmit stage after recovery, a job with its hw fence
>>>> signaled bit set will call job done directly instead of going
>>>> through fence processing.
>>>> This makes hw_rq_count decrease, but the rcu fence pointer is
>>>> not cleared yet.
>>>>
>>>> Then an overflow happens in the fence driver slots, and some jobs
>>>> may be skipped, leaving their rcu pointers uncleared, which makes
>>>> an infinite wait for the slot on the next fence emitted.
>>>>
>>>> This infinite wait causes a job timeout on the emitting job, and
>>>> the driver gets stuck at its sched stop step because kthread_park
>>>> cannot be done.
>>>>
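A toy model of the emit-side wait this ends in (slot count and names are made up for illustration; in the driver it is amdgpu_fence_emit() waiting on the fence left behind in the reused slot):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_FENCES 8u

struct toy_fence {
        bool signaled;
};

static struct toy_fence *slots[NUM_FENCES];

/* Emitting seqno n reuses slot (n & (NUM_FENCES - 1)) and must first
 * wait out the fence already stored there. Returns false where the
 * driver would block indefinitely. */
static bool toy_emit(uint32_t seqno, struct toy_fence *f)
{
        uint32_t idx = seqno & (NUM_FENCES - 1u);
        struct toy_fence *old = slots[idx];

        if (old && !old->signaled)
                return false;   /* the dma_fence_wait() that never returns */

        slots[idx] = f;
        return true;
}

int main(void)
{
        static struct toy_fence stale;  /* fence skipped by the overflow */
        static struct toy_fence next;

        slots[3] = &stale;
        if (!toy_emit(11, &next))       /* 11 & 7 == 3: hits the stale slot */
                printf("emit would wait forever on the skipped fence\n");
        return 0;
}
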
>>>> [how]
>>>> 1. Move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
>>>>    before drm sched stop.
>>>> 2. Handle all fences in fence processing to avoid skips when the
>>>>    overflow happens.
>>>>
>>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 943c9e750575..c0cfae52f12b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>      }
>>>>
>>>> -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> -
>>>>      /* block all schedulers and reset given job's ring */
>>>>      for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>          struct amdgpu_ring *ring = adev->rings[i];
>>>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>              amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>              amdgpu_ras_suspend(tmp_adev);
>>>>
>>>> +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>> +
>>>>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>              struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>
>>>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>          atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>      }
>>>>
>>>> -    if (need_emergency_restart)
>>>> +    if (need_emergency_restart) {
>>>> +        list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>> +        }
>>>>          goto skip_sched_resume;
>>>> +    }
>>>>
>>>>      /*
>>>>       * Must check guilty signal here since after this point all old
>>>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>      if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>          job_signaled = true;
>>>>          dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>>> +        list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>> +        }
>>>>          goto skip_hw_reset;
>>>>      }
>>>>
>>>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>          if (r && r == -EAGAIN) {
>>>>              set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>>              adev->asic_reset_res = 0;
>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>              goto retry;
>>>>          }
>>>>      }
>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>      set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>      set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>
>>>> +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> +
>>>>      adev->no_hw_access = true;
>>>>      r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>      adev->no_hw_access = false;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>>>      if (unlikely(seq == last_seq))
>>>>          return false;
>>>>
>>>> -    last_seq &= drv->num_fences_mask;
>>>> -    seq &= drv->num_fences_mask;
>>>> -
>>>>      do {
>>>>          struct dma_fence *fence, **ptr;
>>>>
>>>>          ++last_seq;
>>>> -        last_seq &= drv->num_fences_mask;
>>>> -        ptr = &drv->fences[last_seq];
>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>
>>>>          /* There is always exactly one thread signaling this fence slot */
>>>>          fence = rcu_dereference_protected(*ptr, 1);
>>> These changes don't seem to make sense. Please explain
>>> further why they are necessary.
>>>
>>> Christian.
>>>
>>>> -- 
>>>> 2.25.1
>

