[PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

Fri Sep 16 13:51:32 UTC 2022

Am 16.09.22 um 15:50 schrieb Andrey Grodzovsky:
>
> On 2022-09-16 01:18, Christian König wrote:
>> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is executing a compute engine hang while 
>>>>>> running a lot of containers submitting gfx jobs. We have advanced 
>>>>>> tdr mode and mode2 reset enabled on driver.
>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>>>> pending list maybe signaled after drm_sched_stop. So they will 
>>>>>> not be removed from pending list but have the 
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will 
>>>>>> be rerun and removed from pending list.
>>>>>> At the resubmit setp, the second job (with signaled bit) will be 
>>>>>> resubmitted. Since it still has signaled bit, drm_sched_job_done 
>>>>>> will be called directly. This decrease the hw_rq_count which 
>>>>>> allows more jobs emitted but did not clean fence_drv rcu ptr.
>>>>>> This results in an overflow in the fence_drv. Since we will use 
>>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens, 
>>>>>> the signal of some job will be skipped which result in an 
>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>
>>>>>> So close irq before sched_stop could avoid signal jobs after 
>>>>>> drm_sched_stop. And signal job one by one in fence_process 
>>>>>> instead of using a mask will handle the overflow situation.
>>>>>>
>>>>>> Another fix could be skip submitting jobs which already signaled 
>>>>>> during resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>>
>>>>> How about the code bellow  instead ? The real problem is that we 
>>>>> reuse a dma fence twice which is not according to fma fence 
>>>>> design, so maybe this can help ?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>         if (job && job->job_run_counter) {
>>>>>                 /* reinit seq for resubmitted jobs */
>>>>>                 fence->seqno = seq;
>>>>> +
>>>>> +               /* For resubmitted job clear the singled bit */
>>>>> +               celar_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, 
>>>>> &fence->flags);
>>>>> +
>>>>
>>>> Upstream will pretty much kill you for that.
>>>>
>>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>>> massive no-go.
>>>>
>>>> Christian.
>>>
>>>
>>> Is it worse then doing fence->seqno = seq; ? This is already a huge 
>>> hack , no ?
>>
>> No, it's as equally bad. I don't think we can do either.
>>
>> Christian.
>
>
> And all those ugly hack are there because we reuse a dma_fence 
> (hw_fence embedded into the job) and correct me if I am wrong
> but I don't think dma_fence is ever supposed to be reused.

Exactly that, yes.

Christian.

>
> So maybe like Victor suggested we should move close and flush irq 
> before sched_stop - this in my opinion should solve the issue, but 
> Victor - why then you still need the change in amdgpu_fence_process ? 
> You will not have the overflow situation because by moving irq_disable 
> before stop any job that signaled will be removed from the scheduler 
> pending list anyway. Also not that this change reverts 'drm/amdgpu: 
> sanitize fence numbers' and could reintroduce that bug.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>>                 /* TO be inline with external fence creation and 
>>>>> other drivers */
>>>>>                 dma_fence_get(fence);
>>>>>         } else {
>>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>> To: Zhao, Victor <Victor.Zhao at amd.com>; 
>>>>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey 
>>>>>> <Andrey.Grodzovsky at amd.com>
>>>>>> Cc: Deng, Emily <Emily.Deng at amd.com>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>
>>>>>>> We found some reset related issues during stress test on the 
>>>>>>> sequence. Please help give some comments.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Victor Zhao <Victor.Zhao at amd.com>
>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>> Cc: Deng, Emily <Emily.Deng at amd.com>; Grodzovsky, Andrey
>>>>>>> <Andrey.Grodzovsky at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>
>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>
>>>>>>> [background]
>>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute), 
>>>>>>> jobs from another ring (e.g. gfx) may continue signaling during 
>>>>>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>>
>>>>>>> At the resubmit stage after recovery, the job with hw fence 
>>>>>>> signaled bit set will call job done directly instead go through 
>>>>>>> fence process.
>>>>>>> This makes the hw_rq_count decrease but rcu fence pointer not 
>>>>>>> cleared yet.
>>>>>>>
>>>>>>> Then overflow happens in the fence driver slots and some jobs 
>>>>>>> may be skipped and leave the rcu pointer not cleared which makes 
>>>>>>> an infinite wait for the slot on the next fence emitted.
>>>>>>>
>>>>>>> This infinite wait cause a job timeout on the emitting job. And 
>>>>>>> driver will stuck at the its sched stop step because 
>>>>>>> kthread_park cannot be done.
>>>>>>>
>>>>>>> [how]
>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>>>>> before drm sched stop 2. handle all fences in fence process to 
>>>>>>> aviod
>>>>>>> skip when overflow happens
>>>>>>>
>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 
>>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 
>>>>>>> +-----
>>>>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>>>>        }
>>>>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> -
>>>>>>>        /* block all schedulers and reset given job's ring */
>>>>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>>>>>> *adev,
>>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>> +
>>>>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>> atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>        }
>>>>>>>    -    if (need_emergency_restart)
>>>>>>> +    if (need_emergency_restart) {
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>            goto skip_sched_resume;
>>>>>>> +    }
>>>>>>>           /*
>>>>>>>         * Must check guilty signal here since after this point 
>>>>>>> all old @@ -5240,6 +5244,9 @@ int 
>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>            job_signaled = true;
>>>>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>>>>> skipping HW
>>>>>>> reset");
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>            goto skip_hw_reset;
>>>>>>>        }
>>>>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>            if (r && r == -EAGAIN) {
>>>>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>>>> &reset_context->flags);
>>>>>>>                adev->asic_reset_res = 0;
>>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>                goto retry;
>>>>>>>            }
>>>>>>>        }
>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> +
>>>>>>>        adev->no_hw_access = true;
>>>>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>        adev->no_hw_access = false;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct 
>>>>>>> amdgpu_ring *ring)
>>>>>>>        if (unlikely(seq == last_seq))
>>>>>>>            return false;
>>>>>>>    -    last_seq &= drv->num_fences_mask;
>>>>>>> -    seq &= drv->num_fences_mask;
>>>>>>> -
>>>>>>>        do {
>>>>>>>            struct dma_fence *fence, **ptr;
>>>>>>>               ++last_seq;
>>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>               /* There is always exactly one thread signaling 
>>>>>>> this fence slot */
>>>>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>>>> Those changes here doesn't seem to make sense. Please explain 
>>>>>> further why that is necessary.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> -- 
>>>>>>> 2.25.1
>>>>
>>