[PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Fri Sep 16 13:50:09 UTC 2022
On 2022-09-16 01:18, Christian König wrote:
> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>
>> On 2022-09-15 15:26, Christian König wrote:
>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> The test sequence triggers a compute engine hang while running
>>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode
>>>>> and mode2 reset enabled on the driver.
>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>>>> pending list may be signaled after drm_sched_stop. So they will not
>>>>> be removed from the pending list but will have the
>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>> be rerun and removed from the pending list.
>>>>> At the resubmit step, the second job (with the signaled bit) will be
>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>>>> will be called directly. This decreases the hw_rq_count, which
>>>>> allows more jobs to be emitted, but does not clear the fence_drv rcu ptr.
>>>>> This results in an overflow in the fence_drv. Since we use
>>>>> num_fences_mask in amdgpu_fence_process, when the overflow happens,
>>>>> the signal of some job will be skipped, which results in an infinite
>>>>> wait for the fence_drv rcu ptr.
>>>>>
>>>>> So closing the irq before sched_stop could avoid jobs signaling after
>>>>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>>>>> of using a mask will handle the overflow situation.
>>>>>
>>>>> Another fix could be to skip submitting jobs which have already signaled
>>>>> during the resubmit stage, which may look cleaner.
>>>>>
>>>>> Please help give some advice.
>>>>
>>>>
>>>> How about the code below instead? The real problem is that we
>>>> reuse a dma fence twice, which is not according to the dma fence design,
>>>> so maybe this can help?
>>>>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>>>      if (job && job->job_run_counter) {
>>>>          /* reinit seq for resubmitted jobs */
>>>>          fence->seqno = seq;
>>>> +
>>>> +        /* For a resubmitted job clear the signaled bit */
>>>> +        clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>>>> +
>>>
>>> Upstream will pretty much kill you for that.
>>>
>>> Re-setting a fence from a signaled to an unsignaled state is a
>>> massive no-go.
>>>
>>> Christian.
>>
>>
>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>> hack, no?
>
> No, it's equally bad. I don't think we can do either.
>
> Christian.
And all those ugly hacks are there because we reuse a dma_fence (the hw_fence
embedded into the job), and correct me if I am wrong,
but I don't think a dma_fence is ever supposed to be reused.
So maybe, like Victor suggested, we should move the irq close and flush before
sched_stop - this in my opinion should solve the issue. But Victor - why
do you then still need the change in amdgpu_fence_process? You will not
have the overflow situation because, by moving irq_disable before the stop,
any job that signaled will be removed from the scheduler pending list
anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence
numbers' and could reintroduce that bug.
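Just to illustrate the concern, here is a rough userspace sketch of the loop
shape in amdgpu_fence_process - not the actual driver code, and the slot count
and seq values are made up - showing why masking last_seq/seq with
num_fences_mask keeps the walk bounded even when the seq readback is garbage:

#include <stdint.h>
#include <stdio.h>

#define NUM_FENCES       8u                 /* assumed power of two, like the fence_drv ring */
#define NUM_FENCES_MASK  (NUM_FENCES - 1u)

/* Walk from last_seq towards seq, roughly the shape of the driver loop. */
static unsigned int walk(uint32_t last_seq, uint32_t seq, int sanitize)
{
	unsigned int iterations = 0;

	if (sanitize) {
		/* Masking both values up front bounds the walk to NUM_FENCES
		 * slots, even if seq came back as garbage after a reset. */
		last_seq &= NUM_FENCES_MASK;
		seq &= NUM_FENCES_MASK;
	}

	while (last_seq != seq) {
		++last_seq;
		if (sanitize)
			last_seq &= NUM_FENCES_MASK;
		/* the real code would signal drv->fences[last_seq & mask] here */
		++iterations;
	}

	return iterations;
}

int main(void)
{
	/* With a bogus readback the sanitized walk stays below NUM_FENCES,
	 * while the unsanitized version would spin for billions of steps. */
	printf("sanitized walk:   %u iterations\n", walk(3, 0xdeadbeef, 1));
	printf("unsanitized walk: %u iterations\n", 0xdeadbeefu - 3u);
	return 0;
}

That bounded walk is exactly what 'drm/amdgpu: sanitize fence numbers' added,
which is why I am worried about dropping the masking from the loop.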
Andrey
>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>>          /* TO be inline with external fence creation and other drivers */
>>>>          dma_fence_get(fence);
>>>>      } else {
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>> To: Zhao, Victor <Victor.Zhao at amd.com>;
>>>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey
>>>>> <Andrey.Grodzovsky at amd.com>
>>>>> Cc: Deng, Emily <Emily.Deng at amd.com>
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>
>>>>>
>>>>>
>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Ping.
>>>>>>
>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>
>>>>>> We found some reset related issues during stress test on the
>>>>>> sequence. Please help give some comments.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Victor Zhao <Victor.Zhao at amd.com>
>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>> Cc: Deng, Emily <Emily.Deng at amd.com>; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>
>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>
>>>>>> [background]
>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>> the drm_sched_stop stage. The signaled bit will not be cleared.
>>>>>>
>>>>>> At the resubmit stage after recovery, a job with the hw fence
>>>>>> signaled bit set will call job done directly instead of going
>>>>>> through fence process.
>>>>>> This makes the hw_rq_count decrease but the rcu fence pointer is
>>>>>> not cleared yet.
>>>>>>
>>>>>> Then an overflow happens in the fence driver slots and some jobs may
>>>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>>>> infinite wait for the slot on the next fence emitted.
>>>>>>
>>>>>> This infinite wait causes a job timeout on the emitting job. And the
>>>>>> driver will be stuck at its sched stop step because kthread_park
>>>>>> cannot be done.
>>>>>>
>>>>>> [how]
>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupts
>>>>>>    before drm sched stop
>>>>>> 2. handle all fences in fence process to avoid skips when overflow
>>>>>>    happens
>>>>>>
>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>>>>> 2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>      }
>>>>>>
>>>>>> -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> -
>>>>>>      /* block all schedulers and reset given job's ring */
>>>>>>      for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>          struct amdgpu_ring *ring = adev->rings[i];
>>>>>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>              amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>              amdgpu_ras_suspend(tmp_adev);
>>>>>>
>>>>>> +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>> +
>>>>>>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>              struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>          atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>      }
>>>>>>
>>>>>> -    if (need_emergency_restart)
>>>>>> +    if (need_emergency_restart) {
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>          goto skip_sched_resume;
>>>>>> +    }
>>>>>>
>>>>>>      /*
>>>>>>       * Must check guilty signal here since after this point all old
>>>>>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>      if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>          job_signaled = true;
>>>>>>          dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>          goto skip_hw_reset;
>>>>>>      }
>>>>>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>          if (r && r == -EAGAIN) {
>>>>>>              set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>>>>              adev->asic_reset_res = 0;
>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>              goto retry;
>>>>>>          }
>>>>>>      }
>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>      set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>      set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>
>>>>>> +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> +
>>>>>>      adev->no_hw_access = true;
>>>>>>      r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>      adev->no_hw_access = false;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>>>>>      if (unlikely(seq == last_seq))
>>>>>>          return false;
>>>>>>
>>>>>> -    last_seq &= drv->num_fences_mask;
>>>>>> -    seq &= drv->num_fences_mask;
>>>>>> -
>>>>>>      do {
>>>>>>          struct dma_fence *fence, **ptr;
>>>>>>
>>>>>>          ++last_seq;
>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>
>>>>>>          /* There is always exactly one thread signaling this fence slot */
>>>>>>          fence = rcu_dereference_protected(*ptr, 1);
>>>>> Those changes here don't seem to make sense. Please explain
>>>>> further why they are necessary.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> --
>>>>>> 2.25.1
>>>
>