[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

Wed Jun 22 01:47:19 UTC 2022

Hi,

Is this a preventive fix or you found errors/oops/hangs?
If you had found errors/oops/hangs, can you please share the details?

BR,
Chandan V N

>On 2022-06-21 03:25, Christian König wrote:
>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>> Problem:
>>> After we start handling timed out jobs we assume there fences won't 
>>> be signaled but we cannot be sure and sometimes they fire late. We 
>>> need to prevent concurrent accesses to fence array from 
>>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>>> amdgpu_fence_process from a late EOP interrupt.
>>>
>>> Fix:
>>> Before accessing fence array in GPU disable EOP interrupt and flush 
>>> all pending interrupt handlers for amdgpu device's interrupt line.
>>
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 
>>> ++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>   3 files changed, 31 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 2b92281dd0c1..c99541685804 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct
>>> amdgpu_device *adev,
>>>           amdgpu_virt_fini_data_exchange(adev);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>>> +
>>>       /* block all schedulers and reset given job's ring */
>>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>           struct amdgpu_ring *ring = adev->rings[i]; @@ -4620,6 
>>> +4622,8 @@ int amdgpu_device_pre_asic_reset(struct
>>> amdgpu_device *adev,
>>>           amdgpu_fence_driver_force_completion(ring);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>>> +
>>>       if (job && job->vm)
>>>           drm_sched_increase_karma(&job->base);
>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct
>>> amdgpu_device *adev)
>>>       }
>>>   }
>>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>> bool stop)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +        if (!ring || !ring->fence_drv.initialized ||
>>> !ring->fence_drv.irq_src)
>>> +            continue;
>>> +
>>> +        if (stop)
>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>> +                           ring->fence_drv.irq_type);
>>> +        else
>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>> +                    ring->fence_drv.irq_type);
>>
>> That won't work like this. This increments/decrements the reference 
>> count for the IRQ, but doesn't guarantee in any way that they are 
>> stopped/started.
>
>
>I understand that, i just assumed that the fence driver is the only holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
>I can disable amdgpu interrupt line totally using disable_irq - would this be better ?
>
>
>>
>>
>>> +    }
>>> +
>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe
>>> local_irq_save
>>> +     * local_irq_local_irq_restore are needed here for local
>>> interrupts ?
>>> +     *
>>> +     */
>>
>> Well that comment made me smile. Think for a moment what the local CPU 
>> would be doing if an interrupt would run :)
>
>
>No, I understand this of course, I am ok to be interrupted by interrupt handler at this point, what i am trying to do is to prevent amdgpu_fence_process to run concurrently with amdgpu_fence_driver_clear_job_fences - that is what this function is trying to prevent - i disable and >flush pending EOP ISR handlers before the call to clear fences and re-enable after.
>I guess we can also introduce a spinlock to serialize them ? Yiqing reported seeing a race between them so we have to do something.
>
>Andrey
>
>
>>
>> Cheers,
>> Christian.
>>
>>> +    if (stop)
>>> +        synchronize_irq(adev->irq.irq); }
>>> +
>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>   {
>>>       unsigned int i, j;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index 7d89a52091c0..82c178a9033a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>>> amdgpu_ring *ring,
>>>                         uint32_t wait_seq,
>>>                         signed long timeout);
>>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool
>>> stop);
>>>     /*
>>>    * Rings.
>>