[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

Tue Jun 21 19:45:35 UTC 2022

On 2022-06-21 03:25, Christian König wrote:
> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>> Problem:
>> After we start handling timed out jobs we assume there fences won't be
>> signaled but we cannot be sure and sometimes they fire late. We need
>> to prevent concurrent accesses to fence array from
>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>> amdgpu_fence_process
>> from a late EOP interrupt.
>>
>> Fix:
>> Before accessing fence array in GPU disable EOP interrupt and flush
>> all pending interrupt handlers for amdgpu device's interrupt line.
>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 ++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>   3 files changed, 31 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 2b92281dd0c1..c99541685804 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>>           amdgpu_virt_fini_data_exchange(adev);
>>       }
>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>>       /* block all schedulers and reset given job's ring */
>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>           struct amdgpu_ring *ring = adev->rings[i];
>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>>           amdgpu_fence_driver_force_completion(ring);
>>       }
>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>> +
>>       if (job && job->vm)
>>           drm_sched_increase_karma(&job->base);
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index a9ae3beaa1d3..d6d54ba4c185 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct 
>> amdgpu_device *adev)
>>       }
>>   }
>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>> bool stop)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>> +        struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +        if (!ring || !ring->fence_drv.initialized || 
>> !ring->fence_drv.irq_src)
>> +            continue;
>> +
>> +        if (stop)
>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>> +                           ring->fence_drv.irq_type);
>> +        else
>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>> +                    ring->fence_drv.irq_type);
>
> That won't work like this. This increments/decrements the reference 
> count for the IRQ, but doesn't guarantee in any way that they are 
> stopped/started.

I understand that, i just assumed that the fence driver is the only 
holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
I can disable amdgpu interrupt line totally using disable_irq - would 
this be better ?

>
>
>> +    }
>> +
>> +    /* TODO Only waits for irq handlers on other CPUs, maybe 
>> local_irq_save
>> +     * local_irq_local_irq_restore are needed here for local 
>> interrupts ?
>> +     *
>> +     */
>
> Well that comment made me smile. Think for a moment what the local CPU 
> would be doing if an interrupt would run :)

No, I understand this of course, I am ok to be interrupted by interrupt 
handler at this point, what i am trying to do
is to prevent amdgpu_fence_process to run concurrently with 
amdgpu_fence_driver_clear_job_fences - that is what this
function is trying to prevent - i disable and flush pending EOP ISR 
handlers before the call to clear fences and re-enable after.
I guess we can also introduce a spinlock to serialize them ? Yiqing 
reported seeing a race between them so we have to do something.

Andrey

>
> Cheers,
> Christian.
>
>> +    if (stop)
>> +        synchronize_irq(adev->irq.irq);
>> +}
>> +
>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>   {
>>       unsigned int i, j;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index 7d89a52091c0..82c178a9033a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>> amdgpu_ring *ring,
>>                         uint32_t wait_seq,
>>                         signed long timeout);
>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool 
>> stop);
>>     /*
>>    * Rings.
>