[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

Thu Jun 23 05:57:28 UTC 2022

Am 22.06.22 um 19:31 schrieb Andrey Grodzovsky:
> Just a ping

You need to give me at least some time to look into this.

>
> Andrey
>
> On 2022-06-21 15:45, Andrey Grodzovsky wrote:
>>
>> On 2022-06-21 03:25, Christian König wrote:
>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> After we start handling timed out jobs we assume there fences won't be
>>>> signaled but we cannot be sure and sometimes they fire late. We need
>>>> to prevent concurrent accesses to fence array from
>>>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>>>> amdgpu_fence_process
>>>> from a late EOP interrupt.
>>>>
>>>> Fix:
>>>> Before accessing fence array in GPU disable EOP interrupt and flush
>>>> all pending interrupt handlers for amdgpu device's interrupt line.
>>>
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 
>>>> ++++++++++++++++++++++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>>   3 files changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 2b92281dd0c1..c99541685804 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>       }
>>>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> +
>>>>       /* block all schedulers and reset given job's ring */
>>>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>           struct amdgpu_ring *ring = adev->rings[i];
>>>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>           amdgpu_fence_driver_force_completion(ring);
>>>>       }
>>>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>>>> +
>>>>       if (job && job->vm)
>>>>           drm_sched_increase_karma(&job->base);
>>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct 
>>>> amdgpu_device *adev)
>>>>       }
>>>>   }
>>>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>>> bool stop)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +        if (!ring || !ring->fence_drv.initialized || 
>>>> !ring->fence_drv.irq_src)
>>>> +            continue;
>>>> +
>>>> +        if (stop)
>>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>> +                           ring->fence_drv.irq_type);
>>>> +        else
>>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>>> +                    ring->fence_drv.irq_type);
>>>
>>> That won't work like this. This increments/decrements the reference 
>>> count for the IRQ, but doesn't guarantee in any way that they are 
>>> stopped/started.
>>
>>
>> I understand that, i just assumed that the fence driver is the only 
>> holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?

I'm not 100% sure of that. The original idea was to enable/disable 
interrupt sources as they came along.

>> I can disable amdgpu interrupt line totally using disable_irq - would 
>> this be better ?

Yes, that's what I thought as well.

>>
>>
>>>
>>>
>>>> +    }
>>>> +
>>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe 
>>>> local_irq_save
>>>> +     * local_irq_local_irq_restore are needed here for local 
>>>> interrupts ?
>>>> +     *
>>>> +     */
>>>
>>> Well that comment made me smile. Think for a moment what the local 
>>> CPU would be doing if an interrupt would run :)
>>
>>
>> No, I understand this of course, I am ok to be interrupted by 
>> interrupt handler at this point, what i am trying to do
>> is to prevent amdgpu_fence_process to run concurrently with 
>> amdgpu_fence_driver_clear_job_fences - that is what this
>> function is trying to prevent - i disable and flush pending EOP ISR 
>> handlers before the call to clear fences and re-enable after.

That should be sufficient. When a local interrupt would run the code 
here wouldn't be executing.

>> I guess we can also introduce a spinlock to serialize them ? Yiqing 
>> reported seeing a race between them so we have to do something.

A spinlock would be an absolute NAK because we have been working quite 
hard to remove them (for multiple reasons).

Christian.

>>
>> Andrey
>>
>>
>>>
>>> Cheers,
>>> Christian.
>>>
>>>> +    if (stop)
>>>> +        synchronize_irq(adev->irq.irq);
>>>> +}
>>>> +
>>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>   {
>>>>       unsigned int i, j;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 7d89a52091c0..82c178a9033a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>>>> amdgpu_ring *ring,
>>>>                         uint32_t wait_seq,
>>>>                         signed long timeout);
>>>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>>> bool stop);
>>>>     /*
>>>>    * Rings.
>>>