[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Jun 21 19:45:35 UTC 2022
On 2022-06-21 03:25, Christian König wrote:
> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>> Problem:
>> After we start handling timed out jobs we assume there fences won't be
>> signaled but we cannot be sure and sometimes they fire late. We need
>> to prevent concurrent accesses to fence array from
>> amdgpu_fence_driver_clear_job_fences during GPU reset and
>> amdgpu_fence_process
>> from a late EOP interrupt.
>>
>> Fix:
>> Before accessing fence array in GPU disable EOP interrupt and flush
>> all pending interrupt handlers for amdgpu device's interrupt line.
>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 26 ++++++++++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
>> 3 files changed, 31 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 2b92281dd0c1..c99541685804 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct
>> amdgpu_device *adev,
>> amdgpu_virt_fini_data_exchange(adev);
>> }
>> + amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>> /* block all schedulers and reset given job's ring */
>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> struct amdgpu_ring *ring = adev->rings[i];
>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct
>> amdgpu_device *adev,
>> amdgpu_fence_driver_force_completion(ring);
>> }
>> + amdgpu_fence_driver_isr_toggle(adev, false);
>> +
>> if (job && job->vm)
>> drm_sched_increase_karma(&job->base);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index a9ae3beaa1d3..d6d54ba4c185 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct
>> amdgpu_device *adev)
>> }
>> }
>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev,
>> bool stop)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>> + struct amdgpu_ring *ring = adev->rings[i];
>> +
>> + if (!ring || !ring->fence_drv.initialized ||
>> !ring->fence_drv.irq_src)
>> + continue;
>> +
>> + if (stop)
>> + amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>> + ring->fence_drv.irq_type);
>> + else
>> + amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>> + ring->fence_drv.irq_type);
>
> That won't work like this. This increments/decrements the reference
> count for the IRQ, but doesn't guarantee in any way that they are
> stopped/started.
I understand that, i just assumed that the fence driver is the only
holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
I can disable amdgpu interrupt line totally using disable_irq - would
this be better ?
>
>
>> + }
>> +
>> + /* TODO Only waits for irq handlers on other CPUs, maybe
>> local_irq_save
>> + * local_irq_local_irq_restore are needed here for local
>> interrupts ?
>> + *
>> + */
>
> Well that comment made me smile. Think for a moment what the local CPU
> would be doing if an interrupt would run :)
No, I understand this of course, I am ok to be interrupted by interrupt
handler at this point, what i am trying to do
is to prevent amdgpu_fence_process to run concurrently with
amdgpu_fence_driver_clear_job_fences - that is what this
function is trying to prevent - i disable and flush pending EOP ISR
handlers before the call to clear fences and re-enable after.
I guess we can also introduce a spinlock to serialize them ? Yiqing
reported seeing a race between them so we have to do something.
Andrey
>
> Cheers,
> Christian.
>
>> + if (stop)
>> + synchronize_irq(adev->irq.irq);
>> +}
>> +
>> void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>> {
>> unsigned int i, j;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index 7d89a52091c0..82c178a9033a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct
>> amdgpu_ring *ring,
>> uint32_t wait_seq,
>> signed long timeout);
>> unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool
>> stop);
>> /*
>> * Rings.
>
More information about the amd-gfx
mailing list