[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
Christian König
ckoenig.leichtzumerken at gmail.com
Tue Jun 21 07:25:13 UTC 2022
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
> Problem:
> After we start handling timed out jobs we assume there fences won't be
> signaled but we cannot be sure and sometimes they fire late. We need
> to prevent concurrent accesses to fence array from
> amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
> from a late EOP interrupt.
>
> Fix:
> Before accessing fence array in GPU disable EOP interrupt and flush
> all pending interrupt handlers for amdgpu device's interrupt line.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 26 ++++++++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
> 3 files changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2b92281dd0c1..c99541685804 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> amdgpu_virt_fini_data_exchange(adev);
> }
>
> + amdgpu_fence_driver_isr_toggle(adev, true);
> +
> /* block all schedulers and reset given job's ring */
> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> struct amdgpu_ring *ring = adev->rings[i];
> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> amdgpu_fence_driver_force_completion(ring);
> }
>
> + amdgpu_fence_driver_isr_toggle(adev, false);
> +
> if (job && job->vm)
> drm_sched_increase_karma(&job->base);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index a9ae3beaa1d3..d6d54ba4c185 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
> }
> }
>
> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop)
> +{
> + int i;
> +
> + for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> + struct amdgpu_ring *ring = adev->rings[i];
> +
> + if (!ring || !ring->fence_drv.initialized || !ring->fence_drv.irq_src)
> + continue;
> +
> + if (stop)
> + amdgpu_irq_put(adev, ring->fence_drv.irq_src,
> + ring->fence_drv.irq_type);
> + else
> + amdgpu_irq_get(adev, ring->fence_drv.irq_src,
> + ring->fence_drv.irq_type);
That won't work like this. This increments/decrements the reference
count for the IRQ, but doesn't guarantee in any way that they are
stopped/started.
> + }
> +
> + /* TODO Only waits for irq handlers on other CPUs, maybe local_irq_save
> + * local_irq_local_irq_restore are needed here for local interrupts ?
> + *
> + */
Well that comment made me smile. Think for a moment what the local CPU
would be doing if an interrupt would run :)
Cheers,
Christian.
> + if (stop)
> + synchronize_irq(adev->irq.irq);
> +}
> +
> void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
> {
> unsigned int i, j;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 7d89a52091c0..82c178a9033a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
> uint32_t wait_seq,
> signed long timeout);
> unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop);
>
> /*
> * Rings.
More information about the amd-gfx
mailing list