[PATCH 2/2] drm/amdgpu: Process fences on IH overflow

Tue Jan 16 07:17:43 UTC 2024

Am 15.01.24 um 12:19 schrieb Friedrich Vock:
> On 15.01.24 11:26, Christian König wrote:
>> Am 14.01.24 um 14:00 schrieb Friedrich Vock:
>>> If the IH ring buffer overflows, it's possible that fence signal events
>>> were lost. Check each ring for progress to prevent job timeouts/GPU
>>> hangs due to the fences staying unsignaled despite the work being done.
>>
>> That's completely unnecessary and in some cases even harmful.
> How is it harmful? The only effect it can have is prevent unnecessary
> GPU hangs, no? It's not like it hides any legitimate errors that you'd
> otherwise see.

We have no guarantee that all ring buffers are actually fully 
initialized to allow fence processing.

Apart from that fence processing is the least of your problems when an 
IV overflow occurs. Other interrupt source which are not repeated are 
usually for more worse.

>>
>> We already have a timeout handler for that and overflows point to
>> severe system problem so they should never occur in a production system.
>
> IH ring buffer overflows are pretty reliably reproducible if you trigger
> a lot of page faults, at least on Deck. Why shouldn't enough page faults
> in quick succession be able to overflow the IH ring buffer?

At least not on recent hw generations. Since gfx9 we have a rate limit 
on the number of page faults generated.

What could maybe do as well is to change the default of vm_fault_stop, 
but for your case that would be even worse in production.

>
> The fence fallback timer as it is now is useless for this because it
> only gets triggered once after 0.5s. I guess an alternative approach
> would be to make a timer trigger for each work item in flight every
> 0.5s, but why should that be better than just handling overflow errors
> as they occur?

That is intentional. As I said an IH overflow just points out that there 
is something massively wrong in the HW programming.

After gfx9 the IH should never produce overflow any more, otherwise 
either the ratelimit doesn't work or isn't enabled for some reason or 
the IH ring buffer is just to small.

Regards,
Christian.

>
> Regards,
> Friedrich
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Cc: Joshua Ashton <joshua at froggi.es>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Cc: stable at vger.kernel.org
>>>
>>> Signed-off-by: Friedrich Vock <friedrich.vock at gmx.de>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> index f3b0aaf3ebc6..2a246db1d3a7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> @@ -209,6 +209,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev,
>>> struct amdgpu_ih_ring *ih)
>>>   {
>>>       unsigned int count;
>>>       u32 wptr;
>>> +    int i;
>>>
>>>       if (!ih->enabled || adev->shutdown)
>>>           return IRQ_NONE;
>>> @@ -227,6 +228,20 @@ int amdgpu_ih_process(struct amdgpu_device
>>> *adev, struct amdgpu_ih_ring *ih)
>>>           ih->rptr &= ih->ptr_mask;
>>>       }
>>>
>>> +    /* If the ring buffer overflowed, we might have lost some fence
>>> +     * signal interrupts. Check if there was any activity so the 
>>> signal
>>> +     * doesn't get lost.
>>> +     */
>>> +    if (ih->overflow) {
>>> +        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +            struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +            if (!ring || !ring->fence_drv.initialized)
>>> +                continue;
>>> +            amdgpu_fence_process(ring);
>>> +        }
>>> +    }
>>> +
>>>       amdgpu_ih_set_rptr(adev, ih);
>>>       wake_up_all(&ih->wait_process);
>>>
>>> -- 
>>> 2.43.0
>>>
>>