[PATCH 1/5] drm/amdgpu: handle IH ring1 overflow

Thu Nov 11 13:43:01 UTC 2021

Am 11.11.21 um 13:13 schrieb Felix Kuehling:
> Am 2021-11-11 um 2:00 a.m. schrieb Christian König:
>> Am 11.11.21 um 00:36 schrieb Felix Kuehling:
>>> On 2021-11-10 9:31 a.m., Christian König wrote:
>>> [SNIP]
>>> Aren't we processing interrupts out-of-order in this case. We're
>>> processing newer ones before older ones. Is that the root of the
>>> problem because it confuses our interrupt draining function?
>> Good point.
>>
>>> Maybe we need to detect overflows in the interrupt draining function
>>> to make it wait longer in that case.
>> Ideally we should use something which is completely separate from all
>> those implementation details.
>>
>> Like for example using the timestamp or a separate indicator/counter
>> instead.
> Even a timestamp will be broken if the interrupt processing function
> handles interrupts out of order.

We can easily detect if the timestamp is going backwards and so filter 
out stale entries.

> I think we need a different amdgpu_ih_process function for IH ring 1 the
> way we set it up to handle overflows. Because IH is just overwriting
> older entries, and we can't read entire IH entries atomically, we have
> to use a watermark. If IH WPTR passes the watermark, we have to consider
> the ring overflowed and reset our RPTR. We have to set RPTR far enough
> "ahead" of the current WPTR to make sure WPTR is under the watermark.
> And the watermark needs to be low enough to ensure amdgpu_irq_dispatch
> can read out the next IH entry before the WPTR catches up with the RPTR.
>
> Since we don't read the WPTR on every iteration, and out page fault
> handling code can take quite a while to process one fault, the watermark
> needs to provide a lot of buffer. Maybe we also need to read the WPTR
> register more frequently if the last read was more than a jiffy ago.

I think trying to solve that with the IH code or hardware is the 
completely wrong approach.

As I said before we need to something more robust and using the 
timestamp sounds like the most logical approach to me.

The only alternative I can see would be to have a hardware assisted flag 
which tells you if you had an overflow or not like we have for IH ring 0.

E.g. something like the following:
1. Copy the next N IV from the RPTR location.
2. Get the current WPTR.
3. If the overflow bit in the WPTR is set update the RPTR to something 
like WPTR+window, clear the overflow bit and repeat at #1.
4. Process the valid IVs.

The down side is that we are loosing a lot of IVs with that. That is 
probably ok for the current approach, but most likely a bad idea if we 
enable the CAM.

Regards,
Christian.

>
> Whenever an overflow (over the watermark) is detected, we can set a
> sticky overflow bit that our page fault handling code can use to clean
> up. E.g. once we start using the CAM filter, we'll have to invalidate
> all CAM entries when this happens (although I'd expect overflows to
> become impossible once we enable the CAM filter).
>
> Thanks,
>    Felix
>