[PATCH 1/5] drm/amdgpu: handle IH ring1 overflow

Christian König ckoenig.leichtzumerken at gmail.com
Thu Nov 11 07:00:49 UTC 2021


On 2021-11-11 00:36, Felix Kuehling wrote:
> On 2021-11-10 9:31 a.m., Christian König wrote:
>> On 2021-11-10 2:59 p.m., Philip Yang wrote:
>>>
>>> On 2021-11-10 5:15 a.m., Christian König wrote:
>>>
>>>> [SNIP]
>>>
>>> It is hard to understand, so this debug log explains the details. 
>>> With the following debug message patch applied:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> index ed6f8d24280b..8859f2bb11b1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> @@ -234,10 +235,12 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
>>>                 return IRQ_NONE;
>>>
>>>         wptr = amdgpu_ih_get_wptr(adev, ih);
>>> +       if (ih == &adev->irq.ih1)
>>> +               pr_debug("entering rptr 0x%x, wptr 0x%x\n", ih->rptr, wptr);
>>>
>>>  restart_ih:
>>> +       if (ih == &adev->irq.ih1)
>>> +               pr_debug("starting rptr 0x%x, wptr 0x%x\n", ih->rptr, wptr);
>>>
>>>         /* Order reading of wptr vs. reading of IH ring data */
>>>         rmb();
>>> @@ -245,8 +248,12 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
>>>         while (ih->rptr != wptr && --count) {
>>>                 amdgpu_irq_dispatch(adev, ih);
>>>                 ih->rptr &= ih->ptr_mask;
>>> +               if (ih == &adev->irq.ih1) {
>>> +                       pr_debug("rptr 0x%x, old wptr 0x%x, new wptr 0x%x\n",
>>> +                               ih->rptr, wptr,
>>> +                               amdgpu_ih_get_wptr(adev, ih));
>>> +               }
>>>         }
>>>
>>>         amdgpu_ih_set_rptr(adev, ih);
>>> @@ -257,6 +264,8 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
>>>         if (wptr != ih->rptr)
>>>                 goto restart_ih;
>>>
>>> +       if (ih == &adev->irq.ih1)
>>> +               pr_debug("exiting rptr 0x%x, wptr 0x%x\n", ih->rptr, wptr);
>>>         return IRQ_HANDLED;
>>>  }
>>>
>>> Here is the log: at timestamp 48.807028 the ring1 drain is done 
>>> (rptr == wptr, ring1 is empty), but the loop continues and handles an 
>>> outdated retry fault.
>>>
>>
>> As far as I can see that is perfectly correct and expected behavior.
>>
>> See, the ring buffer overflowed, and because of that the loop 
>> continues. But that is correct, because an overflow means that the 
>> ring was refilled with new entries.
>>
>> So we are processing new entries here, not stale ones.
>
> Aren't we processing interrupts out-of-order in this case? We're 
> processing newer ones before older ones. Is that the root of the 
> problem, because it confuses our interrupt draining function?

Good point.

> Maybe we need to detect overflows in the interrupt draining function 
> to make it wait longer in that case.

Ideally we should use something which is completely separate from all 
those implementation details.

Like for example using the timestamp or a separate indicator/counter 
instead.

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>>> [   48.802231] amdgpu_ih_process:243: amdgpu: starting rptr 0x520, wptr 0xd20
>>> [   48.802235] amdgpu_ih_process:254: amdgpu: rptr 0x540, old wptr 0xd20, new wptr 0xd20
>>> [   48.802256] amdgpu_ih_process:254: amdgpu: rptr 0x560, old wptr 0xd20, new wptr 0xd20
>>> [   48.802260] amdgpu_ih_process:254: amdgpu: rptr 0x580, old wptr 0xd20, new wptr 0xd20
>>> [   48.802281] amdgpu_ih_process:254: amdgpu: rptr 0x5a0, old wptr 0xd20, new wptr 0xd20
>>> [   48.802314] amdgpu_ih_process:254: amdgpu: rptr 0x5c0, old wptr 0xd20, new wptr 0xce0
>>> [   48.802335] amdgpu_ih_process:254: amdgpu: rptr 0x5e0, old wptr 0xd20, new wptr 0xce0
>>> [   48.802356] amdgpu_ih_process:254: amdgpu: rptr 0x600, old wptr 0xd20, new wptr 0xce0
>>> [   48.802376] amdgpu_ih_process:254: amdgpu: rptr 0x620, old wptr 0xd20, new wptr 0xce0
>>> [   48.802396] amdgpu_ih_process:254: amdgpu: rptr 0x640, old wptr 0xd20, new wptr 0xce0
>>> [   48.802401] amdgpu_ih_process:254: amdgpu: rptr 0x660, old wptr 0xd20, new wptr 0xce0
>>> [   48.802421] amdgpu_ih_process:254: amdgpu: rptr 0x680, old wptr 0xd20, new wptr 0xce0
>>> [   48.802442] amdgpu_ih_process:254: amdgpu: rptr 0x6a0, old wptr 0xd20, new wptr 0xce0
>>> [   48.802463] amdgpu_ih_process:254: amdgpu: rptr 0x6c0, old wptr 0xd20, new wptr 0xce0
>>> [   48.802483] amdgpu_ih_process:254: amdgpu: rptr 0x6e0, old wptr 0xd20, new wptr 0xce0
>>> [   48.802503] amdgpu_ih_process:254: amdgpu: rptr 0x700, old wptr 0xd20, new wptr 0xce0
>>> [   48.802523] amdgpu_ih_process:254: amdgpu: rptr 0x720, old wptr 0xd20, new wptr 0xce0
>>> [   48.802544] amdgpu_ih_process:254: amdgpu: rptr 0x740, old wptr 0xd20, new wptr 0xce0
>>> [   48.802565] amdgpu_ih_process:254: amdgpu: rptr 0x760, old wptr 0xd20, new wptr 0xce0
>>> [   48.802569] amdgpu_ih_process:254: amdgpu: rptr 0x780, old wptr 0xd20, new wptr 0xce0
>>> [   48.804392] amdgpu_ih_process:254: amdgpu: rptr 0x7a0, old wptr 0xd20, new wptr 0xf00
>>> [   48.806122] amdgpu_ih_process:254: amdgpu: rptr 0x7c0, old wptr 0xd20, new wptr 0x840
>>> [   48.806155] amdgpu_ih_process:254: amdgpu: rptr 0x7e0, old wptr 0xd20, new wptr 0x840
>>> [   48.806965] amdgpu_ih_process:254: amdgpu: rptr 0x800, old wptr 0xd20, new wptr 0x840
>>> [   48.806995] amdgpu_ih_process:254: amdgpu: rptr 0x820, old wptr 0xd20, new wptr 0x840
>>> [   48.807028] amdgpu_ih_process:254: amdgpu: rptr 0x840, old wptr 0xd20, new wptr 0x840
>>> [   48.807063] amdgpu_ih_process:254: amdgpu: rptr 0x860, old wptr 0xd20, new wptr 0x840
>>> [   48.808421] amdgpu_ih_process:254: amdgpu: rptr 0x880, old wptr 0xd20, new wptr 0x840
>>>
>>> This causes the GPU VM fault dump below, because the address was 
>>> unmapped from the CPU.
>>>
>>> [   48.807071] svm_range_restore_pages:2617: amdgpu: restoring svms 0x00000000733bf007 fault address 0x7f8a6991f
>>>
>>> [   48.807170] svm_range_restore_pages:2631: amdgpu: failed to find prange svms 0x00000000733bf007 address [0x7f8a6991f]
>>> [   48.807179] svm_range_get_range_boundaries:2348: amdgpu: VMA does not exist in address [0x7f8a6991f]
>>> [   48.807185] svm_range_restore_pages:2635: amdgpu: failed to create unregistered range svms 0x00000000733bf007 address [0x7f8a6991f]
>>>
>>> [   48.807929] amdgpu 0000:25:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32770, for process kfdtest pid 3969 thread kfdtest pid 3969)
>>> [   48.808219] amdgpu 0000:25:00.0: amdgpu:   in page starting at address 0x00007f8a6991f000 from IH client 0x12 (VMC)
>>> [   48.808230] amdgpu 0000:25:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00800031
>>>
>>>> We could of course parameterize that so that we check the wptr 
>>>> after each IV on IH1, but please don't hard-code it like this.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>       }
>>>>>         amdgpu_ih_set_rptr(adev, ih);
>>>>
>>


