[PATCH 1/5] drm/amdgpu: handle IH ring1 overflow
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Nov 11 07:00:49 UTC 2021
Am 11.11.21 um 00:36 schrieb Felix Kuehling:
> On 2021-11-10 9:31 a.m., Christian König wrote:
>> Am 10.11.21 um 14:59 schrieb philip yang:
>>>
>>> On 2021-11-10 5:15 a.m., Christian König wrote:
>>>
>>>> [SNIP]
>>>
>>> It is hard to understand, this debug log can explain more details,
>>> with this debug message patch
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> index ed6f8d24280b..8859f2bb11b1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> @@ -234,10 +235,12 @@ int amdgpu_ih_process(struct amdgpu_device
>>> *adev, struct amdgpu_ih_ring *ih)
>>> return IRQ_NONE;
>>>
>>> wptr = amdgpu_ih_get_wptr(adev, ih);
>>> + if (ih == &adev->irq.ih1)
>>> + pr_debug("entering rptr 0x%x, wptr 0x%x\n",
>>> ih->rptr, wptr);
>>>
>>> restart_ih:
>>> + if (ih == &adev->irq.ih1)
>>> + pr_debug("starting rptr 0x%x, wptr 0x%x\n",
>>> ih->rptr, wptr);
>>>
>>> /* Order reading of wptr vs. reading of IH ring data */
>>> rmb();
>>> @@ -245,8 +248,12 @@ int amdgpu_ih_process(struct amdgpu_device
>>> *adev, struct amdgpu_ih_ring *ih)
>>> while (ih->rptr != wptr && --count) {
>>> amdgpu_irq_dispatch(adev, ih);
>>> ih->rptr &= ih->ptr_mask;
>>> + if (ih == &adev->irq.ih1) {
>>> + pr_debug("rptr 0x%x, old wptr 0x%x, new wptr
>>> 0x%x\n",
>>> + ih->rptr, wptr,
>>> + amdgpu_ih_get_wptr(adev, ih));
>>> + }
>>> }
>>>
>>> amdgpu_ih_set_rptr(adev, ih);
>>> @@ -257,6 +264,8 @@ int amdgpu_ih_process(struct amdgpu_device
>>> *adev, struct amdgpu_ih_ring *ih)
>>> if (wptr != ih->rptr)
>>> goto restart_ih;
>>>
>>> + if (ih == &adev->irq.ih1)
>>> + pr_debug("exiting rptr 0x%x, wptr 0x%x\n", ih->rptr,
>>> wptr);
>>> return IRQ_HANDLED;
>>> }
>>>
>>> This is log, timing 48.807028, ring1 drain is done, rptr == wptr,
>>> ring1 is empty, but the loop continues, to handle outdated retry fault.
>>>
>>
>> As far as I can see that is perfectly correct and expected behavior.
>>
>> See the ring buffer overflowed and because of that the loop
>> continues, but that is correct because an overflow means that the
>> ring was filled with new entries.
>>
>> So we are processing new entries here, not stale ones.
>
> Aren't we processing interrupts out-of-order in this case. We're
> processing newer ones before older ones. Is that the root of the
> problem because it confuses our interrupt draining function?
Good point.
> Maybe we need to detect overflows in the interrupt draining function
> to make it wait longer in that case.
Ideally we should use something which is completely separate from all
those implementation details.
Like for example using the timestamp or a separate indicator/counter
instead.
Regards,
Christian.
>
> Regards,
> Felix
>
>
>>
>> Regards,
>> Christian.
>>
>>> [ 48.802231] amdgpu_ih_process:243: amdgpu: starting rptr 0x520,
>>> wptr 0xd20
>>> [ 48.802235] amdgpu_ih_process:254: amdgpu: rptr 0x540, old wptr
>>> 0xd20, new wptr 0xd20
>>> [ 48.802256] amdgpu_ih_process:254: amdgpu: rptr 0x560, old wptr
>>> 0xd20, new wptr 0xd20
>>> [ 48.802260] amdgpu_ih_process:254: amdgpu: rptr 0x580, old wptr
>>> 0xd20, new wptr 0xd20
>>> [ 48.802281] amdgpu_ih_process:254: amdgpu: rptr 0x5a0, old wptr
>>> 0xd20, new wptr 0xd20
>>> [ 48.802314] amdgpu_ih_process:254: amdgpu: rptr 0x5c0, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802335] amdgpu_ih_process:254: amdgpu: rptr 0x5e0, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802356] amdgpu_ih_process:254: amdgpu: rptr 0x600, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802376] amdgpu_ih_process:254: amdgpu: rptr 0x620, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802396] amdgpu_ih_process:254: amdgpu: rptr 0x640, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802401] amdgpu_ih_process:254: amdgpu: rptr 0x660, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802421] amdgpu_ih_process:254: amdgpu: rptr 0x680, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802442] amdgpu_ih_process:254: amdgpu: rptr 0x6a0, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802463] amdgpu_ih_process:254: amdgpu: rptr 0x6c0, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802483] amdgpu_ih_process:254: amdgpu: rptr 0x6e0, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802503] amdgpu_ih_process:254: amdgpu: rptr 0x700, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802523] amdgpu_ih_process:254: amdgpu: rptr 0x720, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802544] amdgpu_ih_process:254: amdgpu: rptr 0x740, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802565] amdgpu_ih_process:254: amdgpu: rptr 0x760, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.802569] amdgpu_ih_process:254: amdgpu: rptr 0x780, old wptr
>>> 0xd20, new wptr 0xce0
>>> [ 48.804392] amdgpu_ih_process:254: amdgpu: rptr 0x7a0, old wptr
>>> 0xd20, new wptr 0xf00
>>> [ 48.806122] amdgpu_ih_process:254: amdgpu: rptr 0x7c0, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.806155] amdgpu_ih_process:254: amdgpu: rptr 0x7e0, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.806965] amdgpu_ih_process:254: amdgpu: rptr 0x800, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.806995] amdgpu_ih_process:254: amdgpu: rptr 0x820, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.807028] amdgpu_ih_process:254: amdgpu: rptr 0x840, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.807063] amdgpu_ih_process:254: amdgpu: rptr 0x860, old wptr
>>> 0xd20, new wptr 0x840
>>> [ 48.808421] amdgpu_ih_process:254: amdgpu: rptr 0x880, old wptr
>>> 0xd20, new wptr 0x840
>>>
>>> Cause this gpu vm fault dump because address is unmapped from cpu.
>>>
>>> [ 48.807071] svm_range_restore_pages:2617: amdgpu: restoring svms
>>> 0x00000000733bf007 fault address 0x7f8a6991f
>>>
>>> [ 48.807170] svm_range_restore_pages:2631: amdgpu: failed to find
>>> prange svms 0x00000000733bf007 address [0x7f8a6991f]
>>> [ 48.807179] svm_range_get_range_boundaries:2348: amdgpu: VMA does
>>> not exist in address [0x7f8a6991f]
>>> [ 48.807185] svm_range_restore_pages:2635: amdgpu: failed to
>>> create unregistered range svms 0x00000000733bf007 address [0x7f8a6991f]
>>>
>>> [ 48.807929] amdgpu 0000:25:00.0: amdgpu: [mmhub0] retry page
>>> fault (src_id:0 ring:0 vmid:8 pasid:32770, for process kfdtest pid
>>> 3969 thread kfdtest pid 3969)
>>> [ 48.808219] amdgpu 0000:25:00.0: amdgpu: in page starting at
>>> address 0x00007f8a6991f000 from IH client 0x12 (VMC)
>>> [ 48.808230] amdgpu 0000:25:00.0: amdgpu:
>>> VM_L2_PROTECTION_FAULT_STATUS:0x00800031
>>>
>>>> We could of course parameterize that so that we check the wptr
>>>> after each IV on IH1, but please not hard coded like this.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> }
>>>>> amdgpu_ih_set_rptr(adev, ih);
>>>>
>>
More information about the amd-gfx
mailing list