<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 2021-11-10 5:15 a.m., Christian
König wrote:<br>
</div>
<blockquote type="cite" cite="mid:1862b795-3401-b89f-089b-4b544957d150@gmail.com">Am
10.11.21 um 00:04 schrieb Philip Yang:
<br>
<blockquote type="cite">IH ring1 is used to process GPU retry
fault, overflow is enabled to
<br>
drain retry fault before unmapping the range, wptr may pass
rptr,
<br>
amdgpu_ih_process should check rptr equals to the latest wptr to
exit,
<br>
otherwise it will continue to recover outdatad retry fault after
drain
<br>
retry fault is done, and generate false GPU vm fault because
range is
<br>
unmapped from cpu.
<br>
<br>
Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
<br>
---
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 7 ++++++-
<br>
1 file changed, 6 insertions(+), 1 deletion(-)
<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
<br>
index f3d62e196901..d1ef61811169 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
<br>
@@ -223,7 +223,7 @@ int
amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device *adev,
<br>
*/
<br>
int amdgpu_ih_process(struct amdgpu_device *adev, struct
amdgpu_ih_ring *ih)
<br>
{
<br>
- unsigned int count = AMDGPU_IH_MAX_NUM_IVS;
<br>
+ unsigned int count;
<br>
u32 wptr;
<br>
if (!ih->enabled || adev->shutdown)
<br>
@@ -232,6 +232,8 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)
<br>
wptr = amdgpu_ih_get_wptr(adev, ih);
<br>
restart_ih:
<br>
+ count = AMDGPU_IH_MAX_NUM_IVS;
<br>
+
<br>
</blockquote>
<br>
This looks like a bugfix to me and should probably be in a
separate patch with CC: stable.
<br>
</blockquote>
ok, will add separate patch for this fix. I think this bug shows up
now for ring1 because retry fault is burst and flooding even after
filter.<br>
<blockquote type="cite" cite="mid:1862b795-3401-b89f-089b-4b544957d150@gmail.com">
<br>
<blockquote type="cite"> DRM_DEBUG("%s: rptr %d, wptr %d\n",
__func__, ih->rptr, wptr);
<br>
/* Order reading of wptr vs. reading of IH ring data */
<br>
@@ -240,6 +242,9 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)
<br>
while (ih->rptr != wptr && --count) {
<br>
amdgpu_irq_dispatch(adev, ih);
<br>
ih->rptr &= ih->ptr_mask;
<br>
+
<br>
+ if (ih == &adev->irq.ih1)
<br>
+ wptr = amdgpu_ih_get_wptr(adev, ih);
<br>
</blockquote>
<br>
Well that handling does not really make much sense.
<br>
<br>
The AMDGPU_IH_MAX_NUM_IVS define controls how many IVs we can
process before checking the wptr again.
<br>
<br>
</blockquote>
<p>It is hard to understand, this debug log can explain more
details, with this debug message patch</p>
<p>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c<br>
index ed6f8d24280b..8859f2bb11b1 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c<br>
@@ -234,10 +235,12 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)<br>
return IRQ_NONE;<br>
<br>
wptr = amdgpu_ih_get_wptr(adev, ih);<br>
+ if (ih == &adev->irq.ih1)<br>
+ pr_debug("entering rptr 0x%x, wptr 0x%x\n",
ih->rptr, wptr);<br>
<br>
restart_ih:<br>
+ if (ih == &adev->irq.ih1)<br>
+ pr_debug("starting rptr 0x%x, wptr 0x%x\n",
ih->rptr, wptr);<br>
<br>
/* Order reading of wptr vs. reading of IH ring data */<br>
rmb();<br>
@@ -245,8 +248,12 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)<br>
while (ih->rptr != wptr && --count) {<br>
amdgpu_irq_dispatch(adev, ih);<br>
ih->rptr &= ih->ptr_mask;<br>
+ if (ih == &adev->irq.ih1) {<br>
+ pr_debug("rptr 0x%x, old wptr 0x%x, new
wptr 0x%x\n",<br>
+ ih->rptr, wptr,<br>
+ amdgpu_ih_get_wptr(adev, ih));<br>
+ }<br>
}<br>
<br>
amdgpu_ih_set_rptr(adev, ih);<br>
@@ -257,6 +264,8 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)<br>
if (wptr != ih->rptr)<br>
goto restart_ih;<br>
<br>
+ if (ih == &adev->irq.ih1)<br>
+ pr_debug("exiting rptr 0x%x, wptr 0x%x\n",
ih->rptr, wptr);<br>
return IRQ_HANDLED;<br>
}<br>
<br>
</p>
<p>This is log, timing 48.807028, ring1 drain is done, rptr == wptr,
ring1 is empty, but the loop continues, to handle outdated retry
fault.<br>
</p>
<p>[ 48.802231] amdgpu_ih_process:243: amdgpu: starting rptr
0x520, wptr 0xd20<br>
[ 48.802235] amdgpu_ih_process:254: amdgpu: rptr 0x540, old wptr
0xd20, new wptr 0xd20<br>
[ 48.802256] amdgpu_ih_process:254: amdgpu: rptr 0x560, old wptr
0xd20, new wptr 0xd20<br>
[ 48.802260] amdgpu_ih_process:254: amdgpu: rptr 0x580, old wptr
0xd20, new wptr 0xd20<br>
[ 48.802281] amdgpu_ih_process:254: amdgpu: rptr 0x5a0, old wptr
0xd20, new wptr 0xd20<br>
[ 48.802314] amdgpu_ih_process:254: amdgpu: rptr 0x5c0, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802335] amdgpu_ih_process:254: amdgpu: rptr 0x5e0, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802356] amdgpu_ih_process:254: amdgpu: rptr 0x600, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802376] amdgpu_ih_process:254: amdgpu: rptr 0x620, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802396] amdgpu_ih_process:254: amdgpu: rptr 0x640, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802401] amdgpu_ih_process:254: amdgpu: rptr 0x660, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802421] amdgpu_ih_process:254: amdgpu: rptr 0x680, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802442] amdgpu_ih_process:254: amdgpu: rptr 0x6a0, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802463] amdgpu_ih_process:254: amdgpu: rptr 0x6c0, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802483] amdgpu_ih_process:254: amdgpu: rptr 0x6e0, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802503] amdgpu_ih_process:254: amdgpu: rptr 0x700, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802523] amdgpu_ih_process:254: amdgpu: rptr 0x720, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802544] amdgpu_ih_process:254: amdgpu: rptr 0x740, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802565] amdgpu_ih_process:254: amdgpu: rptr 0x760, old wptr
0xd20, new wptr 0xce0<br>
[ 48.802569] amdgpu_ih_process:254: amdgpu: rptr 0x780, old wptr
0xd20, new wptr 0xce0<br>
[ 48.804392] amdgpu_ih_process:254: amdgpu: rptr 0x7a0, old wptr
0xd20, new wptr 0xf00<br>
[ 48.806122] amdgpu_ih_process:254: amdgpu: rptr 0x7c0, old wptr
0xd20, new wptr 0x840<br>
[ 48.806155] amdgpu_ih_process:254: amdgpu: rptr 0x7e0, old wptr
0xd20, new wptr 0x840<br>
[ 48.806965] amdgpu_ih_process:254: amdgpu: rptr 0x800, old wptr
0xd20, new wptr 0x840<br>
[ 48.806995] amdgpu_ih_process:254: amdgpu: rptr 0x820, old wptr
0xd20, new wptr 0x840<br>
[ 48.807028] amdgpu_ih_process:254: amdgpu: rptr 0x840, old wptr
0xd20, new wptr 0x840<br>
[ 48.807063] amdgpu_ih_process:254: amdgpu: rptr 0x860, old wptr
0xd20, new wptr 0x840<br>
[ 48.808421] amdgpu_ih_process:254: amdgpu: rptr 0x880, old wptr
0xd20, new wptr 0x840<br>
</p>
<p>Cause this gpu vm fault dump because address is unmapped from
cpu.<br>
</p>
<p>[ 48.807071] svm_range_restore_pages:2617: amdgpu: restoring
svms 0x00000000733bf007 fault address 0x7f8a6991f</p>
<p>[ 48.807170] svm_range_restore_pages:2631: amdgpu: failed to
find prange svms 0x00000000733bf007 address [0x7f8a6991f]<br>
[ 48.807179] svm_range_get_range_boundaries:2348: amdgpu: VMA
does not exist in address [0x7f8a6991f]<br>
[ 48.807185] svm_range_restore_pages:2635: amdgpu: failed to
create unregistered range svms 0x00000000733bf007 address
[0x7f8a6991f]</p>
<p>[ 48.807929] amdgpu 0000:25:00.0: amdgpu: [mmhub0] retry page
fault (src_id:0 ring:0 vmid:8 pasid:32770, for process kfdtest pid
3969 thread kfdtest pid 3969)<br>
[ 48.808219] amdgpu 0000:25:00.0: amdgpu: in page starting at
address 0x00007f8a6991f000 from IH client 0x12 (VMC)<br>
[ 48.808230] amdgpu 0000:25:00.0: amdgpu:
VM_L2_PROTECTION_FAULT_STATUS:0x00800031<br>
<br>
</p>
<blockquote type="cite" cite="mid:1862b795-3401-b89f-089b-4b544957d150@gmail.com">We could
of course parameterize that so that we check the wptr after each
IV on IH1, but please not hard coded like this.
<br>
<br>
Regards,
<br>
Christian.
<br>
<br>
<blockquote type="cite"> }
<br>
amdgpu_ih_set_rptr(adev, ih);
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>