<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Well, it shouldn't happen with the hive locked as I am browsing
the code but then your code should<br>
reflect that and if you do fail to lock particular adev AFTER the
hive is locked you should not silently break<br>
iteration but throw an error, WARN_ON or BUG_ON then. Or
alternatively bail out with unlocking all already<br>
locked devices.<br>
</p>
<p><br>
</p>
<p>Andrey<br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 1/19/21 12:09 PM, Chen, Horace
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CY4PR12MB1573CFBA40A3D36BF69C32FFE1A30@CY4PR12MB1573.namprd12.prod.outlook.com">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
<p style="font-family:Arial;font-size:11pt;color:#0078D7;margin:5pt;" align="Left">
[AMD Official Use Only - Internal Distribution Only]<br>
</p>
<br>
<div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
OK, I understand. You mean one device in the hive may be
locked up independently without locking up the whole hive.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
It could happen, I'll change my code.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Thanks & Regards,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Horace.<br>
</div>
<div>
<div id="appendonsend"><br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>发件人:</b>
Grodzovsky, Andrey <a class="moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com"><Andrey.Grodzovsky@amd.com></a><br>
<b>发送时间:</b> 2021年1月20日 0:58<br>
<b>收件人:</b> Chen, Horace <a class="moz-txt-link-rfc2396E" href="mailto:Horace.Chen@amd.com"><Horace.Chen@amd.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>
<a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx@lists.freedesktop.org"><amd-gfx@lists.freedesktop.org></a><br>
<b>抄送:</b> Quan, Evan <a class="moz-txt-link-rfc2396E" href="mailto:Evan.Quan@amd.com"><Evan.Quan@amd.com></a>; Tuikov,
Luben <a class="moz-txt-link-rfc2396E" href="mailto:Luben.Tuikov@amd.com"><Luben.Tuikov@amd.com></a>; Koenig, Christian
<a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; Deucher, Alexander
<a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a>; Xiao, Jack
<a class="moz-txt-link-rfc2396E" href="mailto:Jack.Xiao@amd.com"><Jack.Xiao@amd.com></a>; Zhang, Hawking
<a class="moz-txt-link-rfc2396E" href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a>; Liu, Monk
<a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>; Xu, Feifei
<a class="moz-txt-link-rfc2396E" href="mailto:Feifei.Xu@amd.com"><Feifei.Xu@amd.com></a>; Wang, Kevin(Yang)
<a class="moz-txt-link-rfc2396E" href="mailto:Kevin1.Wang@amd.com"><Kevin1.Wang@amd.com></a>; Xiaojie Yuan
<a class="moz-txt-link-rfc2396E" href="mailto:xiaojie.yuan@amd.com"><xiaojie.yuan@amd.com></a><br>
<b>主题:</b> Re: 回复: [PATCH 1/2] drm/amdgpu: race issue when
jobs on 2 ring timeout</font>
<div> </div>
</div>
<div>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<div class="x_moz-cite-prefix">On 1/19/21 11:39 AM, Chen,
Horace wrote:<br>
</div>
<blockquote type="cite">
<p style="margin-top: 0px; margin-bottom:
0px;font-family:Arial; font-size:11pt; color:#0078D7;
margin:5pt" align="Left">
[AMD Official Use Only - Internal Distribution Only]<br>
</p>
<br>
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
Hi Andrey,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
I think the list in the XGMI hive won't be break in
the middle if we lock the device before we change the
list. Because if 2 devices in 1 hive went into the
function, it will follow the same sequence to lock the
devices. So one of them will definately break at the
first device. I add iterate devices here is just to
lock all device in the hive since we will change the
device sequence in the hive soon after.</div>
</div>
</blockquote>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<p style="margin-top: 0px; margin-bottom: 0px;">I didn't
mean break in a sense of breaking the list itself, I just
meant the literal 'break' instruction<br>
to terminate the iteration once you failed to lock a
particular device. <br>
</p>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<blockquote type="cite">
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
The reason to break the interation in the middle is
that the list is changed during the iteration without
taking any lock. It is quite bad since I'm fixing one
of this issue. And for XGMI hive, there are 2 locks
protecting the list, one is the device lock I changed
here, the other one is in front of my change, there is
a hive->lock to protect the hive.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
Even the bad thing really happened, I think moving
back through the list is also very dengerous since we
don't know what the list finally be, Unless we stack
the devices we have iterated through a mirrored list.
That can be a big change.<br>
</div>
</div>
</blockquote>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<p style="margin-top: 0px; margin-bottom: 0px;">Not sure we
are on the same page, my concern is let's sat your XGMI
hive consists of 2 devices, you manged to call
successfully do<font size="2"><span style="font-size:11pt"><br>
amdgpu_device_lock_adev for dev1 but then failed for
dev2, in this case you will bail out </span></font>
without releasing dev1, no ?</p>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<p style="margin-top: 0px; margin-bottom: 0px;">Andrey<br>
</p>
<p style="margin-top: 0px; margin-bottom: 0px;"><br>
</p>
<blockquote type="cite">
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
I'm ok to seperate the locking in <font size="2"><span style="font-size:11pt">amdgpu_device_lock_adev
here, I'll do some test and update the code later.<br>
</span></font></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<font size="2"><span style="font-size:11pt"><br>
</span></font></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<font size="2"><span style="font-size:11pt">Thanks
& Regards,</span></font></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0)">
<font size="2"><span style="font-size:11pt">Horace.</span></font><br>
</div>
<div>
<hr tabindex="-1" style="display:inline-block;
width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>发件人:</b> Grodzovsky, Andrey
<a class="x_moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true"><Andrey.Grodzovsky@amd.com></a><br>
<b>发送时间:</b> 2021年1月19日 22:33<br>
<b>收件人:</b> Chen, Horace <a class="x_moz-txt-link-rfc2396E" href="mailto:Horace.Chen@amd.com" moz-do-not-send="true">
<Horace.Chen@amd.com></a>; <a class="x_moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">
amd-gfx@lists.freedesktop.org</a> <a class="x_moz-txt-link-rfc2396E" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">
<amd-gfx@lists.freedesktop.org></a><br>
<b>抄送:</b> Quan, Evan <a class="x_moz-txt-link-rfc2396E" href="mailto:Evan.Quan@amd.com" moz-do-not-send="true">
<Evan.Quan@amd.com></a>; Tuikov, Luben <a class="x_moz-txt-link-rfc2396E" href="mailto:Luben.Tuikov@amd.com" moz-do-not-send="true">
<Luben.Tuikov@amd.com></a>; Koenig,
Christian <a class="x_moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
<Christian.Koenig@amd.com></a>; Deucher,
Alexander <a class="x_moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">
<Alexander.Deucher@amd.com></a>; Xiao,
Jack <a class="x_moz-txt-link-rfc2396E" href="mailto:Jack.Xiao@amd.com" moz-do-not-send="true">
<Jack.Xiao@amd.com></a>; Zhang, Hawking <a class="x_moz-txt-link-rfc2396E" href="mailto:Hawking.Zhang@amd.com" moz-do-not-send="true">
<Hawking.Zhang@amd.com></a>; Liu, Monk <a class="x_moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com" moz-do-not-send="true">
<Monk.Liu@amd.com></a>; Xu, Feifei <a class="x_moz-txt-link-rfc2396E" href="mailto:Feifei.Xu@amd.com" moz-do-not-send="true">
<Feifei.Xu@amd.com></a>; Wang, Kevin(Yang)
<a class="x_moz-txt-link-rfc2396E" href="mailto:Kevin1.Wang@amd.com" moz-do-not-send="true">
<Kevin1.Wang@amd.com></a>; Xiaojie Yuan <a class="x_moz-txt-link-rfc2396E" href="mailto:xiaojie.yuan@amd.com" moz-do-not-send="true">
<xiaojie.yuan@amd.com></a><br>
<b>主题:</b> Re: [PATCH 1/2] drm/amdgpu: race issue
when jobs on 2 ring timeout</font>
<div> </div>
</div>
<div class="x_BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="x_PlainText"><br>
On 1/19/21 7:22 AM, Horace Chen wrote:<br>
> Fix a racing issue when jobs on 2 rings
timeout simultaneously.<br>
><br>
> If 2 rings timed out at the same time,
the amdgpu_device_gpu_recover<br>
> will be reentered. Then the
adev->gmc.xgmi.head will be grabbed<br>
> by 2 local linked list, which may cause
wild pointer issue in<br>
> iterating.<br>
><br>
> lock the device earily to prevent the
node be added to 2 different<br>
> lists.<br>
><br>
> Signed-off-by: Horace Chen <a class="x_moz-txt-link-rfc2396E" href="mailto:horace.chen@amd.com" moz-do-not-send="true">
<horace.chen@amd.com></a><br>
> ---<br>
>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
42 +++++++++++++++-------<br>
> 1 file changed, 30 insertions(+), 12
deletions(-)<br>
><br>
> diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> index 4d434803fb49..9574da3abc32 100644<br>
> ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> @@ -4540,6 +4540,7 @@ int
amdgpu_device_gpu_recover(struct amdgpu_device
*adev,<br>
> int i, r = 0;<br>
> bool need_emergency_restart =
false;<br>
> bool audio_suspended = false;<br>
> + bool get_dev_lock = false;<br>
> <br>
> /*<br>
> * Special case: RAS triggered and
full reset isn't supported<br>
> @@ -4582,28 +4583,45 @@ int
amdgpu_device_gpu_recover(struct amdgpu_device
*adev,<br>
> * Build list of devices to reset.<br>
> * In case we are in XGMI hive
mode, resort the device list<br>
> * to put adev in the 1st
position.<br>
> + *<br>
> + * lock the device before we try to
operate the linked list<br>
> + * if didn't get the device lock,
don't touch the linked list since<br>
> + * others may iterating it.<br>
> */<br>
> INIT_LIST_HEAD(&device_list);<br>
> if
(adev->gmc.xgmi.num_physical_nodes > 1)
{<br>
> if (!hive)<br>
> return -ENODEV;<br>
> - if
(!list_is_first(&adev->gmc.xgmi.head,
&hive->device_list))<br>
> -
list_rotate_to_front(&adev->gmc.xgmi.head,
&hive->device_list);<br>
> - device_list_handle =
&hive->device_list;<br>
> +<br>
> +
list_for_each_entry(tmp_adev,
&hive->device_list, gmc.xgmi.head) {<br>
> + get_dev_lock =
amdgpu_device_lock_adev(tmp_adev, hive);<br>
> + if (!get_dev_lock)<br>
> + break;<br>
<br>
<br>
What about unlocking back all the devices you
already locked if the break<br>
happens in the middle of the iteration ?<br>
Note that at skip_recovery: we don't do it.
BTW, i see this issue is already in <br>
the current code.<br>
<br>
Also, maybe now it's better to separate the
actual locking in <br>
amdgpu_device_lock_adev<br>
from the other stuff going on there since I
don't think you would wont to toggle <br>
stuff<br>
like adev->mp1_state back and forth and
also the function name is not <br>
descriptive of<br>
the other stuff going on there anyway.<br>
<br>
Andrey<br>
<br>
<br>
> + }<br>
> + if (get_dev_lock) {<br>
> + if
(!list_is_first(&adev->gmc.xgmi.head,
&hive->device_list))<br>
> +
list_rotate_to_front(&adev->gmc.xgmi.head,
&hive->device_list);<br>
> + device_list_handle
= &hive->device_list;<br>
> + }<br>
> } else {<br>
> -
list_add_tail(&adev->gmc.xgmi.head,
&device_list);<br>
> - device_list_handle =
&device_list;<br>
> + get_dev_lock =
amdgpu_device_lock_adev(adev, hive);<br>
> + tmp_adev = adev;<br>
> + if (get_dev_lock) {<br>
> +
list_add_tail(&adev->gmc.xgmi.head,
&device_list);<br>
> + device_list_handle
= &device_list;<br>
> + }<br>
> + }<br>
> +<br>
> + if (!get_dev_lock) {<br>
> + dev_info(tmp_adev->dev,
"Bailing on TDR for s_job:%llx, as another
already in progress",<br>
> + job
? job->base.id : -1);<br>
> + r = 0;<br>
> + /* even we skipped this
reset, still need to set the job to guilty */<br>
> + goto skip_recovery;<br>
> }<br>
> <br>
> /* block all schedulers and reset
given job's ring */<br>
> list_for_each_entry(tmp_adev,
device_list_handle, gmc.xgmi.head) {<br>
> - if
(!amdgpu_device_lock_adev(tmp_adev, hive)) {<br>
> -
dev_info(tmp_adev->dev, "Bailing on TDR for
s_job:%llx, as another already in progress",<br>
> - job ?
job->base.id : -1);<br>
> - r = 0;<br>
> - goto skip_recovery;<br>
> - }<br>
> -<br>
> /*<br>
> * Try to put the audio
codec into suspend state<br>
> * before gpu reset
started.<br>
</div>
</span></font></div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</body>
</html>