<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:"MS Gothic";
panose-1:2 11 6 9 7 2 5 8 2 4;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"Microsoft JhengHei";
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:"\@Microsoft JhengHei";}
@font-face
{font-family:"\@MS Gothic";
panose-1:2 11 6 9 7 2 5 8 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
p.msipheader251902e5, li.msipheader251902e5, div.msipheader251902e5
{mso-style-name:msipheader251902e5;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="msipheader251902e5" style="margin:0in"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">What about changing the lock-hive logic to something like this:<o:p></o:p></p>
<p class="MsoNormal"> If (this device locked) return;<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:.5in">Lock hive -> lock this device.<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="text-indent:.5in">In the regular flow, lock everything in the list except this device.
<o:p></o:p></p>
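<p class="MsoNormal">A rough userspace sketch of that ordering, for illustration only. The <code>demo_*</code> names, the atomic flags standing in for the real locks, and the choice to count peers that fail to lock are all assumptions made here, not the amdgpu implementation.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real amdgpu structures. */
struct demo_dev {
	atomic_int locked;	/* 0 = free, 1 = owned by a reset */
};

struct demo_hive {
	atomic_int locked;	/* models the hive-level lock */
	struct demo_dev *devs;
	size_t ndevs;
};

/* Atomically flip 0 -> 1; returns false if the flag was already set. */
bool demo_trylock(atomic_int *flag)
{
	int expected = 0;

	return atomic_compare_exchange_strong(flag, &expected, 1);
}

/*
 * "Lock hive" step: return early if this device is already locked,
 * otherwise take the hive lock and then this device's lock.
 */
bool demo_lock_hive(struct demo_hive *hive, struct demo_dev *self)
{
	if (atomic_load(&self->locked))
		return false;
	if (!demo_trylock(&hive->locked))
		return false;
	if (!demo_trylock(&self->locked)) {
		atomic_store(&hive->locked, 0);	/* lost a race: undo */
		return false;
	}
	return true;
}

/*
 * Regular flow: lock everything in the list except this device.
 * Returns how many peers could not be locked (0 is the expected case).
 */
size_t demo_lock_peers(struct demo_hive *hive, struct demo_dev *self)
{
	size_t failed = 0;

	for (size_t i = 0; i < hive->ndevs; i++) {
		if (&hive->devs[i] == self)
			continue;
		if (!demo_trylock(&hive->devs[i].locked))
			failed++;
	}
	return failed;
}
```

<p class="MsoNormal">A second caller entering <code>demo_lock_hive</code> for any device in the same hive returns early, either on the per-device check or on the hive flag, so only one caller reaches the peer loop.</p>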
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">Lijo<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> amd-gfx <amd-gfx-bounces@lists.freedesktop.org>
<b>On Behalf Of </b>Andrey Grodzovsky<br>
<b>Sent:</b> Tuesday, January 19, 2021 10:45 PM<br>
<b>To:</b> Chen, Horace <Horace.Chen@amd.com>; amd-gfx@lists.freedesktop.org<br>
<b>Cc:</b> Xiao, Jack <Jack.Xiao@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Quan, Evan <Evan.Quan@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>;
Liu, Monk <Monk.Liu@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com><br>
<b>Subject:</b> Re: <span style="font-family:"MS Gothic"">Re</span>: <span style="font-family:"MS Gothic"">
Re</span>: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p>Well, from browsing the code, this shouldn't happen while the hive is locked, but then your code should<br>
reflect that: if you do fail to lock a particular adev AFTER the hive is locked, you should not silently break<br>
out of the iteration but raise an error (WARN_ON or BUG_ON). Alternatively, bail out and unlock all the<br>
devices you have already locked.<o:p></o:p></p>
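<p>The "bail out and unlock" alternative could look roughly like the userspace sketch below. The <code>toy_*</code> names and the atomic flag used as a per-device lock are assumptions for illustration, not the driver's actual locking primitives.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not the real struct amdgpu_device. */
struct toy_dev {
	atomic_int in_reset;	/* 0 = free, 1 = held by a reset path */
};

/* Try to take the per-device lock; returns false if it is already held. */
bool toy_dev_trylock(struct toy_dev *d)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&d->in_reset, &expected, 1);
}

void toy_dev_unlock(struct toy_dev *d)
{
	atomic_store(&d->in_reset, 0);
}

/*
 * Lock every device or none. On the first failure, release the
 * devices locked so far in reverse order and report the error.
 */
int toy_hive_lock_all(struct toy_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!toy_dev_trylock(&devs[i])) {
			while (i-- > 0)
				toy_dev_unlock(&devs[i]);
			return -1;
		}
	}
	return 0;
}
```

<p>With this shape the caller sees either all devices locked or none, so the failure path has nothing left to clean up.</p>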
<p><o:p> </o:p></p>
<p>Andrey<o:p></o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 1/19/21 12:09 PM, Chen, Horace wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p style="margin:5.0pt"><span style="font-family:"Arial",sans-serif;color:#0078D7">[AMD Official Use Only - Internal Distribution Only]<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">OK, I understand. You mean one device in the hive may be locked up independently without locking up the whole hive.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">It could happen, I'll change my code.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Thanks & Regards,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Horace.<o:p></o:p></span></p>
</div>
<div>
<div id="appendonsend">
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div class="MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="divRplyFwdMsg">
<p class="MsoNormal"><b><span style="font-family:"Microsoft JhengHei",sans-serif;color:black">From</span><span style="color:black">:</span></b><span style="color:black"> Grodzovsky, Andrey
<a href="mailto:Andrey.Grodzovsky@amd.com"><Andrey.Grodzovsky@amd.com></a><br>
</span><b><span style="font-family:"Microsoft JhengHei",sans-serif;color:black">Sent</span><span style="color:black">:</span></b><span style="color:black"> January 20, 2021
0:58<br>
</span><b><span style="font-family:"MS Gothic";color:black">To</span><span style="color:black">:</span></b><span style="color:black"> Chen, Horace
<a href="mailto:Horace.Chen@amd.com"><Horace.Chen@amd.com></a>; <a href="mailto:amd-gfx@lists.freedesktop.org">
amd-gfx@lists.freedesktop.org</a> <a href="mailto:amd-gfx@lists.freedesktop.org">
<amd-gfx@lists.freedesktop.org></a><br>
</span><b><span style="font-family:"MS Gothic";color:black">Cc</span><span style="color:black">:</span></b><span style="color:black"> Quan, Evan
<a href="mailto:Evan.Quan@amd.com"><Evan.Quan@amd.com></a>; Tuikov, Luben <a href="mailto:Luben.Tuikov@amd.com">
<Luben.Tuikov@amd.com></a>; Koenig, Christian <a href="mailto:Christian.Koenig@amd.com">
<Christian.Koenig@amd.com></a>; Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com">
<Alexander.Deucher@amd.com></a>; Xiao, Jack <a href="mailto:Jack.Xiao@amd.com"><Jack.Xiao@amd.com></a>; Zhang, Hawking
<a href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a>; Liu, Monk <a href="mailto:Monk.Liu@amd.com">
<Monk.Liu@amd.com></a>; Xu, Feifei <a href="mailto:Feifei.Xu@amd.com"><Feifei.Xu@amd.com></a>; Wang, Kevin(Yang)
<a href="mailto:Kevin1.Wang@amd.com"><Kevin1.Wang@amd.com></a>; Xiaojie Yuan <a href="mailto:xiaojie.yuan@amd.com">
<xiaojie.yuan@amd.com></a><br>
</span><b><span style="font-family:"MS Gothic";color:black">Subject</span><span style="color:black">:</span></b><span style="color:black"> Re:
</span><span style="font-family:"MS Gothic";color:black">Re</span><span style="color:black">: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout</span>
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 1/19/21 11:39 AM, Chen, Horace wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p style="margin:5.0pt"><span style="font-family:"Arial",sans-serif;color:#0078D7">[AMD Official Use Only - Internal Distribution Only]<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Hi Andrey,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">I think the list in the XGMI hive won't be broken in the middle if we lock the devices before we change the list. If 2 devices in 1 hive enter the function, both will follow the same sequence
to lock the devices, so one of them will definitely break at the first device. I added the device iteration here just to lock all devices in the hive, since we change the device order in the hive soon afterwards.<o:p></o:p></span></p>
</div>
</div>
</blockquote>
<p><o:p> </o:p></p>
<p>I didn't mean break in the sense of breaking the list itself, I just meant the literal 'break' statement<br>
that terminates the iteration once you fail to lock a particular device. <o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">The only reason for the iteration to break in the middle would be that the list was changed during the iteration without taking any lock. That would be quite bad, since it is exactly the kind of issue I'm fixing. And for an XGMI hive,
there are 2 locks protecting the list: one is the device lock I changed here, the other is the hive->lock taken just before my change to protect the hive.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Even if the bad thing really happened, I think walking back through the list would also be very dangerous, since we don't know what the list will finally look like, unless we record the devices we have iterated through
in a mirrored list. That would be a big change.<o:p></o:p></span></p>
</div>
</div>
</blockquote>
<p><o:p> </o:p></p>
<p>Not sure we are on the same page. My concern is: let's say your XGMI hive consists of 2 devices, and you managed to successfully call<br>
amdgpu_device_lock_adev for dev1 but then failed for dev2. In this case you will bail out without releasing dev1, no?<o:p></o:p></p>
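<p>The two-device scenario can be reproduced with a small userspace model. The <code>mock_*</code> names and the atomic flag per device are assumptions for illustration; only the loop shape (stop at the first failed lock, release nothing) follows the patch under discussion.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not the real struct amdgpu_device. */
struct mock_dev {
	atomic_int locked;	/* 0 = free, 1 = held */
};

/* Try to take the per-device lock; returns false if it is already held. */
bool mock_dev_trylock(struct mock_dev *d)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&d->locked, &expected, 1);
}

/*
 * Same shape as the loop being reviewed: stop at the first device
 * that cannot be locked and release nothing on the way out.
 */
bool mock_lock_until_failure(struct mock_dev *devs, size_t n)
{
	bool got = false;

	for (size_t i = 0; i < n; i++) {
		got = mock_dev_trylock(&devs[i]);
		if (!got)
			break;
	}
	return got;
}

/* Count the devices that are still marked as locked. */
size_t mock_count_locked(struct mock_dev *devs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (atomic_load(&devs[i].locked))
			count++;
	return count;
}
```

<p>If dev2 is already held by someone else, <code>mock_lock_until_failure</code> returns false while dev1 stays locked, which is the leak described above.</p>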
<p><o:p> </o:p></p>
<p>Andrey<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">I'm OK with separating the locking in
</span><span style="color:black">amdgpu_device_lock_adev here. I'll do some tests and update the code later.</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:black">Thanks & Regards,</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:black">Horace.</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<div class="MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_divRplyFwdMsg">
<p class="MsoNormal"><b><span style="font-family:"Microsoft JhengHei",sans-serif;color:black">From</span><span style="color:black">:</span></b><span style="color:black"> Grodzovsky, Andrey
<a href="mailto:Andrey.Grodzovsky@amd.com"><Andrey.Grodzovsky@amd.com></a><br>
</span><b><span style="font-family:"Microsoft JhengHei",sans-serif;color:black">Sent</span><span style="color:black">:</span></b><span style="color:black"> January 19, 2021
22:33<br>
</span><b><span style="font-family:"MS Gothic";color:black">To</span><span style="color:black">:</span></b><span style="color:black"> Chen, Horace
<a href="mailto:Horace.Chen@amd.com"><Horace.Chen@amd.com></a>; <a href="mailto:amd-gfx@lists.freedesktop.org">
amd-gfx@lists.freedesktop.org</a> <a href="mailto:amd-gfx@lists.freedesktop.org">
<amd-gfx@lists.freedesktop.org></a><br>
</span><b><span style="font-family:"MS Gothic";color:black">Cc</span><span style="color:black">:</span></b><span style="color:black"> Quan, Evan
<a href="mailto:Evan.Quan@amd.com"><Evan.Quan@amd.com></a>; Tuikov, Luben <a href="mailto:Luben.Tuikov@amd.com">
<Luben.Tuikov@amd.com></a>; Koenig, Christian <a href="mailto:Christian.Koenig@amd.com">
<Christian.Koenig@amd.com></a>; Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com">
<Alexander.Deucher@amd.com></a>; Xiao, Jack <a href="mailto:Jack.Xiao@amd.com"><Jack.Xiao@amd.com></a>; Zhang, Hawking
<a href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a>; Liu, Monk <a href="mailto:Monk.Liu@amd.com">
<Monk.Liu@amd.com></a>; Xu, Feifei <a href="mailto:Feifei.Xu@amd.com"><Feifei.Xu@amd.com></a>; Wang, Kevin(Yang)
<a href="mailto:Kevin1.Wang@amd.com"><Kevin1.Wang@amd.com></a>; Xiaojie Yuan <a href="mailto:xiaojie.yuan@amd.com">
<xiaojie.yuan@amd.com></a><br>
</span><b><span style="font-family:"MS Gothic";color:black">Subject</span><span style="color:black">:</span></b><span style="color:black"> Re: [PATCH 1/2] drm/amdgpu: race issue
when jobs on 2 ring timeout</span> <o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"><br>
On 1/19/21 7:22 AM, Horace Chen wrote:<br>
> Fix a racing issue when jobs on 2 rings timeout simultaneously.<br>
><br>
> If 2 rings timed out at the same time, the amdgpu_device_gpu_recover<br>
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed<br>
> by 2 local linked list, which may cause wild pointer issue in<br>
> iterating.<br>
><br>
> Lock the device early to prevent the node from being added to 2 different<br>
> lists.<br>
><br>
> Signed-off-by: Horace Chen <a href="mailto:horace.chen@amd.com"><horace.chen@amd.com></a><br>
> ---<br>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------<br>
> 1 file changed, 30 insertions(+), 12 deletions(-)<br>
><br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> index 4d434803fb49..9574da3abc32 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
> int i, r = 0;<br>
> bool need_emergency_restart = false;<br>
> bool audio_suspended = false;<br>
> + bool get_dev_lock = false;<br>
> <br>
> /*<br>
> * Special case: RAS triggered and full reset isn't supported<br>
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
> * Build list of devices to reset.<br>
> * In case we are in XGMI hive mode, resort the device list<br>
> * to put adev in the 1st position.<br>
> + *<br>
> + * lock the device before we try to operate the linked list<br>
> + * if didn't get the device lock, don't touch the linked list since<br>
> + * others may iterating it.<br>
> */<br>
> INIT_LIST_HEAD(&device_list);<br>
> if (adev->gmc.xgmi.num_physical_nodes > 1) {<br>
> if (!hive)<br>
> return -ENODEV;<br>
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))<br>
> - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);<br>
> - device_list_handle = &hive->device_list;<br>
> +<br>
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {<br>
> + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);<br>
> + if (!get_dev_lock)<br>
> + break;<br>
<br>
<br>
What about unlocking all the devices you already locked if the break<br>
happens in the middle of the iteration?<br>
Note that at skip_recovery: we don't do it. BTW, I see this issue is already in<br>
the current code.<br>
<br>
Also, maybe now it's better to separate the actual locking in<br>
amdgpu_device_lock_adev from the other stuff going on there, since I don't think<br>
you would want to toggle stuff like adev->mp1_state back and forth, and the<br>
function name is not descriptive of the other stuff going on there anyway.<br>
<br>
Andrey<br>
<br>
<br>
> + }<br>
> + if (get_dev_lock) {<br>
> + if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))<br>
> + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);<br>
> + device_list_handle = &hive->device_list;<br>
> + }<br>
> } else {<br>
> - list_add_tail(&adev->gmc.xgmi.head, &device_list);<br>
> - device_list_handle = &device_list;<br>
> + get_dev_lock = amdgpu_device_lock_adev(adev, hive);<br>
> + tmp_adev = adev;<br>
> + if (get_dev_lock) {<br>
> + list_add_tail(&adev->gmc.xgmi.head, &device_list);<br>
> + device_list_handle = &device_list;<br>
> + }<br>
> + }<br>
> +<br>
> + if (!get_dev_lock) {<br>
> + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",<br>
> + job ? job->base.id : -1);<br>
> + r = 0;<br>
> + /* even we skipped this reset, still need to set the job to guilty */<br>
> + goto skip_recovery;<br>
> }<br>
> <br>
> /* block all schedulers and reset given job's ring */<br>
> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {<br>
> - if (!amdgpu_device_lock_adev(tmp_adev, hive)) {<br>
> - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",<br>
> - job ? job->base.id : -1);<br>
> - r = 0;<br>
> - goto skip_recovery;<br>
> - }<br>
> -<br>
> /*<br>
> * Try to put the audio codec into suspend state<br>
> * before gpu reset started.<o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</body>
</html>