<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Hi Monk,<br>
<br>
in general an interesting idea, but I see two major problems with
that:<br>
<br>
1. It would make the reset take much longer.<br>
<br>
2. Things get often stuck because of timing issues, so a guilty job
might pass perfectly when run a second time.<br>
<br>
Apart from that the whole ring mirror list turned out to be a really
bad idea. E.g. we still struggle with object life time because the
concept doesn't fit into the object model of the GPU scheduler under
Linux.<br>
<br>
We should probably work on this separately and straighten up the job
destruction once more and keep the recovery information in the fence
instead.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<div class="moz-cite-prefix">Am 26.02.21 um 06:58 schrieb Liu, Monk:<br>
</div>
<blockquote type="cite" cite="mid:DM5PR12MB1708D28565B445EABA872A3B849D9@DM5PR12MB1708.namprd12.prod.outlook.com">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:503861270;
mso-list-type:hybrid;
mso-list-template-ids:1492292582 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1
{mso-list-id:1279491622;
mso-list-type:hybrid;
mso-list-template-ids:-1736673670 67698703 67698689 67698703 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level2
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l1:level3
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-9.0pt;}
@list l1:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2
{mso-list-id:1655448059;
mso-list-type:hybrid;
mso-list-template-ids:-1584207202 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l2:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
<p class="msipheader251902e5" style="margin:0" align="Left"><span style="font-size:10.0pt;font-family:Arial;color:#317100">[AMD
Public Use]</span></p>
<br>
<div class="WordSection1">
<p class="MsoNormal">Hi all<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">NAVI2X project hit a really hard to solve
issue now, and it is turned out to be a general headache of
our TDR mechanism , check below scenario:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" type="1" start="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">There is a
job1 running on compute1 ring at timestamp
<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">There is a
job2 running on gfx ring at timestamp<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Job1 is the
guilty one, and job1/job2 were scheduled to their rings at
almost the same timestamp
<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">After 2
seconds we receive two TDR reporting from both GFX ring and
compute ring<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1"><b>Current
scheme is that in drm scheduler all the head jobs of those
two rings are considered “bad job” and taken away from the
mirror list
<o:p></o:p></b></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">The result
is both the real guilty job (job1) and the innocent job
(job2) were all deleted from mirror list, and their
corresponding contexts were also treated as guilty<b> (so
the innocent process remains running is not secured)<o:p></o:p></b></li>
</ol>
<p class="MsoListParagraph"><b><o:p> </o:p></b></p>
<p class="MsoNormal">But by our wish the ideal case is TDR
mechanism can detect which ring is the guilty ring and the
innocent ring can resubmits all its pending jobs:<o:p></o:p></p>
<ol style="margin-top:0in" type="1" start="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Job1 to be
deleted from compute1 ring’s mirror list<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Job2 is kept
and resubmitted later and its belonging process/context are
even not aware of this TDR at all
<o:p></o:p></li>
</ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Here I have a proposal tend to achieve
above goal and it rough procedure is :<o:p></o:p></p>
<ol style="margin-top:0in" type="1" start="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">Once any
ring reports a TDR, the head job is *<b>not</b>* treated as
“bad job”, and it is *<b>not</b>* deleted from the mirror
list in drm sched functions<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">In vendor’s
function (our amdgpu driver here):<o:p></o:p></li>
<ul style="margin-top:0in" type="disc">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">reset GPU<o:p></o:p></li>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">repeat
below actions on each RINGS * one by one *:<o:p></o:p></li>
</ul>
</ol>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3
lfo3">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->take the head job and submit it
on this ring<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3
lfo3">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->see if it completes, if not then
this job is the real “bad job”<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3
lfo3">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]--> take it away from mirror list
if this head job is “bad job”<o:p></o:p></p>
<ol style="margin-top:0in" type="1" start="2">
<ul style="margin-top:0in" type="disc">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">After
above iteration on all RINGS, we already clears all the
bad job(s)<o:p></o:p></li>
</ul>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">Resubmit all
jobs from each mirror list to their corresponding rings
(this is the existed logic)<o:p></o:p></li>
</ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The idea of this is to use “serial” way to
re-run and re-check each head job of each RING, in order to
take out the real black sheep and its guilty context.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">P.S.: we can use this approaches only on
GFX/KCQ ring reports TDR , since those rings are intermutually
affected to each other. For SDMA ring timeout it definitely
proves the head job on SDMA ring is really guilty.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks <o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">------------------------------------------<o:p></o:p></p>
<p class="MsoNormal">Monk Liu | Cloud-GPU Core team<o:p></o:p></p>
<p class="MsoNormal">------------------------------------------<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</blockquote>
<br>
</body>
</html>