<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:503861270;
mso-list-type:hybrid;
mso-list-template-ids:1492292582 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1
{mso-list-id:1279491622;
mso-list-type:hybrid;
mso-list-template-ids:-1736673670 67698703 67698689 67698703 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level2
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l1:level3
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-9.0pt;}
@list l1:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2
{mso-list-id:1655448059;
mso-list-type:hybrid;
mso-list-template-ids:-1584207202 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l2:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<p class="msipheader251902e5" align="Left" style="margin:0"><span style="font-size:10.0pt;font-family:Arial;color:#317100">[AMD Public Use]</span></p>
<br>
<div class="WordSection1">
<p class="MsoNormal">Hi all<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">NAVI2X project hit a really hard to solve issue now, and it is turned out to be a general headache of our TDR mechanism , check below scenario:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">There is a job1 running on compute1 ring at timestamp
<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">There is a job2 running on gfx ring at timestamp<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Job1 is the guilty one, and job1/job2 were scheduled to their rings at almost the same timestamp
<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">After 2 seconds we receive two TDR reporting from both GFX ring and compute ring<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1"><b>Current scheme is that in drm scheduler all the head jobs of those two rings are considered “bad job” and taken away from the mirror list
<o:p></o:p></b></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">The result is both the real guilty job (job1) and the innocent job (job2) were all deleted from mirror list, and their corresponding contexts were also treated as guilty<b> (so the
innocent process remains running is not secured)<o:p></o:p></b></li></ol>
<p class="MsoListParagraph"><b><o:p> </o:p></b></p>
<p class="MsoNormal">But by our wish the ideal case is TDR mechanism can detect which ring is the guilty ring and the innocent ring can resubmits all its pending jobs:<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Job1 to be deleted from compute1 ring’s mirror list<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Job2 is kept and resubmitted later and its belonging process/context are even not aware of this TDR at all
<o:p></o:p></li></ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Here I have a proposal tend to achieve above goal and it rough procedure is :<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">Once any ring reports a TDR, the head job is *<b>not</b>* treated as “bad job”, and it is *<b>not</b>* deleted from the mirror list in drm sched functions<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">In vendor’s function (our amdgpu driver here):<o:p></o:p></li><ul style="margin-top:0in" type="disc">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">reset GPU<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">repeat below actions on each RINGS * one by one *:<o:p></o:p></li></ul>
</ol>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3 lfo3">
<![if !supportLists]><span style="mso-list:Ignore">1.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>take the head job and submit it on this ring<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3 lfo3">
<![if !supportLists]><span style="mso-list:Ignore">2.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>see if it completes, if not then this job is the real “bad job”<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.5in;text-indent:-9.0pt;mso-list:l1 level3 lfo3">
<![if !supportLists]><span style="mso-list:Ignore">3.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]> take it away from mirror list if this head job is “bad job”<o:p></o:p></p>
<ol style="margin-top:0in" start="2" type="1">
<ul style="margin-top:0in" type="disc">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level2 lfo3">After above iteration on all RINGS, we already clears all the bad job(s)<o:p></o:p></li></ul>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">Resubmit all jobs from each mirror list to their corresponding rings (this is the existed logic)<o:p></o:p></li></ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The idea of this is to use “serial” way to re-run and re-check each head job of each RING, in order to take out the real black sheep and its guilty context.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">P.S.: we can use this approaches only on GFX/KCQ ring reports TDR , since those rings are intermutually affected to each other. For SDMA ring timeout it definitely proves the head job on SDMA ring is really guilty.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks <o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">------------------------------------------<o:p></o:p></p>
<p class="MsoNormal">Monk Liu | Cloud-GPU Core team<o:p></o:p></p>
<p class="MsoNormal">------------------------------------------<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>