<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Am 12.10.2017 um 13:37 schrieb Liu,
Monk:<br>
</div>
<blockquote type="cite"
cite="mid:BLUPR12MB0449B1E12216845D4B52FC63844B0@BLUPR12MB0449.namprd12.prod.outlook.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from rtf -->
<style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
<font size="2" face="Calibri"><span style="font-size:10.5pt;">
<div>Hi team</div>
<div><font face="Times New Roman"> </font></div>
<div>Very good, many policy and implement are agreed, looks we
only have some arguments in amdgpu_ctx_query(), well I also
confused with the current implement of it, <font
face="Segoe UI Emoji">☹</font></div>
<div><font face="Times New Roman"> </font></div>
<div>First, I want to know if you guys agree that we<b> don't
update ctx->reset_counter in amdgpu_ctx_query() </b>?
because I want to make the query result always consistent
upon a given context<font face="Courier New">,</font></div>
</span></font></blockquote>
<br>
That sounds like a good idea to me, but I'm not sure if it won't
break userspace (I don't think so). Nicolai or Marek need to
comment.<br>
<br>
<blockquote type="cite"
cite="mid:BLUPR12MB0449B1E12216845D4B52FC63844B0@BLUPR12MB0449.namprd12.prod.outlook.com"><font
size="2" face="Calibri"><span style="font-size:10.5pt;">
<div><font face="Times New Roman"> </font></div>
<div>Second, I want to say that for KERNEL, it shouldn't use
the term from MESA or OGL or VULKAN, e.g. kernel shouldn't
use AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB,
etc.<font face="Courier New">..</font></div>
<div>Because that way kernel will be very limited to certain
UMD, so I suggest we totally re-name the context status, and
each UMD has its own way to map the kernel context's result
to gl-context/vk-context/etc<font face="Courier New">…</font></div>
</span></font></blockquote>
<br>
Yes, completely agree.<br>
<br>
<blockquote type="cite"
cite="mid:BLUPR12MB0449B1E12216845D4B52FC63844B0@BLUPR12MB0449.namprd12.prod.outlook.com"><font
size="2" face="Calibri"><span style="font-size:10.5pt;">
<div><font face="Times New Roman"> </font></div>
<div>Kernel should only provide below *<b>FLAG bits</b>* on a
given context:</div>
<div><font face="Times New Roman"> </font></div>
<ul style="margin:0;padding-left:21pt;">
<li>Define AMDGPU_CTX_STATE_GUILTY 0x1 //as long
as TDR detects a job hang, KMD set the context behind this
context as "guilty"</li>
<li>Define AMDGPU_CTX_STATE_VRAMLOST 0x2 //as
long as there is a VRAM lost event hit after this context
created, we mark this context "VRAM_LOST", so UMD can say
that all BO under this context may lost their content,
since kernel have no relationship
between context and BO so this is UMD's call to judge if a
BO considered "VRAM lost" or not.</li>
<li>Define AMDGPU_CTX_STATE_RESET 0x3 //as long as
there is a gpu reset occurred after context creation, this
flag shall be set</li>
</ul>
</span></font></blockquote>
<br>
That sounds sane, but unfortunately might not be possible with the
existing IOCTL. Keep in mind that we need to keep backward
compatibility here. <br>
<br>
<blockquote type="cite"
cite="mid:BLUPR12MB0449B1E12216845D4B52FC63844B0@BLUPR12MB0449.namprd12.prod.outlook.com"><font
size="2" face="Calibri"><span style="font-size:10.5pt;">
<div><font face="Times New Roman"> </font></div>
<div>Sample code:</div>
<div> </div>
<div>Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out,
<font face="Courier New">…</font><font face="Courier New">..</font>)
{</div>
<div> if (ctx- >vram_lost_counter !=
adev->vram_lost_counter)</div>
<div> out- >status |=
AMDGPU_CTX_STATE_VRAM_LOST;</div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> <font
face="Calibri">if (ctx- >reset_counter != adev</font><font
face="等线">→</font><font face="Calibri">reset_counter)</font><font
face="Calibri"> {</font></font></div>
<div><font face="Times New Roman"> <font
face="Calibri">out- >status |=
AMDGPU_CTX_STATE_RESET;</font></font></div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> <font
face="Calibri">if (ctx- >guilty == TRUE)</font></font></div>
<div> out- >status |=
AMDGPU_CTX_STATE_GUILTY;</div>
<div><font face="Times New Roman"> <font face="Courier
New">}</font></font></div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> <font
face="Calibri">return 0;</font></font></div>
<div>}</div>
<div><font face="Times New Roman"> </font></div>
<div>For UMD if it found "out.status == 0" means there is no
gpu reset even occurred, the context is totally regular<font
face="Courier New">,</font></div>
<div><font face="Times New Roman"> </font></div>
<ul style="margin:0;padding-left:21pt;">
<li><b>A</b><b> </b><b>new IOCTL added for context:</b></li>
</ul>
<div style="padding-left:21pt;"><font face="Times New Roman"> </font></div>
<div>Void amdgpu_ctx_reinit(){</div>
<div> Ctx<font face="等线">→</font>vram_lost_counter =
adev<font face="等线">→</font>vram_lost_counter;</div>
<div> Ctx<font face="等线">→</font>reset_counter = adev<font
face="等线">→</font>reset_counter;</div>
<div><font face="Courier New">}</font></div>
</span></font></blockquote>
<br>
Mhm, is there any advantage to just creating a new context?<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite"
cite="mid:BLUPR12MB0449B1E12216845D4B52FC63844B0@BLUPR12MB0449.namprd12.prod.outlook.com"><font
size="2" face="Calibri"><span style="font-size:10.5pt;">
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> </font></div>
<div><b>if</b><b> UMD decide *not* to release the "guilty"
context but continue using it </b><b>after</b><b> UMD
acknowledged GPU hang </b><b>on certain job/context, I
suggest </b><b>UMD call "amdgpu_ctx_reinit()":</b></div>
<div><font face="Times New Roman"> </font></div>
<div>That way after you re<font face="Courier New">-</font>init()
this context, you can get updated result from
"amdgpu_ctx_query", which will probably give you "out.status
== 0" as long as no gpu reset/vram lost hit after re-init()<font
face="Courier New">.</font></div>
<div><font face="Times New Roman"> </font></div>
<div>BR Monk</div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> </font></div>
<div>-----Original Message-----<br>
From: Koenig, Christian <br>
Sent: 2017<font face="等线">年</font>10<font face="等线">月</font>12<font
face="等线">日</font> 18:13<br>
To: Haehnle, Nicolai <a class="moz-txt-link-rfc2396E" href="mailto:Nicolai.Haehnle@amd.com"><Nicolai.Haehnle@amd.com></a>; Michel
D<font face="Courier New">ä</font>nzer
<a class="moz-txt-link-rfc2396E" href="mailto:michel@daenzer.net"><michel@daenzer.net></a>; Liu, Monk
<a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>; Olsak, Marek
<a class="moz-txt-link-rfc2396E" href="mailto:Marek.Olsak@amd.com"><Marek.Olsak@amd.com></a>; Deucher, Alexander
<a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a>; Zhou, David(ChunMing)
<a class="moz-txt-link-rfc2396E" href="mailto:David1.Zhou@amd.com"><David1.Zhou@amd.com></a>;
Mao, David <a class="moz-txt-link-rfc2396E" href="mailto:David.Mao@amd.com"><David.Mao@amd.com></a><br>
Cc: Ramirez, Alejandro <a class="moz-txt-link-rfc2396E" href="mailto:Alejandro.Ramirez@amd.com"><Alejandro.Ramirez@amd.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>; Filipas, Mario
<a class="moz-txt-link-rfc2396E" href="mailto:Mario.Filipas@amd.com"><Mario.Filipas@amd.com></a>; Ding, Pixel
<a class="moz-txt-link-rfc2396E" href="mailto:Pixel.Ding@amd.com"><Pixel.Ding@amd.com></a>; Li, Bingley
<a class="moz-txt-link-rfc2396E" href="mailto:Bingley.Li@amd.com"><Bingley.Li@amd.com></a>; Jiang, Jerry (SW)
<a class="moz-txt-link-rfc2396E" href="mailto:Jerry.Jiang@amd.com"><Jerry.Jiang@amd.com></a><br>
Subject: Re: TDR and VRAM lost handling in KMD (v2)</div>
<div><font face="Times New Roman"> </font></div>
<div>Am 12.10.2017 um 11:44 schrieb Nicolai H<font
face="Courier New">ä</font>hnle:</div>
<div>> On 12.10.2017 11:35, Michel D<font face="Courier
New">ä</font>nzer wrote:</div>
<div>>> On 12/10/17 11:23 AM, Christian K<font
face="Courier New">ö</font>nig wrote:</div>
<div>>>> Am 12.10.2017 um 11:10 schrieb Nicolai H<font
face="Courier New">ä</font>hnle:</div>
<div>>>>> On 12.10.2017 10:49, Christian K<font
face="Courier New">ö</font>nig wrote:</div>
<div>>>>>>> However, !guilty &&
ctx->reset_counter != adev->reset_counter </div>
<div>>>>>>> does not imply that the context
was lost.</div>
<div>>>>>>></div>
<div>>>>>>> The way I understand it, we
should return </div>
<div>>>>>>> AMDGPU_CTX_INNOCENT_RESET if
!guilty && ctx->vram_lost_counter !=
adev->vram_lost_counter.</div>
<div>>>>>>></div>
<div>>>>>>> As far as I understand it, the
case of !guilty && </div>
<div>>>>>>> ctx->reset_counter !=
adev->reset_counter && </div>
<div>>>>>>> ctx->vram_lost_counter ==</div>
<div>>>>>>> adev->vram_lost_counter
should return AMDGPU_CTX_NO_RESET, </div>
<div>>>>>>> adev->because a</div>
<div>>>>>>> GPU reset occurred, but it
didn't affect our context.</div>
<div>>>>>> I disagree on that.</div>
<div>>>>>></div>
<div>>>>>> AMDGPU_CTX_INNOCENT_RESET just means
what it does currently, there </div>
<div>>>>>> was a reset but we haven't been
causing it.</div>
<div>>>>>></div>
<div>>>>>> That the OpenGL extension is
specified otherwise is unfortunate, </div>
<div>>>>>> but I think we shouldn't use that
for the kernel interface here.</div>
<div>>>>> Two counterpoints:</div>
<div>>>>></div>
<div>>>>> 1. Why should any application care that
there was a reset while it </div>
<div>>>>> was idle? The OpenGL behavior is what
makes sense.</div>
<div>>>></div>
<div>>>> The application is certainly not interest if
a reset happened or </div>
<div>>>> not, but I though that the driver stack
might be.</div>
<div>>>></div>
<div>>>>></div>
<div>>>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't
actually mean anything today </div>
<div>>>>> because we never return it :)</div>
<div>>>>></div>
<div>>>></div>
<div>>>> Good point.</div>
<div>>>></div>
<div>>>>> amdgpu_ctx_query only ever returns
AMDGPU_CTX_UNKNOWN_RESET, which </div>
<div>>>>> is in line with the OpenGL spec: we're
conservatively returning </div>
<div>>>>> that a reset happened because we don't
know whether the context was </div>
<div>>>>> affected, and we return UNKNOWN because
we also don't know whether </div>
<div>>>>> the context was guilty or not.</div>
<div>>>>></div>
<div>>>>> Returning AMDGPU_CTX_NO_RESET in the
case of !guilty &&</div>
<div>>>>> ctx->vram_lost_counter ==
adev->vram_lost_counter is simply a</div>
<div>>>>> refinement and improvement of the
current, overly conservative </div>
<div>>>>> behavior.</div>
<div>>>></div>
<div>>>> Ok let's reenumerate what I think the
different return values should</div>
<div>>>> mean:</div>
<div>>>></div>
<div>>>> * AMDGPU_CTX_GUILTY_RESET</div>
<div>>>></div>
<div>>>> guilty is set to true for this context.</div>
<div>>>></div>
<div>>>> * AMDGPU_CTX_INNOCENT_RESET</div>
<div>>>></div>
<div>>>> guilty is not set and vram_lost_counter has
changed.</div>
<div>>>></div>
<div>>>> * AMDGPU_CTX_UNKNOWN_RESET</div>
<div>>>></div>
<div>>>> guilty is not set and vram_lost_counter has
not changed, but </div>
<div>>>> gpu_reset_counter has changed.</div>
<div>>></div>
<div>>> I don't understand the distinction you're
proposing between </div>
<div>>> AMDGPU_CTX_INNOCENT_RESET and
AMDGPU_CTX_UNKNOWN_RESET. I think both </div>
<div>>> cases you're describing should return either </div>
<div>>> AMDGPU_CTX_INNOCENT_RESET, if the value of
guilty is reliable, or </div>
<div>>> AMDGPU_CTX_UNKNOWN_RESET if it's not.</div>
<div>></div>
<div>> I think it'd make more sense if it was called </div>
<div>> "AMDGPU_CTX_UNAFFECTED_RESET".</div>
<div>></div>
<div>> So:</div>
<div>> - AMDGPU_CTX_GUILTY_RESET --> the context was
affected by a reset, and </div>
<div>> we know that it's the context's fault</div>
<div>> - AMDGPU_CTX_INNOCENT_RESET --> the context was
affected by a reset, </div>
<div>> and we know that it *wasn't* the context's fault (no
context job </div>
<div>> active)</div>
<div>> - AMDGPU_CTX_UNKNOWN_RESET --> the context was
affected by a reset, </div>
<div>> and we don't know who's responsible (this could be
returned in the </div>
<div>> unlikely case where context A's gfx job has not yet
finished, but </div>
<div>> context B's gfx job has already started; it could be
the fault of A, </div>
<div>> it could be the fault of B -- which somehow manages
to hang a part of </div>
<div>> the hardware that then prevents A's job from
finishing -- or it could </div>
<div>> be both; but it's a bit academic)</div>
<div>> - AMDGPU_CTX_UNAFFECTED_RESET --> there was a
reset, but this context </div>
<div>> wasn't affected</div>
<div>></div>
<div>> This last value would currently just be discarded by
Mesa (because we </div>
<div>> should only disturb applications when we have to),
but perhaps </div>
<div>> somebody else could find it useful?</div>
<div> </div>
<div>Yes, that's what I had in mind as well.</div>
<div> </div>
<div>Cause otherwise we would return AMDGPU_CTX_NO_RESET while
there actually was a reset and that certainly doesn't sound
correct to me.</div>
<div> </div>
<div>Regards,</div>
<div>Christian.</div>
<div> </div>
<div>></div>
<div>> Cheers,</div>
<div>> Nicolai</div>
<div> </div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> </font></div>
</span></font>
</blockquote>
<p><br>
</p>
</body>
</html>