<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">
      <blockquote type="cite"><span
          style="font-size:12.0pt;font-family:SimSun" lang="EN-US"></span>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
            I think context is better than entity. For example, if you
            only block entity_0 of a context and allow entity_N to run,
            the dependencies between entities are broken (e.g. page
            table updates in </span></p>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US">the
            SDMA entity pass but the gfx submit in the GFX entity is
            blocked, which does not make sense to me).</span></p>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US">We</span><span
            style="font-size:12.0pt;font-family:SimSun">’<span
              lang="EN-US">d better either block the whole context or
              let not</span>…
          </span></p>
      </blockquote>
      Page table updates are not part of any context.<br>
      <br>
      So I think the only thing we can do is to mark the entity as not
      scheduled any more.<br>
      <br>
      <blockquote type="cite"><span
          style="font-size:12.0pt;font-family:SimSun" lang="EN-US"></span>
        <p class="MsoListParagraph"
          style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l1
          level1 lfo6">
          <span lang="EN-US"><span style="mso-list:Ignore">1.<span
                style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              </span></span></span><span lang="EN-US">Kick out all jobs
            in this “guilty” ctx’s KFIFO queue, and set all their fence
            status to “<b>ECANCELED</b>”</span></p>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Setting
            ECANCELED should be ok. But I think we should do this when
            we try to run the jobs and not during GPU reset.</span></p>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US"> </span></p>
        <p class="MsoNormal"><span
            style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
            Without deeper thought and experiments, I</span><span
            style="font-size:12.0pt;font-family:SimSun">’<span
              lang="EN-US">m not sure of the difference between them,
              but kicking them out in the gpu_reset routine is more
              efficient, </span></span></p>
      </blockquote>
      I really don't think so. Kicking them out during gpu_reset sounds
      racy to me once more.<br>
      <br>
      And marking them canceled when we try to run them has the clear
      advantage that all dependencies are met first.<br>
      <br>
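      To make the ordering concrete, here is a minimal sketch as a toy
      Python model rather than driver code. The names Fence, Entity, Job
      and run_job are made up for this example, and reporting
      -ECANCELED through the fence error follows the proposal quoted
      above:<br>
<pre>
```python
import errno
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Toy fence: 'error' stays 0 unless the job was canceled."""
    signaled: bool = False
    error: int = 0


@dataclass
class Entity:
    """Toy scheduler entity; 'canceled' is the flag set after a timeout."""
    name: str
    canceled: bool = False


@dataclass
class Job:
    """Toy job: belongs to one entity and waits on a list of fences."""
    entity: Entity
    deps: list = field(default_factory=list)
    fence: Fence = field(default_factory=Fence)


def run_job(job: Job, executed: list) -> bool:
    """Try to run one job; return False while a dependency is pending."""
    # Dependencies are honored first, exactly as for a normal job.
    if not all(dep.signaled for dep in job.deps):
        return False
    if job.entity.canceled:
        # Skip the hardware submission and only report the cancellation.
        job.fence.error = -errno.ECANCELED
    else:
        executed.append(job)
    job.fence.signaled = True
    return True
```
</pre>
      In this model a canceled job still waits for its dependencies and
      only then signals its fence with the error set, so nothing that
      waits on that fence can start too early.<br>
      <br>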
      <blockquote type="cite">
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US">ML:
            The KMD marks all contexts as guilty because that way we can
            unify our IOCTL behavior: e.g. the IOCTLs only block
            “guilty” contexts, with no need to worry about the
            vram-lost-counter anymore. That’s an implementation style; I
            don’t think it is related to the UMD layer.</span></p>
        <span style="font-family:DengXian;color:windowtext" lang="EN-US"></span></blockquote>
      I don't think that this is a good idea. Instead, if you want to
      unify the behavior, we should use the vram_lost_counter as the
      marker for the guilty context.<br>
      <br>
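      A rough sketch of what I mean, as a toy Python model and not
      amdgpu code. Device, Context and cs_submit are hypothetical names,
      and rejecting the submission with -ECANCELED is only an assumption
      made for the example:<br>
<pre>
```python
import errno
from dataclasses import dataclass


@dataclass
class Device:
    """Toy device; the counter is bumped whenever a reset loses VRAM."""
    vram_lost_counter: int = 0

    def reset(self, vram_lost: bool) -> None:
        if vram_lost:
            self.vram_lost_counter += 1


class Context:
    """Toy context that remembers the counter value seen at creation."""

    def __init__(self, dev: Device) -> None:
        self.dev = dev
        self.vram_lost_counter = dev.vram_lost_counter

    def is_stale(self) -> bool:
        # Any mismatch means VRAM was lost after this context was created.
        return self.vram_lost_counter != self.dev.vram_lost_counter


def cs_submit(ctx: Context) -> int:
    """Hypothetical submit check: reject work from a stale context."""
    if ctx.is_stale():
        return -errno.ECANCELED
    return 0
```
</pre>
      Nothing has to walk over all contexts at reset time in this
      model: a context created before the VRAM loss is detected by the
      counter mismatch on its next submission, while a context created
      afterwards is accepted.<br>
      <br>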
      Regards,<br>
      Christian.<br>
      <br>
      On 11.10.2017 at 10:48, Liu, Monk wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:BLUPR12MB0449287A92DF8D3EB30BE6A6844A0@BLUPR12MB0449.namprd12.prod.outlook.com">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:Wingdings;
        panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
        {font-family:SimSun;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:??;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        text-align:justify;
        font-size:10.5pt;
        font-family:??;
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin:0cm;
        margin-bottom:.0001pt;
        text-align:justify;
        text-indent:21.0pt;
        font-size:10.5pt;
        font-family:??;
        color:black;}
p.msonormal0, li.msonormal0, div.msonormal0
        {mso-style-name:msonormal;
        mso-margin-top-alt:auto;
        margin-right:0cm;
        mso-margin-bottom-alt:auto;
        margin-left:0cm;
        font-size:12.0pt;
        font-family:??;
        color:black;}
span.EmailStyle19
        {mso-style-type:personal;
        font-family:??;
        color:windowtext;}
span.EmailStyle20
        {mso-style-type:personal;
        font-family:??;
        color:windowtext;}
span.EmailStyle23
        {mso-style-type:personal-reply;
        font-family:DengXian;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 90.0pt 72.0pt 90.0pt;}
div.WordSection1
        {page:WordSection1;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style>
      <div class="WordSection1">
        <p><span style="font-size:12.0pt;color:black" lang="EN-US"><o:p> </o:p></span></p>
        <p><span style="font-size:12.0pt;color:black" lang="EN-US">On
            "guilty": "guilty" is a term that's used by APIs (e.g.
            OpenGL), so it's reasonable to use it. However, it
            <i>does not</i> make sense to mark idle contexts as "guilty"
            just because VRAM is lost. VRAM lost is a perfect example
            where the driver should report context lost to applications
            with the "innocent" flag for contexts that were idle at the
            time of reset. The only context(s) that should be reported
            as "guilty" (or perhaps "unknown" in some cases) are the
            ones that were executing at the time of reset.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US">ML:
            The KMD marks all contexts as guilty because that way we can
            unify our IOCTL behavior: e.g. the IOCTLs only block
            “guilty” contexts, with no need to worry about the
            vram-lost-counter anymore. That’s an implementation style; I
            don’t think it is related to the UMD layer.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US">For
            the UMD, the KMD isn’t aware of the gl-context, so the UMD
            can implement its own “guilty” gl-context if you want.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US">If
            the KMD doesn’t mark all ctx as guilty after VRAM lost, can
            you illustrate what rule the KMD should follow when checking
            in a KMS IOCTL like cs_submit? Let’s see which way is better.
            <o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
            style="font-family:DengXian;color:windowtext" lang="EN-US"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0cm 0cm 0cm">
            <p class="MsoNormal" style="text-align:left" align="left"><b><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext"
                  lang="EN-US">From:</span></b><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext"
                lang="EN-US"> Haehnle, Nicolai <br>
                <b>Sent:</b> Wednesday, October 11, 2017 4:41 PM<br>
                <b>To:</b> Liu, Monk <a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com">&lt;Monk.Liu@amd.com&gt;</a>; Koenig,
                Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com">&lt;Christian.Koenig@amd.com&gt;</a>; Olsak, Marek
                <a class="moz-txt-link-rfc2396E" href="mailto:Marek.Olsak@amd.com">&lt;Marek.Olsak@amd.com&gt;</a>; Deucher, Alexander
                <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com">&lt;Alexander.Deucher@amd.com&gt;</a><br>
                <b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>; Ding, Pixel
                <a class="moz-txt-link-rfc2396E" href="mailto:Pixel.Ding@amd.com">&lt;Pixel.Ding@amd.com&gt;</a>; Jiang, Jerry (SW)
                <a class="moz-txt-link-rfc2396E" href="mailto:Jerry.Jiang@amd.com">&lt;Jerry.Jiang@amd.com&gt;</a>; Li, Bingley
                <a class="moz-txt-link-rfc2396E" href="mailto:Bingley.Li@amd.com">&lt;Bingley.Li@amd.com&gt;</a>; Ramirez, Alejandro
                <a class="moz-txt-link-rfc2396E" href="mailto:Alejandro.Ramirez@amd.com">&lt;Alejandro.Ramirez@amd.com&gt;</a>; Filipas, Mario
                <a class="moz-txt-link-rfc2396E" href="mailto:Mario.Filipas@amd.com">&lt;Mario.Filipas@amd.com&gt;</a><br>
                <b>Subject:</b> Re: TDR and VRAM lost handling in KMD:<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal" style="text-align:left" align="left"><span
            lang="EN-US"><o:p> </o:p></span></p>
        <div id="divtagdefaultwrapper">
          <p><span style="font-size:12.0pt;color:black" lang="EN-US">From
              a Mesa perspective, this almost all sounds reasonable to
              me.<o:p></o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US"><o:p> </o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US">On
              "guilty": "guilty" is a term that's used by APIs (e.g.
              OpenGL), so it's reasonable to use it. However, it
              <i>does not</i> make sense to mark idle contexts as
              "guilty" just because VRAM is lost. VRAM lost is a perfect
              example where the driver should report context lost to
              applications with the "innocent" flag for contexts that
              were idle at the time of reset. The only context(s) that
              should be reported as "guilty" (or perhaps "unknown" in
              some cases) are the ones that were executing at the time
              of reset.<o:p></o:p></span></p>
          <p class="MsoNormal" style="text-align:left" align="left"><span
style="font-size:12.0pt;font-family:&quot;Calibri&quot;,sans-serif"
              lang="EN-US"><o:p> </o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US">On
              whether the whole context is marked as guilty from a user
              space perspective, it would simply be nice for user space
              to get consistent answers. It would be a bit odd if we
              could e.g. succeed in submitting an SDMA job after a GFX
              job was rejected. This would point in favor of marking the
              entire context as guilty (although that could happen
              lazily instead of at reset time). On the other hand, if
              that's too big a burden for the kernel implementation I'm
              sure we can live without it.<o:p></o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US"><o:p> </o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US">Cheers,<o:p></o:p></span></p>
          <p><span style="font-size:12.0pt;color:black" lang="EN-US">Nicolai<o:p></o:p></span></p>
        </div>
        <div class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext"
            lang="EN-US">
            <hr size="3" align="center" width="98%">
          </span></div>
        <div id="divRplyFwdMsg">
          <p class="MsoNormal" style="text-align:left" align="left"><b><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif"
                lang="EN-US">From:</span></b><span
              style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif"
              lang="EN-US"> Liu, Monk<br>
              <b>Sent:</b> Wednesday, October 11, 2017 10:15:40 AM<br>
              <b>To:</b> Koenig, Christian; Haehnle, Nicolai; Olsak,
              Marek; Deucher, Alexander<br>
              <b>Cc:</b> <a href="mailto:amd-gfx@lists.freedesktop.org"
                moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
              Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez,
              Alejandro; Filipas, Mario<br>
              <b>Subject:</b> RE: TDR and VRAM lost handling in KMD:</span><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext"
              lang="EN-US">
              <o:p></o:p></span></p>
          <div>
            <p class="MsoNormal" style="text-align:left" align="left"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext"
                lang="EN-US"> <o:p></o:p></span></p>
          </div>
        </div>
        <div>
          <p class="MsoListParagraph"
            style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l0
            level1 lfo2">
            <!--[if !supportLists]--><span lang="EN-US"><span
                style="mso-list:Ignore">1.<span style="font:7.0pt
                  &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                </span></span></span><!--[endif]--><span lang="EN-US">Set
              its fence error status to “<b>ETIME</b>”,<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">No,
              as I already explained, ETIME is for synchronous operation.<br>
              <br>
              In other words, when we return ETIME from the wait IOCTL,
              it would mean that the wait itself has somehow timed out,
              not the job we waited for.<br>
              <br>
              Please use ECANCELED as well, or some other error code if
              we find that we need to distinguish the timed-out job from
              the canceled ones (probably a good idea, but I'm not sure).<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              I</span><span style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">m okay with it if you insist on not using ETIME.<o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoListParagraph"
            style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l4
            level1 lfo4">
            <!--[if !supportLists]--><span lang="EN-US"><span
                style="mso-list:Ignore">1.<span style="font:7.0pt
                  &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                </span></span></span><!--[endif]--><span lang="EN-US">Find
              the entity/ctx behind this job, and set this ctx as “<b>guilty</b>”<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Not
              sure. Do we want to set the whole context as guilty or
              just the entity?<br>
              <br>
              Setting the whole context as guilty sounds racy to me.<br>
              <br>
              BTW: We should use a different name than "guilty", maybe
              just "bool canceled;" ?<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              I think context is better than entity, because for example
              if you only block entity_0 of context and allow entity_N
              run, that means the dependency between entities are broken
              (e.g. page table updates in <o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">SDMA
              entity pass, but the gfx submit in the GFX entity is blocked;
              that does not make sense to me).<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">We</span><span
              style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">d better either block the whole context or
                not</span>…
              <span lang="EN-US"><o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoListParagraph"
            style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l1
            level1 lfo6">
            <!--[if !supportLists]--><span lang="EN-US"><span
                style="mso-list:Ignore">1.<span style="font:7.0pt
                  "Times New Roman"">       
                </span></span></span><!--[endif]--><span lang="EN-US">Kick
              out all jobs in this “guilty” ctx’s KFIFO queue, and set
              all their fence status to “<b>ECANCELED</b>”<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Setting
              ECANCELED should be ok. But I think we should do this when
              we try to run the jobs and not during GPU reset.<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              Without deeper thought and experiment, I</span><span
              style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">m not sure what the difference between them is,
                but kicking them out in the gpu_reset routine is more efficient. <o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Otherwise
              you need to check the context/entity guilty flag in the run_job
              routine
            </span><span style="font-size:12.0pt;font-family:SimSun">…<span
                lang="EN-US"> and you need to do it for every
                context/entity. I don</span>’<span lang="EN-US">t see
                why
                <o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">we
              don</span><span
              style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">t just kick all of them out in the gpu_reset
                stage
              </span>…<span lang="EN-US">.<o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoListParagraph"
            style="margin-left:42.0pt;text-indent:-21.0pt;mso-list:l3
            level2 lfo8">
            <!--[if !supportLists]--><span lang="EN-US"><span
                style="mso-list:Ignore">a)<span style="font:7.0pt
                  "Times New Roman"">      
                </span></span></span><!--[endif]--><span lang="EN-US">Iterate
              over all living ctx, and set all ctx as “<b>guilty</b>”
              since VRAM lost actually ruins all VRAM contents<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">No,
              that should be done by comparing the counters.
              Iterating over all contexts is way too much overhead.<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              Because I want to keep the KMS IOCTL rules clean: they
              don</span><span
              style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">t need to differentiate between VRAM lost or not;
                they are only interested in whether the context is guilty,
                and block<o:p></o:p></span></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">submits
              for guilty ones.
              <o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><b><span
                style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Can
                you give more details of your idea, ideally the detailed
                implementation in cs_submit? I want to see how you
                would block submission without checking the context guilty
                flag.<o:p></o:p></span></b></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoListParagraph"
            style="margin-left:42.0pt;text-indent:-21.0pt;mso-list:l5
            level2 lfo10">
            <!--[if !supportLists]--><span lang="EN-US"><span
                style="mso-list:Ignore">a)<span style="font:7.0pt
                  "Times New Roman"">      
                </span></span></span><!--[endif]--><span lang="EN-US">Kick
              out all jobs in all ctx’s KFIFO queue, and set all their
              fence status to “<b>ECANCELED</b>”<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Yes
              and no, that should be done when we try to run the jobs
              and not during GPU reset.<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              Again, kicking them out in the GPU reset routine is more
              efficient; otherwise you need to check every job in
              run_job().<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Besides,
              can you illustrate the detailed implementation?<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Yes
              and no, dma_fence_get_status() is some specific handling
              for sync_file debugging (no idea why that made it into the
              common fence code).<br>
              <br>
              It was replaced by putting the error code directly into
              the fence, so just reading that one after waiting should
              be ok.<br>
              <br>
              Maybe we should fix dma_fence_get_status() to do the right
              thing for this?<o:p></o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">[ML]
              Yeah, that</span><span
              style="font-size:12.0pt;font-family:SimSun">’<span
                lang="EN-US">s too confusing. The name sounds like exactly the
                one I want to use; we should change it</span>…<span
                lang="EN-US"><o:p></o:p></span></span></p>
          <p class="MsoNormal"><b><span
                style="font-size:12.0pt;font-family:SimSun" lang="EN-US">But
                looking at the implementation, I don</span></b><b><span
                style="font-size:12.0pt;font-family:SimSun">’<span
                  lang="EN-US">t see why we cannot use it. It also
                  ends up returning fence->error. <o:p></o:p></span></span></b></p>
          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US"><o:p> </o:p></span></p>
          <div>
            <div style="border:none;border-top:solid #E1E1E1
              1.0pt;padding:3.0pt 0cm 0cm 0cm">
              <p class="MsoNormal" style="text-align:left" align="left"><b><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:windowtext"
                    lang="EN-US">From:</span></b><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:windowtext"
                  lang="EN-US"> Koenig, Christian <br>
                  <b>Sent:</b> Wednesday, October 11, 2017 3:21 PM<br>
                  <b>To:</b> Liu, Monk <<a
                    href="mailto:Monk.Liu@amd.com"
                    moz-do-not-send="true">Monk.Liu@amd.com</a>>;
                  Haehnle, Nicolai <<a
                    href="mailto:Nicolai.Haehnle@amd.com"
                    moz-do-not-send="true">Nicolai.Haehnle@amd.com</a>>;
                  Olsak, Marek <<a href="mailto:Marek.Olsak@amd.com"
                    moz-do-not-send="true">Marek.Olsak@amd.com</a>>;
                  Deucher, Alexander <<a
                    href="mailto:Alexander.Deucher@amd.com"
                    moz-do-not-send="true">Alexander.Deucher@amd.com</a>><br>
                  <b>Cc:</b> <a
                    href="mailto:amd-gfx@lists.freedesktop.org"
                    moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
                  Ding, Pixel <<a href="mailto:Pixel.Ding@amd.com"
                    moz-do-not-send="true">Pixel.Ding@amd.com</a>>;
                  Jiang, Jerry (SW) <<a
                    href="mailto:Jerry.Jiang@amd.com"
                    moz-do-not-send="true">Jerry.Jiang@amd.com</a>>;
                  Li, Bingley <<a href="mailto:Bingley.Li@amd.com"
                    moz-do-not-send="true">Bingley.Li@amd.com</a>>;
                  Ramirez, Alejandro <<a
                    href="mailto:Alejandro.Ramirez@amd.com"
                    moz-do-not-send="true">Alejandro.Ramirez@amd.com</a>>;
                  Filipas, Mario <<a
                    href="mailto:Mario.Filipas@amd.com"
                    moz-do-not-send="true">Mario.Filipas@amd.com</a>><br>
                  <b>Subject:</b> Re: TDR and VRAM lost handling in KMD:<o:p></o:p></span></p>
            </div>
          </div>
          <p class="MsoNormal" style="text-align:left" align="left"><span
              lang="EN-US"><o:p> </o:p></span></p>
          <div>
            <p class="MsoNormal" style="text-align:left" align="left"><span
                lang="EN-US">See inline:<br>
                <br>
                On 11.10.2017 at 07:33, Liu, Monk wrote:</span><span
                style="font-size:12.0pt" lang="EN-US"><o:p></o:p></span></p>
          </div>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal"><span lang="EN-US">Hi Christian &
                Nicolai,<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">We need to reach
                agreement on what MESA/UMD should do and what
                KMD should do.
                <b>Please comment with “okay” or “No”, plus
                  your ideas, on the items below:</b><o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">When
                a job times out (the timeout is set by the lockup_timeout kernel
                parameter), what KMD should do in the TDR routine:<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">1.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Update
                adev-><b>gpu_reset_counter</b> and stop the scheduler
                first (<b>gpu_reset_counter</b> is used to force a VM
                flush after GPU reset; that is out of this thread’s scope, so no
                more discussion of it)<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">2.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Set
                its fence error status to “<b>ETIME</b>”,<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">No,
              as I already explained, ETIME is for synchronous operations.<br>
              <br>
              In other words when we return ETIME from the wait IOCTL it
              would mean that the waiting has somehow timed out, but not
              the job we waited for.<br>
              <br>
              Please use ECANCELED as well, or some other error code if
              we find that we need to distinguish the timed-out job from the
              canceled ones (probably a good idea, but I'm not sure).<br>
              <br>
              <o:p></o:p></span></p>
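          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">A
              minimal user-space sketch of the distinction made above. It is
              not amdgpu or dma_fence code: struct fence, fence_set_error(),
              fence_signal() and wait_fence() are hypothetical stand-ins. The
              wait returns -ETIME only when the wait itself ran out of time;
              the fate of the job is carried by the error stored in the fence.</span></p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fence: the error is recorded before signaling. */
struct fence {
    bool signaled;
    int error;      /* 0, or a negative errno describing why the job failed */
};

/* Record why the job failed; only valid before the fence has signaled. */
void fence_set_error(struct fence *f, int error)
{
    assert(!f->signaled && error < 0);
    f->error = error;
}

void fence_signal(struct fence *f)
{
    f->signaled = true;
}

/*
 * Model of a wait whose timeout has already expired when we look at the
 * fence. -ETIME describes the wait itself; the outcome of the job is
 * reported through the error stored in the fence.
 */
int wait_fence(const struct fence *f)
{
    if (!f->signaled)
        return -ETIME;  /* the wait expired; the job may still be running */
    return f->error;    /* 0 on success, e.g. -ECANCELED for a canceled job */
}
```

          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">With
              this split, a caller that sees -ETIME knows it can simply wait
              again, while -ECANCELED tells it the job will never complete.</span></p>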
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">3.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Find
                the entity/ctx behind this job, and set this ctx as “<b>guilty</b>”<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Not
              sure. Do we want to set the whole context as guilty or
              just the entity?<br>
              <br>
              Setting the whole context as guilty sounds racy to me.<br>
              <br>
              BTW: We should use a different name than "guilty", maybe
              just "bool canceled;" ?<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">4.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Kick
                out this job from scheduler’s mirror list, so this job
                won’t get re-scheduled to ring anymore.<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">5.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Kick
                out all jobs in this “guilty” ctx’s KFIFO queue, and set
                all their fence status to “<b>ECANCELED</b>”<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Setting
              ECANCELED should be ok. But I think we should do this when
              we try to run the jobs and not during GPU reset.<br>
              <br>
              <o:p></o:p></span></p>
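          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">One
              way to picture canceling at run time instead of during GPU
              reset, as a user-space sketch only. The types and the names
              run_job() and mark_entity_canceled() are hypothetical and do not
              mirror the real scheduler structures. The reset path just sets
              a flag on the entity; each queued job is canceled when the
              scheduler picks it up.</span></p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entity flag, set once by the reset path. */
struct entity {
    bool canceled;
};

struct job {
    struct entity *entity;
    bool fence_signaled;
    int fence_error;
    bool submitted;     /* true once the job was pushed to the "ring" */
};

/* The reset path only flags the entity; it does not walk the entity's queue. */
void mark_entity_canceled(struct entity *e)
{
    e->canceled = true;
}

/* Called when the scheduler is about to run a queued job. */
void run_job(struct job *job)
{
    assert(job != NULL && job->entity != NULL);

    if (job->entity->canceled) {
        /* Never reaches the hardware: set the error, then force-signal
         * the fence so that nobody blocks on it forever. */
        job->fence_error = -ECANCELED;
        job->fence_signaled = true;
        return;
    }
    job->submitted = true;
}
```

          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">The
              cost is one flag check per job in run_job(); in exchange the
              reset path does not need to touch the queues of every entity.</span></p>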
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">6.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Force
                signal all fences that get kicked out by above two
                steps,<b> otherwise UMD will block forever if waiting on
                  those fences</b><o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">7.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Do
                the GPU reset, which can be a set of callbacks that let
                bare-metal and SR-IOV each implement it in their preferred style
                <o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">8.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">After
                reset, KMD needs to know whether VRAM was lost.
                Bare-metal can implement some function to judge this,
                while for SR-IOV I prefer to read it from the GIM side (for
                the initial version we assume VRAM is always lost, until
                the GIM side change is aligned)<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">9.<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">If
                VRAM was not lost, continue; otherwise:<o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:42.0pt;text-indent:-21.0pt;mso-list:l6
              level2 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">a)<span style="font:7.0pt
                    "Times New Roman"">      
                  </span></span></span><!--[endif]--><span lang="EN-US">Update
                adev-><b>vram_lost_counter</b>,<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:42.0pt;text-indent:-21.0pt;mso-list:l6
              level2 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">b)<span style="font:7.0pt
                    "Times New Roman"">      
                  </span></span></span><!--[endif]--><span lang="EN-US">Iterate
                over all living ctx, and set all ctx as “<b>guilty</b>”
                since VRAM lost actually ruins all VRAM contents<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">No,
              that should be done by comparing the counters.
              Iterating over all contexts is way too much overhead.<br>
              <br>
              <o:p></o:p></span></p>
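          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">A
              sketch of the counter comparison, assuming each context takes a
              snapshot of vram_lost_counter when it is created. The struct
              layouts and the functions ctx_init(), device_vram_lost() and
              cs_submit() here are illustrative only, not the actual amdgpu
              code. On VRAM loss only the device counter is bumped; stale
              contexts are detected lazily at submit time, so no iteration
              over contexts is needed.</span></p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct device {
    unsigned int vram_lost_counter;
};

struct context {
    const struct device *dev;
    unsigned int vram_lost_snapshot;    /* counter value at context creation */
};

/* Remember which "generation" of VRAM contents this context was created in. */
void ctx_init(struct context *ctx, const struct device *dev)
{
    assert(ctx != NULL && dev != NULL);
    ctx->dev = dev;
    ctx->vram_lost_snapshot = dev->vram_lost_counter;
}

/* Reset path: O(1), no walk over the existing contexts. */
void device_vram_lost(struct device *dev)
{
    dev->vram_lost_counter++;
}

/* Submit path: a context created before the last VRAM loss is rejected. */
int cs_submit(const struct context *ctx)
{
    if (ctx->vram_lost_snapshot != ctx->dev->vram_lost_counter)
        return -ECANCELED;
    return 0;
}
```

          <p class="MsoNormal"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">A
              context created after the loss picks up the new counter value
              and can submit normally.</span></p>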
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:42.0pt;text-indent:-21.0pt;mso-list:l6
              level2 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">c)<span style="font:7.0pt
                    "Times New Roman"">       
                  </span></span></span><!--[endif]--><span lang="EN-US">Kick
                out all jobs in all ctx’s KFIFO queue, and set all their
                fence status to “<b>ECANCELED</b>”<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Yes
              and no, that should be done when we try to run the jobs
              and not during GPU reset.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">10.<span style="font:7.0pt
                    "Times New Roman"">    
                  </span></span></span><!--[endif]--><span lang="EN-US">Do
                GTT recovery and VRAM page tables/entries recovery
                (optional, do we need it ???)<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Yes,
              that is still needed. As Nicolai explained we can't be
              sure that VRAM is still 100% correct even when it isn't
              cleared.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l6
              level1 lfo14">
              <!--[if !supportLists]--><span lang="EN-US"><span
                  style="mso-list:Ignore">11.<span style="font:7.0pt
                    "Times New Roman"">    
                  </span></span></span><!--[endif]--><span lang="EN-US">Re-schedule
                all jobs remaining in the mirror list to the ring again and
                restart the scheduler (in the VRAM lost case, no job will
                be re-scheduled)<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">For
                cs_wait() IOCTL:<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">After it finds the fence
                signaled, it should check with
                <b>“dma_fence_get_status” </b>to see if there is an error
                there,<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">and return the error
                status of the fence.<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Yes
              and no, dma_fence_get_status() is some specific handling
              for sync_file debugging (no idea why that made it into the
              common fence code).<br>
              <br>
              It was replaced by putting the error code directly into
              the fence, so just reading that one after waiting should
              be ok.<br>
              <br>
              Maybe we should fix dma_fence_get_status() to do the right
              thing for this?<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">For
                cs_wait_fences() IOCTL:<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">Similar to the above
                approach<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">For
                cs_submit() IOCTL:<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">It needs to check whether
                the current ctx has been marked as “<b>guilty</b>” and return “<b>ECANCELED</b>”
                if so<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">Introduce
                a new IOCTL to let UMD query
                <b>vram_lost_counter</b>:<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">This way, UMD can
                also block the app from submitting. As @Nicolai mentioned,
                we can cache one copy of
                <b>vram_lost_counter</b> when enumerating the physical device,
                and deny all <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">gl-contexts from
                submitting if the queried counter is bigger than the one
                cached in the physical device (looks like a little overkill to
                me, but easy to implement).
                <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">UMD can also return
                an error to the app when creating a gl-context if it finds the
                currently queried<b> vram_lost_counter
                </b>bigger than the one cached in the physical device.<o:p></o:p></span></p>
          </blockquote>
          <p class="MsoNormal"
            style="margin-bottom:12.0pt;text-align:left" align="left"><span
              style="font-size:12.0pt;font-family:SimSun" lang="EN-US">Okay.
              Already have a patch for this, please review that one if
              you haven't already done so.<br>
              <br>
              Regards,<br>
              Christian.<br>
              <br>
              <o:p></o:p></span></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">BTW: I realized that
                a gl-context is a little different from the kernel’s context.
                For the kernel, a BO is not tied to a context but only to
                an FD, while in UMD each BO has a gl-context behind it, so
                blocking submission in the UMD layer is also needed, although
                KMD will do its job as the bottom line.
                <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoListParagraph"
              style="margin-left:21.0pt;text-indent:-21.0pt;mso-list:l2
              level1 lfo12">
              <!--[if !supportLists]--><span
                style="font-family:Wingdings" lang="EN-US"><span
                  style="mso-list:Ignore">?<span style="font:7.0pt
                    "Times New Roman""> 
                  </span></span></span><!--[endif]--><span lang="EN-US">Basically
                “vram_lost_counter” is exposed by the kernel so that UMD
                can take control of the robustness extension feature; how
                to react is UMD’s call, and KMD only denies a “guilty” context
                from submitting.<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">Need your feedback,
                thx<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">We’d better get the TDR
                feature landed ASAP.<o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US"> <o:p></o:p></span></p>
            <p class="MsoNormal"><span lang="EN-US">BR Monk<o:p></o:p></span></p>
          </blockquote>
          <p><span lang="EN-US"><o:p> </o:p></span></p>
        </div>
      </div>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>