    <div class="moz-cite-prefix">On 2021-03-18 6:41 a.m., Zhang, Jack
      (Jian) wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:DM5PR1201MB02041495067BD0C46A8840B0BB699@DM5PR1201MB0204.namprd12.prod.outlook.com">
      
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
      <style>@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
        {font-family:"\@DengXian";
        panose-1:2 1 6 0 3 1 1 1 1 1;}@font-face
        {font-family:"Microsoft YaHei";
        panose-1:2 11 5 3 2 2 4 2 2 4;}@font-face
        {font-family:"\@Microsoft YaHei";}p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}div.WordSection1
        {page:WordSection1;}ol
        {margin-bottom:0in;}ul
        {margin-bottom:0in;}</style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
      <p style="font-family:Arial;font-size:11pt;color:#0078D7;margin:5pt;" align="Left">
        [AMD Official Use Only - Internal Distribution Only]<br>
      </p>
      <br>
      <div>
        <div class="WordSection1">
          <p class="MsoNormal">Hi, Andrey<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">Let me summarize the background of this
            patch:<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">In TDR resubmit step
            “amdgpu_device_recheck_guilty_jobs,<o:p></o:p></p>
          <p class="MsoNormal">It will submit first jobs of each ring
            and do guilty job re-check.
            <o:p></o:p></p>
          <p class="MsoNormal">At that point, We had to make sure each
            job is in the mirror list(or re-inserted back already).<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">But we found the current code never
            re-insert the job to mirror list in the 2<sup>nd</sup>, 3<sup>rd</sup>
            job_timeout thread(Bailing TDR thread).<o:p></o:p></p>
          <p class="MsoNormal">This not only will cause memleak of the
            bailing jobs. What’s more important, the 1<sup>st</sup> tdr
            thread can never iterate the bailing job and set its guilty
            status to a correct status.<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">Therefore, we had to re-insert the job(or
            even not delete node) for bailing job.<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">For the above V3 patch, the racing
            condition in my mind is:<o:p></o:p></p>
          <p class="MsoNormal">we cannot make sure all bailing jobs are
            finished before we do amdgpu_device_recheck_guilty_jobs.</p>
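>
> (For reference, roughly what that recheck step does - a simplified
> sketch of amdgpu_device_recheck_guilty_jobs(), not the exact code:)
>
> 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> 		struct amdgpu_ring *ring = adev->rings[i];
> 		struct drm_sched_job *s_job;
> 		struct dma_fence *fence;
>
> 		if (!ring || !ring->sched.thread)
> 			continue;
>
> 		/* the first job in the mirror list is the suspected
> 		 * guilty one - which is why every job must be back in
> 		 * the list by this point */
> 		s_job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
> 						 struct drm_sched_job, node);
> 		if (!s_job)
> 			continue;
>
> 		/* resubmit it alone and wait: if it hangs again it
> 		 * really is guilty, otherwise its karma is cleared */
> 		fence = s_job->sched->ops->run_job(s_job);
> 		if (dma_fence_wait_timeout(fence, false,
> 					   ring->sched.timeout) <= 0)
> 			drm_sched_increase_karma(s_job);
> 	}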

Yes, that race I missed - so you are saying that for the 2nd, bailing
thread that extracted the job, even if it re-inserts it right away
after the driver callback returns DRM_GPU_SCHED_STAT_BAILING, there is
a small time slot where the job is not in the mirror list, and so the
1st TDR might miss it and not find that the 2nd job is the actual
guilty job, right? But this job will still get back into the mirror
list, and since it really is the bad job it will never signal
completion, so on the next timeout cycle it will be caught (of course
there is a starvation scenario here if more TDRs kick in and it bails
out again, but that is really unlikely).
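
To spell the window out, the interleaving I mean is:

  TDR thread 2 (bailing)               TDR thread 1 (owns the reset)
  ----------------------               -----------------------------
  list_del_init(&job2->node)
                                       iterates the mirror list for
                                       the guilty re-check - job2 is
                                       absent, so its guilty status
                                       is never set
  timedout_job() returns
  DRM_GPU_SCHED_STAT_BAILING
  list_add(&job2->node, ...)           too late for this reset cycle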
          <p class="MsoNormal">Based on this insight, I think we have
            two options to solve this issue:<o:p></o:p></p>
          <ol style="margin-top:0in" type="1" start="1">
            <li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Skip
              delete node in tdr thread2, thread3, 4 … (using mutex or
              atomic variable)<o:p></o:p></li>
            <li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Re-insert
              back bailing job, and meanwhile use semaphore in each tdr
              thread to keep the sequence as expected and ensure each
              job is in the mirror list when do resubmit step.<o:p></o:p></li>
          </ol>
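>
> (A rough sketch of option 2 - purely illustrative, the semaphore name
> is made up:)
>
> 	static DEFINE_SEMAPHORE(tdr_sem);	/* binary, count = 1 */
>
> 	/* in drm_sched_job_timedout(): serialize the whole TDR body, so
> 	 * a bailing thread has re-inserted its job before the next TDR
> 	 * thread (or the resubmit step) walks the mirror list */
> 	down(&tdr_sem);
> 	/* ... delete node, call ops->timedout_job(), re-insert the job
> 	 * if the driver returned DRM_GPU_SCHED_STAT_BAILING ... */
> 	up(&tdr_sem);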
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">For Option1, logic is simpler and we need
            only one global atomic variable:<o:p></o:p></p>
          <p class="MsoNormal">What do you think about this plan?<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">Option1 should look like the following
            logic:<o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">+static atomic_t in_reset;            
            //a global atomic var for synchronization<o:p></o:p></p>
          <p class="MsoNormal">static void drm_sched_process_job(struct
            dma_fence *f, struct dma_fence_cb *cb);<o:p></o:p></p>
          <p class="MsoNormal"><o:p></o:p></p>
          <p class="MsoNormal"> /**<o:p></o:p></p>
          <p class="MsoNormal">@@ -295,6 +296,12 @@ static void
            drm_sched_job_timedout(struct work_struct *work)<o:p></o:p></p>
          <p class="MsoNormal">                 *
            drm_sched_cleanup_jobs. It will be reinserted back after
            sched->thread<o:p></o:p></p>
          <p class="MsoNormal">                 * is parked at which
            point it's safe.<o:p></o:p></p>
          <p class="MsoNormal">                 */<o:p></o:p></p>
          <p class="MsoNormal">+               if
            (atomic_cmpxchg(&in_reset, 0, 1) != 0) {  //skip delete
            node if it’s thead1,2,3,….<o:p></o:p></p>
          <p class="MsoNormal">+                      
            spin_unlock(&sched->job_list_lock);<o:p></o:p></p>
          <p class="MsoNormal">+                      
            drm_sched_start_timeout(sched);<o:p></o:p></p>
          <p class="MsoNormal">+                       return;<o:p></o:p></p>
          <p class="MsoNormal">+               }<o:p></o:p></p>
          <p class="MsoNormal">+<o:p></o:p></p>
          <p class="MsoNormal">               
            list_del_init(&job->node);<o:p></o:p></p>
          <p class="MsoNormal">               
            spin_unlock(&sched->job_list_lock);<o:p></o:p></p>
          <p class="MsoNormal"><o:p></o:p></p>
          <p class="MsoNormal">@@ -320,6 +327,7 @@ static void
            drm_sched_job_timedout(struct work_struct *work)<o:p></o:p></p>
          <p class="MsoNormal">       
            spin_lock(&sched->job_list_lock);<o:p></o:p></p>
          <p class="MsoNormal">        drm_sched_start_timeout(sched);<o:p></o:p></p>
          <p class="MsoNormal">       
            spin_unlock(&sched->job_list_lock);<o:p></o:p></p>
          <p class="MsoNormal">+       atomic_set(&in_reset, 0);
            //reset in_reset when the first thread finished tdr<o:p></o:p></p>
          <p class="MsoNormal">}</p>
Technically it looks like it should work, since you don't access the
job pointer any longer, so there is no risk that, if it signaled, it
will be freed by drm_sched_get_cleanup_job. But you can't just use one
global variable, and bail from TDR based on it, when different drivers
run their TDR threads in parallel - and even for amdgpu, when devices
are in different XGMI hives, or for 2 independent devices in a non-XGMI
setup. Some kind of GPU reset group structure should be defined at the
drm_scheduler level, and this variable would belong to it.
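
Just to illustrate what I mean (the struct and field names below are
made up, this is not existing drm_scheduler API):

struct drm_sched_reset_domain {
	atomic_t in_reset;	/* claimed by the first TDR thread of the group */
};

struct drm_gpu_scheduler {
	...
	/* all schedulers whose TDR must be serialized against each
	 * other (all rings of one device, or of one XGMI hive) point
	 * at the same domain */
	struct drm_sched_reset_domain *reset_domain;
};

Then drm_sched_job_timedout() would do
atomic_cmpxchg(&sched->reset_domain->in_reset, 0, 1) instead of testing
one global, so unrelated devices and drivers never bail out because of
each other.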

P.S. I wonder why we can't just ref-count the job, so that even if
drm_sched_get_cleanup_job deleted it before we had a chance to stop the
scheduler thread, we wouldn't crash. That would avoid all this dance
with deletion and re-insertion.
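
Roughly like this - again only a sketch, drm_sched_job has no refcount
today, so the kref field and release helper are hypothetical:

struct drm_sched_job {
	...
	struct kref refcount;	/* hypothetical: pins the job across TDR */
};

static void drm_sched_job_release(struct kref *ref)
{
	struct drm_sched_job *job =
		container_of(ref, struct drm_sched_job, refcount);

	job->sched->ops->free_job(job);
}

/* in drm_sched_job_timedout(): pin the job before dropping the lock */
kref_get(&job->refcount);
spin_unlock(&sched->job_list_lock);

job->sched->ops->timedout_job(job);

/* a concurrent drm_sched_get_cleanup_job() would also only kref_put(),
 * so the job can never be freed under us */
kref_put(&job->refcount, drm_sched_job_release);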

Andrey
    <blockquote type="cite" cite="mid:DM5PR1201MB02041495067BD0C46A8840B0BB699@DM5PR1201MB0204.namprd12.prod.outlook.com">
      <div>
        <div class="WordSection1">
          <p class="MsoNormal"><o:p></o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <p class="MsoNormal">Thanks,<o:p></o:p></p>
          <p class="MsoNormal">Jack<o:p></o:p></p>
>
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Zhang, Jack (Jian)
> Sent: Wednesday, March 17, 2021 11:11 PM
> To: Christian König <ckoenig.leichtzumerken@gmail.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Rob Herring <robh@kernel.org>; Tomeu Vizoso <tomeu.vizoso@collabora.com>; Steven Price <steven.price@arm.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
          <p class="MsoNormal"><o:p> </o:p></p>
          <p style="margin:5.0pt"><span style="font-family:"Arial",sans-serif;color:#0078D7">[AMD
              Official Use Only - Internal Distribution Only]<o:p></o:p></span></p>
          <p class="MsoNormal"><o:p> </o:p></p>
          <div>
            <p style="margin:5.0pt"><span style="font-family:"Arial",sans-serif;color:#0078D7">[AMD
                Official Use Only - Internal Distribution Only]<o:p></o:p></span></p>
            <p class="MsoNormal"><o:p> </o:p></p>
> Hi Andrey,
>
> Good catch, I will explore this corner case and give feedback soon~
>
> Best,
> Jack
>
              <div class="MsoNormal" style="text-align:center" align="center">
                <hr width="98%" size="2" align="center">
              </div>
              <div id="divRplyFwdMsg">
                <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> Grodzovsky, Andrey <<a href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true">Andrey.Grodzovsky@amd.com</a>><br>
                    <b>Sent:</b> Wednesday, March 17, 2021 10:50:59 PM<br>
                    <b>To:</b> Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>>;
                    Zhang, Jack (Jian) <<a href="mailto:Jack.Zhang1@amd.com" moz-do-not-send="true">Jack.Zhang1@amd.com</a>>;
                    <a href="mailto:dri-devel@lists.freedesktop.org" moz-do-not-send="true">dri-devel@lists.freedesktop.org</a>
                    <<a href="mailto:dri-devel@lists.freedesktop.org" moz-do-not-send="true">dri-devel@lists.freedesktop.org</a>>;
                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                    <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>>;
                    Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                    Liu, Monk <<a href="mailto:Monk.Liu@amd.com" moz-do-not-send="true">Monk.Liu@amd.com</a>>;
                    Deng, Emily <<a href="mailto:Emily.Deng@amd.com" moz-do-not-send="true">Emily.Deng@amd.com</a>>;
                    Rob Herring <<a href="mailto:robh@kernel.org" moz-do-not-send="true">robh@kernel.org</a>>;
                    Tomeu Vizoso <<a href="mailto:tomeu.vizoso@collabora.com" moz-do-not-send="true">tomeu.vizoso@collabora.com</a>>;
                    Steven Price <<a href="mailto:steven.price@arm.com" moz-do-not-send="true">steven.price@arm.com</a>><br>
                    <b>Subject:</b> Re: [PATCH v3] drm/scheduler
                    re-insert Bailing job to avoid memleak</span>
                  <o:p></o:p></p>
                <div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                </div>
              </div>
> I actually have a race condition concern here - see below -
>
> On 2021-03-17 3:43 a.m., Christian König wrote:
> > I was hoping Andrey would take a look since I'm really busy with
> > other work right now.
> >
> > Regards,
> > Christian.
> >
> > On 17.03.21 at 07:46, Zhang, Jack (Jian) wrote:
> >> Hi Andrey, Christian and team,
> >>
> >> I didn't receive the reviewers' messages from the panfrost driver
> >> maintainers for several days.
> >> Since this patch is urgent for my current working project,
> >> would you please help to give some review ideas?
> >>
> >> Many thanks,
> >> Jack
> >> -----Original Message-----
> >> From: Zhang, Jack (Jian)
> >> Sent: Tuesday, March 16, 2021 3:20 PM
> >> To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> >> Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Deng,
> >> Emily <Emily.Deng@amd.com>; Rob Herring <robh@kernel.org>; Tomeu
> >> Vizoso <tomeu.vizoso@collabora.com>; Steven Price <steven.price@arm.com>
> >> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
> >> memleak
> >>
> >> [AMD Public Use]
> >>
> >> Ping
> >>
> >> -----Original Message-----
> >> From: Zhang, Jack (Jian)
> >> Sent: Monday, March 15, 2021 1:24 PM
> >> To: Jack Zhang <Jack.Zhang1@amd.com>;
> >> dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> >> Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Deng,
> >> Emily <Emily.Deng@amd.com>; Rob Herring <robh@kernel.org>; Tomeu
> >> Vizoso <tomeu.vizoso@collabora.com>; Steven Price <steven.price@arm.com>
> >> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
> >> memleak
> >>
> >> [AMD Public Use]
> >>
> >> Hi Rob, Tomeu, Steven,
> >>
> >> Would you please help to review this patch for the panfrost driver?
> >>
> >> Thanks,
> >> Jack Zhang
> >>
> >> -----Original Message-----
> >> From: Jack Zhang <Jack.Zhang1@amd.com>
> >> Sent: Monday, March 15, 2021 1:21 PM
> >> To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> >> Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Deng,
> >> Emily <Emily.Deng@amd.com>
> >> Cc: Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
> >> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
> >>
> >> Re-insert bailing jobs to avoid a memory leak.
> >>
> >> V2: move re-insert step to drm/scheduler logic
> >> V3: add panfrost's return value for bailing jobs in case it hits the
> >> memleak issue.
> >>
> >> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
> >>  drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
> >>  drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
> >>  include/drm/gpu_scheduler.h                | 1 +
> >>  5 files changed, 19 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index 79b9cc73763f..86463b0f936e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>  			job ? job->base.id : -1);
> >>
> >>  		/* even we skipped this reset, still need to set the job to guilty */
> >> -		if (job)
> >> +		if (job) {
> >>  			drm_sched_increase_karma(&job->base);
> >> +			r = DRM_GPU_SCHED_STAT_BAILING;
> >> +		}
> >>  		goto skip_recovery;
> >>  	}
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> index 759b34799221..41390bdacd9e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>  	struct amdgpu_job *job = to_amdgpu_job(s_job);
> >>  	struct amdgpu_task_info ti;
> >>  	struct amdgpu_device *adev = ring->adev;
> >> +	int ret;
> >>
> >>  	memset(&ti, 0, sizeof(struct amdgpu_task_info));
> >>
> >> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>  		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
> >>
> >>  	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> >> -		amdgpu_device_gpu_recover(ring->adev, job);
> >> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> >> +		ret = amdgpu_device_gpu_recover(ring->adev, job);
> >> +		if (ret == DRM_GPU_SCHED_STAT_BAILING)
> >> +			return DRM_GPU_SCHED_STAT_BAILING;
> >> +		else
> >> +			return DRM_GPU_SCHED_STAT_NOMINAL;
> >>  	} else {
> >>  		drm_sched_suspend_timeout(&ring->sched);
> >>  		if (amdgpu_sriov_vf(adev))
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> index 6003cfeb1322..e2cb4f32dae1 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
> >>  	 * spurious. Bail out.
> >>  	 */
> >>  	if (dma_fence_is_signaled(job->done_fence))
> >> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> >> +		return DRM_GPU_SCHED_STAT_BAILING;
> >>
> >>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
> >>  		js,
> >> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
> >>
> >>  	/* Scheduler is already stopped, nothing to do. */
> >>  	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> >> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> >> +		return DRM_GPU_SCHED_STAT_BAILING;
> >>
> >>  	/* Schedule a reset if there's no reset in progress. */
> >>  	if (!atomic_xchg(&pfdev->reset.pending, 1))
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> >> b/drivers/gpu/drm/scheduler/sched_main.c
> >> index 92d8de24d0a1..a44f621fb5c4 100644
> >> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>  {
> >>  	struct drm_gpu_scheduler *sched;
> >>  	struct drm_sched_job *job;
> >> +	int ret;
> >>
> >>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >>
> >> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>  		list_del_init(&job->list);
> >>  		spin_unlock(&sched->job_list_lock);
> >>
> >> -		job->sched->ops->timedout_job(job);
> >> +		ret = job->sched->ops->timedout_job(job);
> >>
> >> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> >> +			spin_lock(&sched->job_list_lock);
> >> +			list_add(&job->node, &sched->ring_mirror_list);
> >> +			spin_unlock(&sched->job_list_lock);
> >> +		}
>
> At this point we don't hold the GPU reset locks anymore, so we could
> be racing against another TDR thread from another scheduler ring of
> the same device, or from another XGMI hive member. The other thread
> might be in the middle of a lockless iteration of the mirror list
> (drm_sched_stop, drm_sched_start and drm_sched_resubmit), so taking
> job_list_lock will not help. Looks like it's required to take all the
> GPU reset locks here.
>
> Andrey
>
> >>  		/*
> >>  		 * Guilty job did complete and hence needs to be manually removed
> >>  		 * See drm_sched_stop doc.
> >> diff --git a/include/drm/gpu_scheduler.h
> >> b/include/drm/gpu_scheduler.h
> >> index 4ea8606d91fe..8093ac2427ef 100644
> >> --- a/include/drm/gpu_scheduler.h
> >> +++ b/include/drm/gpu_scheduler.h
> >> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat {
> >>  	DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
> >>  	DRM_GPU_SCHED_STAT_NOMINAL,
> >>  	DRM_GPU_SCHED_STAT_ENODEV,
> >> +	DRM_GPU_SCHED_STAT_BAILING,
> >>  };
> >>
> >>  /**
> >> --
> >> 2.25.1
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>
> >