<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 11/13/19 9:20 AM, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:0858ea1b-d205-006d-a6ec-24b78b33e45b@amd.com">
      
      <div class="moz-cite-prefix">Another more fundamental question:
        Could we get rid of the timeout job at all?<br>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>There is other stuff there besides picking the first unfinished
      job, which is common to all the drivers - such as freeing the
      guilty job once it has signaled and rearming the timeout work
      timer.</p>
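    <p>For reference, the driver-independent part currently looks
      roughly like this (a simplified sketch from memory, not a
      verbatim copy - details such as the free_guilty handling may
      differ):</p>
    <pre>static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;
	unsigned long flags;

	/* Pick the first unfinished job - common for all drivers */
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job)
		/* Hand it to the driver-specific handler */
		job->sched->ops->timedout_job(job);

	/* Free the guilty job if it signaled during the reset */
	if (sched->free_guilty) {
		job->sched->ops->free_job(job);
		sched->free_guilty = false;
	}

	/* Rearm the timeout work timer - also common */
	spin_lock_irqsave(&sched->job_list_lock, flags);
	drm_sched_start_timeout(sched);
	spin_unlock_irqrestore(&sched->job_list_lock, flags);
}</pre>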
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:0858ea1b-d205-006d-a6ec-24b78b33e45b@amd.com">
      <div class="moz-cite-prefix"> <br>
        I mean we used to give this as a parameter to the scheduler
        callback because we had the timeout worker in the job, but that
        is no longer the case.<br>
        <br>
        E.g. in drm_sched_job_timedout() we do the following:<br>
        <blockquote type="cite">        job =
          list_first_entry_or_null(&sched->ring_mirror_list,<br>
                                                 struct drm_sched_job,
          node);<br>
        </blockquote>
        <br>
        Why don't we just remove that here and only get the first job
        after we have stopped the scheduler?<br>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>Should be OK, since the extra check for __kthread_should_park
      in drm_sched_cleanup_jobs will protect us in this case from a
      wakeup of the sched thread and execution of
      drm_sched_cleanup_jobs after we have already parked it. The
      problem here is that we need the drm_sched_job to access the
      private data of each client driver (see amdgpu_job_timedout for
      example). What about, instead of peeking at the job, actually
      removing it from ring_mirror_list right there, then going ahead
      with it through the reset routine? If it is signaled in the
      meanwhile, great - release it; otherwise put it back into
      ring_mirror_list in drm_sched_resubmit_jobs.</p>
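    <p>In pseudo code the idea would be something like this (just a
      sketch of the suggestion, untested, and the exact layout is
      hypothetical):</p>
    <pre>static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;
	unsigned long flags;

	spin_lock_irqsave(&sched->job_list_lock, flags);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job)
		/*
		 * Take the job off ring_mirror_list right here so a
		 * concurrent drm_sched_cleanup_jobs can't free it under us.
		 */
		list_del_init(&job->node);
	spin_unlock_irqrestore(&sched->job_list_lock, flags);

	if (job)
		/*
		 * Go through the reset routine with it; drm_sched_resubmit_jobs
		 * then either releases the job if it signaled in the meanwhile
		 * or inserts it back into ring_mirror_list.
		 */
		job->sched->ops->timedout_job(job);

	spin_lock_irqsave(&sched->job_list_lock, flags);
	drm_sched_start_timeout(sched);
	spin_unlock_irqrestore(&sched->job_list_lock, flags);
}</pre>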
    <p>Andrey</p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:0858ea1b-d205-006d-a6ec-24b78b33e45b@amd.com">
      <div class="moz-cite-prefix"> <br>
        Regards,<br>
        Christian.<br>
        <br>
        On 13.11.19 at 15:12, Andrey Grodzovsky wrote:<br>
      </div>
      <blockquote type="cite" cite="mid:40bb3114-d996-10af-3140-51a4f7c212d6@amd.com">
        <p>This is why I asked for a trace with the timer enabled, but
          since there is a finite number of places where we touch the
          timer, Emily can just put prints there. Also, I wonder whether
          this temp fix helps her with the issue or not.</p>
        <p>Andrey<br>
        </p>
        <div class="moz-cite-prefix">On 11/13/19 2:36 AM, Christian
          König wrote:<br>
        </div>
        <blockquote type="cite" cite="mid:33ffe2f1-32b6-a238-3752-cee67cd9e141@gmail.com">
          <div class="moz-cite-prefix">The question is where do we rearm
            the timer for this problem to occur?<br>
            <br>
            Regards,<br>
            Christian. <br>
            <br>
            On 12.11.19 at 20:21, Andrey Grodzovsky wrote:<br>
          </div>
          <blockquote type="cite" cite="mid:9269d447-ed32-81f7-ab43-cb16139096e2@amd.com">
            <p>I was able to reproduce the crash by using the attached
              simulate_crash.patch - waiting on the guilty job to signal
              in the reset work and artificially rearming the timeout
              timer just before the check for
              !cancel_delayed_work(&sched->work_tdr) in
              drm_sched_cleanup_jobs - crash log attached in crash.log.
              This, I think, confirms the theory I described earlier in
              this thread.</p>
            <p>basic_fix.patch handles this by testing whether another
              timer is already armed on this scheduler or whether a
              timeout work is executing right now (see the documentation
              for work_busy). Obviously this is not a full solution, as
              it will not protect against races if, for example, there
              is immediate work scheduling such as in drm_sched_fault -
              so we probably need to account for this by making
              drm_sched_cleanup_jobs (at least the part where it
              iterates the ring mirror list and frees jobs) and GPU
              reset really mutually exclusive, which they are not now.</p>
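            <p>The check in basic_fix.patch is essentially of this
              shape (a sketch of the idea, not the verbatim patch):</p>
            <pre>/* In drm_sched_cleanup_jobs: skip freeing jobs while a timeout
 * work is either pending or already executing on this scheduler.
 */
if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
    (work_busy(&sched->work_tdr.work) &
     (WORK_BUSY_PENDING | WORK_BUSY_RUNNING)))
	return;</pre>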
            <p><font size="2" face="Calibri"><span style="font-size:11pt;">Andrey<br>
                </span></font></p>
            <p><br>
            </p>
            <div class="moz-cite-prefix">On 11/11/19 4:11 PM, Christian
              König wrote:<br>
            </div>
            <blockquote type="cite" cite="mid:2f035f22-4057-dd9e-27ef-0f5612113e29@amd.com">
              <div class="moz-cite-prefix">Hi Emily,<br>
                <br>
                you need to print which scheduler instance is freeing
                the jobs and which one is triggering the reset. The TID
                and PID are completely meaningless here since we are
                called from different worker threads and the TID/PID can
                change on each call.<br>
                <br>
                Apart from that I will look into this a bit deeper when
                I have time.<br>
                <br>
                Regards,<br>
                Christian.<br>
                <br>
                On 12.11.19 at 07:02, Deng, Emily wrote:<br>
              </div>
              <blockquote type="cite" cite="mid:MN2PR12MB29750EDB35E27DF9CD63152C8F770@MN2PR12MB2975.namprd12.prod.outlook.com">
                <meta name="Generator" content="Microsoft Exchange
                  Server">
                <!-- converted from rtf -->
                <style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
                <font size="2" face="Calibri"><span style="font-size:11pt;">
                    <div>Hi Christian,</div>
                    <div>    I added the following print in the
                      function drm_sched_cleanup_jobs. The log shows
                      that using only cancel_delayed_work cannot avoid
                      freeing a job while the sched is in reset. But I
                      don't know exactly where the driver goes wrong.
                      Do you have any suggestions about this?</div>
                    <div><font face="Times New Roman"> </font></div>
                    <div>+      
                      printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
                      pid:%lu\n", current->tgid, current->pid);</div>
                    <div> </div>
                    <div>        /*</div>
                    <div>         * Don't destroy jobs while the timeout
                      worker is running  OR thread</div>
                    <div>         * is being parked and hence assumed to
                      not touch ring_mirror_list</div>
                    <div>         */</div>
                    <div>         if ((sched->timeout !=
                      MAX_SCHEDULE_TIMEOUT &&</div>
                    <div>           
                      !cancel_delayed_work(&sched->work_tdr)))</div>
                    <div>                return;</div>
                    <div>+      
                      printk("Emily:drm_sched_cleanup_jobs,tid:%lu,
                      pid:%lu\n", current->tgid, current->pid);</div>
                    <div><font face="Times New Roman"> </font></div>
                    <div><font face="Times New Roman"> </font></div>
                    <div><font face="Calibri">Best wishes</font></div>
                    <div><font face="Calibri">Emily Deng</font></div>
                    <div><font face="Times New Roman"> </font></div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11380.695091]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11380.695104]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11380.695105]
                      Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11380.695107]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11380.695107]
                      Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.222954] [drm:amdgpu_job_timedout [amdgpu]]
                      *ERROR* ring sdma0 timeout, signaled seq=78585,
                      emitted seq=78587</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.224275] [drm:amdgpu_job_timedout [amdgpu]]
                      *ERROR* Process information: process  pid 0
                      thread  pid 0, s_job:00000000fe75ab36,tid=15603,
                      pid=15603</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225413] amdgpu 0000:00:08.0: <font color="#00B050">GPU reset begin</font>!</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225417]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225425]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225425]
                      Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225428] Emily:<font color="#00B050">amdgpu_job_free_cb</font>,Process
                      information: process  pid 0 thread  pid 0,
                      s_job:00000000fe75ab36, tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225429]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225430]
                      Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225473]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2253,
                      pid:2253</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225486]
                      Emily:drm_sched_cleanup_jobs:begin,tid:2262,
                      pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225489]
                      Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
                    <div>Nov 12 12:58:20
                      ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
                      [11381.225494] Emily:amdgpu_job_free_cb,Process
                      information: process  pid 0 thread  pid 0,
                      s_job:00000000f086ec84, tid:2262, pid:2262</div>
                    <div>>-----Original Message-----</div>
                    <div>>From: Grodzovsky, Andrey <a class="moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true"><Andrey.Grodzovsky@amd.com></a></div>
                    <div>>Sent: Tuesday, November 12, 2019 11:28 AM</div>
                    <div>>To: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
                      Deng, Emily</div>
                    <div>><a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
                      <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
                    <div>>Subject: Re: [PATCH] drm/amdgpu: Fix the
                      null pointer issue for tdr</div>
                    <div>></div>
                    <div>>Thinking more about this claim - we assume
                      here that if cancel_delayed_work</div>
                    <div>>returned true it guarantees that timeout
                      work is not running but, it merely</div>
                    <div>>means there was a pending timeout work
                      which was removed from the</div>
                    <div>>workqueue before its timer elapsed and so
                      it didn't have a chance to be</div>
                    <div>>dequeued and executed, it doesn't cover
                      already executing work. So there is a</div>
                    <div>>possibility where while timeout work
                      started executing another timeout work</div>
                    <div>>already got enqueued (maybe through earlier
                      cleanup jobs or through</div>
                    <div>>drm_sched_fault) and if at this point
                      another drm_sched_cleanup_jobs runs</div>
                    <div>>cancel_delayed_work(&sched->work_tdr)
                      will return true even while there is a</div>
                    <div>>timeout job in progress.</div>
                    <div>>Unfortunately we cannot change
                      cancel_delayed_work to</div>
                    <div>>cancel_delayed_work_sync to flush the
                      timeout work as timeout work itself</div>
                    <div>>waits for the scheduler thread to be parked
                      again when calling park_thread.</div>
                    <div>></div>
                    <div>>Andrey</div>
                    <div>></div>
                    <div>>________________________________________</div>
                    <div>>From: amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true"><amd-gfx-bounces@lists.freedesktop.org></a>
                      on behalf of</div>
                    <div>>Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
                    <div>>Sent: 08 November 2019 05:35:18</div>
                    <div>>To: Deng, Emily; <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
                    <div>>Subject: Re: [PATCH] drm/amdgpu: Fix the
                      null pointer issue for tdr</div>
                    <div>></div>
                    <div>>Hi Emily,</div>
                    <div>></div>
                    <div>>exactly that can't happen. See here:</div>
                    <div>></div>
                    <div>>>         /* Don't destroy jobs while
                      the timeout worker is running */</div>
                    <div>>>         if (sched->timeout !=
                      MAX_SCHEDULE_TIMEOUT &&</div>
                    <div>>>            
                      !cancel_delayed_work(&sched->work_tdr))</div>
                    <div>>>                 return NULL;</div>
                    <div>></div>
                    <div>>We never free jobs while the timeout
                      worker is running to prevent exactly</div>
                    <div>>that issue.</div>
                    <div>></div>
                    <div>>Regards,</div>
                    <div>>Christian.</div>
                    <div>></div>
                    <div>>On 08.11.19 at 11:32, Deng, Emily wrote:</div>
                    <div>>> Hi Christian,</div>
                    <div>>>       The drm_sched_job_timedout->
                      amdgpu_job_timedout calls</div>
                    <div>>amdgpu_device_gpu_recover. I mean the main
                      scheduler frees the jobs while</div>
                    <div>>in amdgpu_device_gpu_recover, and before
                      calling drm_sched_stop.</div>
                    <div>>></div>
                    <div>>> Best wishes</div>
                    <div>>> Emily Deng</div>
                    <div>>></div>
                    <div>>></div>
                    <div>>></div>
                    <div>>>> -----Original Message-----</div>
                    <div>>>> From: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
                    <div>>>> Sent: Friday, November 8, 2019
                      6:26 PM</div>
                    <div>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
                      <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
                    <div>>>> Subject: Re: [PATCH] drm/amdgpu:
                      Fix the null pointer issue for tdr</div>
                    <div>>>></div>
                    <div>>>> Hi Emily,</div>
                    <div>>>></div>
                    <div>>>> well who is calling
                      amdgpu_device_gpu_recover() in this case?</div>
                    <div>>>></div>
                    <div>>>> When it's not the scheduler we
                      shouldn't have a guilty job in the first place.</div>
                    <div>>>></div>
                    <div>>>> Regards,</div>
                    <div>>>> Christian.</div>
                    <div>>>></div>
                    <div>>>> On 08.11.19 at 11:22, Deng,
                      Emily wrote:</div>
                    <div>>>>> Hi Christian,</div>
                    <div>>>>>        No, I am on the new
                      branch and also have the patch. Even if they</div>
                    <div>>>>> are freed by the</div>
                    <div>>>> main scheduler, how could we avoid the
                      main scheduler freeing jobs while</div>
                    <div>>>> entering the function
                      amdgpu_device_gpu_recover?</div>
                    <div>>>>> Best wishes</div>
                    <div>>>>> Emily Deng</div>
                    <div>>>>></div>
                    <div>>>>></div>
                    <div>>>>></div>
                    <div>>>>>> -----Original Message-----</div>
                    <div>>>>>> From: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
                    <div>>>>>> Sent: Friday, November 8,
                      2019 6:15 PM</div>
                    <div>>>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
                      amd-</div>
                    <div>><a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
                    <div>>>>>> Subject: Re: [PATCH]
                      drm/amdgpu: Fix the null pointer issue for tdr</div>
                    <div>>>>>></div>
                    <div>>>>>> Hi Emily,</div>
                    <div>>>>>></div>
                    <div>>>>>> in this case you are on an
                      old code branch.</div>
                    <div>>>>>></div>
                    <div>>>>>> Jobs are freed now by the
                      main scheduler thread and only if no</div>
                    <div>>>>>> timeout handler is
                      running.</div>
                    <div>>>>>></div>
                    <div>>>>>> See this patch here:</div>
                    <div>>>>>>> commit
                      5918045c4ed492fb5813f980dcf89a90fefd0a4e</div>
                    <div>>>>>>> Author: Christian
                      König <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com" moz-do-not-send="true"><christian.koenig@amd.com></a></div>
                    <div>>>>>>> Date:   Thu Apr 18
                      11:00:21 2019 -0400</div>
                    <div>>>>>>></div>
                    <div>>>>>>>       drm/scheduler:
                      rework job destruction</div>
                    <div>>>>>> Regards,</div>
                    <div>>>>>> Christian.</div>
                    <div>>>>>></div>
                    <div>>>>>> On 08.11.19 at 11:11,
                      Deng, Emily wrote:</div>
                    <div>>>>>>> Hi Christian,</div>
                    <div>>>>>>>         Please refer
                      to follow log, when it enter to</div>
                    <div>>>>>>>
                      amdgpu_device_gpu_recover</div>
                    <div>>>>>> function, the bad job
                      000000005086879e is freeing in function</div>
                    <div>>>>>> amdgpu_job_free_cb  at the
                      same time, because of the hardware fence</div>
                    <div>>>> signal.</div>
                    <div>>>>>> But
                      amdgpu_device_gpu_recover goes faster, at this
                      case, the</div>
                    <div>>>>>> s_fence is already freed,
                      but job is not freed in time. Then this issue</div>
                    <div>>occurs.</div>
                    <div>>>>>>> [  449.792189]
                      [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring</div>
                    <div>>>> sdma0</div>
                    <div>>>>>>> timeout, signaled
                      seq=2481, emitted seq=2483 [  449.793202]</div>
                    <div>>>>>>>
                      [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
                      information:</div>
                    <div>>>>>> process  pid 0 thread  pid
                      0, s_job:000000005086879e [  449.794163]</div>
                    <div>>>>>> amdgpu</div>
                    <div>>>>>> 0000:00:08.0: GPU reset
                      begin!</div>
                    <div>>>>>>> [  449.794175]
                      Emily:amdgpu_job_free_cb,Process information:</div>
                    <div>>>>>>> process pid 0 thread 
                      pid 0, s_job:000000005086879e [  449.794221]</div>
                    <div>>>>>>>
                      Emily:amdgpu_job_free_cb,Process information:
                      process  pid 0</div>
                    <div>>>>>>> thread pid 0,
                      s_job:0000000066eb74ab [  449.794222]</div>
                    <div>>>>>>>
                      Emily:amdgpu_job_free_cb,Process information:
                      process  pid 0</div>
                    <div>>>>>>> thread pid 0,
                      s_job:00000000d4438ad9 [  449.794255]</div>
                    <div>>>>>>>
                      Emily:amdgpu_job_free_cb,Process information:
                      process  pid 0</div>
                    <div>>>>>>> thread pid 0,
                      s_job:00000000b6d69c65 [  449.794257]</div>
                    <div>>>>>>>
                      Emily:amdgpu_job_free_cb,Process information:
                      process  pid 0</div>
                    <div>>>>>>> thread pid 0,</div>
                    <div>>>>>> s_job:00000000ea85e922 [ 
                      449.794287]</div>
                    <div>>>>>>
                      Emily:amdgpu_job_free_cb,Process</div>
                    <div>>>>>> information: process  pid
                      0 thread  pid 0, s_job:00000000ed3a5ac6 [</div>
                    <div>>>>>> 449.794366] BUG: unable to
                      handle kernel NULL pointer dereference</div>
                    <div>>>>>> at</div>
                    <div>>>>>> 00000000000000c0 [ 
                      449.800818] PGD 0 P4D 0 [  449.801040] Oops:</div>
                    <div>>>>>> 0000 [#1] SMP PTI</div>
                    <div>>>>>>> [  449.801338] CPU: 3
                      PID: 55 Comm: kworker/3:1 Tainted: G           OE</div>
                    <div>>>>>> 4.18.0-15-generic
                      #16~18.04.1-Ubuntu</div>
                    <div>>>>>>> [  449.802157]
                      Hardware name: QEMU Standard PC (i440FX + PIIX,</div>
                    <div>>>>>>> 1996), BIOS
                      Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  449.802944]</div>
                    <div>>>>>>> Workqueue: events
                      drm_sched_job_timedout [amd_sched] [</div>
                    <div>>>>>>> 449.803488]</div>
                    <div>>>> RIP:</div>
                    <div>>>>>>
                      0010:amdgpu_device_gpu_recover+0x1da/0xb60
                      [amdgpu]</div>
                    <div>>>>>>> [  449.804020] Code:
                      dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff</div>
                    <div>>>>>>> ff</div>
                    <div>>>>>>> 45 85 e4 0f</div>
                    <div>>>>>> 85 a1 00 00 00 48 8b 45 b0
                      48 85 c0 0f 84 60 01 00 00 48 8b 40 10</div>
                    <div>>>>>> <48> 8b</div>
                    <div>>>> 98</div>
                    <div>>>>>> c0 00         00 00 48 85
                      db 0f 84 4c 01 00 00 48 8b 43 48 a8 01</div>
                    <div>>>>>>> [  449.805593] RSP:
                      0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [</div>
                    <div>>>>>>> 449.806032] RAX:
                      0000000000000000 RBX: 0000000000000000 RCX:</div>
                    <div>>>>>>> 0000000000000000 [ 
                      449.806625] RDX: ffffb4c7c08f5ac0 RSI:</div>
                    <div>>>>>>> 0000000fffffffe0 RDI:
                      0000000000000246 [  449.807224] RBP:</div>
                    <div>>>>>>> ffffb4c7c08f7de0 R08:
                      00000068b9d54000 R09: 0000000000000000 [</div>
                    <div>>>>>>> 449.807818] R10:
                      0000000000000000 R11: 0000000000000148 R12:</div>
                    <div>>>>>>> 0000000000000000 [ 
                      449.808411] R13: ffffb4c7c08f7da0 R14:</div>
                    <div>>>>>>> ffff8d82b8525d40 R15:
                      ffff8d82b8525d40 [  449.809004] FS:</div>
                    <div>>>>>>> 0000000000000000(0000)
                      GS:ffff8d82bfd80000(0000)</div>
                    <div>>>>>>> knlGS:0000000000000000
                      [  449.809674] CS:  0010 DS: 0000 ES: 0000</div>
                    <div>>CR0:</div>
                    <div>>>>>>> 0000000080050033 [ 
                      449.810153] CR2: 00000000000000c0 CR3:</div>
                    <div>>>>>>> 000000003cc0a001 CR4:
                      00000000003606e0 [  449.810747] DR0:</div>
                    <div>>>>>> 0000000000000000 DR1:
                      0000000000000000 DR2: 0000000000000000 [</div>
                    <div>>>>>> 449.811344] DR3:
                      0000000000000000 DR6: 00000000fffe0ff0 DR7:</div>
                    <div>>>>>> 0000000000000400 [ 
                      449.811937] Call Trace:</div>
                    <div>>>>>>> [  449.812206] 
                      amdgpu_job_timedout+0x114/0x140 [amdgpu] [</div>
                    <div>>>>>>> 449.812635] 
                      drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
                    <div>>>>>>> 449.813139]  ?
                      amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [</div>
                    <div>>>>>>> 449.813609]  ?
                      drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
                    <div>>>>>>> 449.814077] 
                      process_one_work+0x1fd/0x3f0 [  449.814417]</div>
                    <div>>>>>>>
                      worker_thread+0x34/0x410 [  449.814728] 
                      kthread+0x121/0x140 [</div>
                    <div>>>>>>> 449.815004]  ?
                      process_one_work+0x3f0/0x3f0 [  449.815374]  ?</div>
                    <div>>>>>>>
                      kthread_create_worker_on_cpu+0x70/0x70</div>
                    <div>>>>>>> [  449.815799] 
                      ret_from_fork+0x35/0x40</div>
                    <div>>>>>>></div>
                    <div>>>>>>>> -----Original
                      Message-----</div>
                    <div>>>>>>>> From: Koenig,
                      Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
                    <div>>>>>>>> Sent: Friday,
                      November 8, 2019 5:43 PM</div>
                    <div>>>>>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
                      amd-</div>
                    <div>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
                    <div>>>>>>>> Subject: Re:
                      [PATCH] drm/amdgpu: Fix the null pointer issue for</div>
                    <div>>>>>>>> tdr</div>
                    <div>>>>>>>></div>
                    <div>>>>>>>> On 08.11.19 at
                      10:39, Deng, Emily wrote:</div>
                    <div>>>>>>>>> Sorry, please
                      take your time.</div>
                    <div>>>>>>>> Have you seen my
                      other response a bit below?</div>
                    <div>>>>>>>></div>
                    <div>>>>>>>> I can't follow how
                      it would be possible for job->s_fence to be</div>
                    <div>>>>>>>> NULL without the
                      job also being freed.</div>
                    <div>>>>>>>></div>
                    <div>>>>>>>> So it looks like
                      this patch is just papering over some bigger
                      issues.</div>
                    <div>>>>>>>></div>
                    <div>>>>>>>> Regards,</div>
                    <div>>>>>>>> Christian.</div>
                    <div>>>>>>>></div>
                    <div>>>>>>>>> Best wishes</div>
                    <div>>>>>>>>> Emily Deng</div>
                    <div>>>>>>>>></div>
                    <div>>>>>>>>></div>
                    <div>>>>>>>>></div>
                    <div>>>>>>>>>>
                      -----Original Message-----</div>
                    <div>>>>>>>>>> From:
                      Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
                    <div>>>>>>>>>> Sent:
                      Friday, November 8, 2019 5:08 PM</div>
                    <div>>>>>>>>>> To: Deng,
                      Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
                      amd-</div>
                    <div>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
                    <div>>>>>>>>>> Subject:
                      Re: [PATCH] drm/amdgpu: Fix the null pointer issue
                      for</div>
                    <div>>>>>>>>>> tdr</div>
                    <div>>>>>>>>>></div>
                    <div>>>>>>>>>> On
                      08.11.19 at 09:52, Deng, Emily wrote:</div>
                    <div>>>>>>>>>>>
                      Ping.....</div>
                    <div>>>>>>>>>> You need
                      to give me at least enough time to wake up :)</div>
                    <div>>>>>>>>>></div>
                    <div>>>>>>>>>>> Best
                      wishes</div>
                    <div>>>>>>>>>>> Emily
                      Deng</div>
                    <div>>>>>>>>>>></div>
                    <div>>>>>>>>>>></div>
                    <div>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>
                      -----Original Message-----</div>
                    <div>>>>>>>>>>>>
                      From: amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true"><amd-gfx-bounces@lists.freedesktop.org></a>
                      On</div>
                    <div>>>> Behalf</div>
                    <div>>>>>>>>>>>> Of
                      Deng, Emily</div>
                    <div>>>>>>>>>>>>
                      Sent: Friday, November 8, 2019 10:56 AM</div>
                    <div>>>>>>>>>>>>
                      To: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
                      amd-</div>
                    <div>>>>>>>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
                    <div>>>>>>>>>>>>
                      Subject: RE: [PATCH] drm/amdgpu: Fix the null
                      pointer issue</div>
                    <div>>>>>>>>>>>>
                      for tdr</div>
                    <div>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>
                      -----Original Message-----</div>
                    <div>>>>>>>>>>>>>
                      From: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true"><ckoenig.leichtzumerken@gmail.com></a></div>
                    <div>>>>>>>>>>>>>
                      Sent: Thursday, November 7, 2019 7:28 PM</div>
                    <div>>>>>>>>>>>>>
                      To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;</div>
                    <div>>>>>>>>>>>>>
                      <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
                    <div>>>>>>>>>>>>>
                      Subject: Re: [PATCH] drm/amdgpu: Fix the null
                      pointer issue</div>
                    <div>>>>>>>>>>>>>
                      for tdr</div>
                    <div>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>
                      On 07.11.19 at 11:25, Emily Deng wrote:</div>
                    <div>>>>>>>>>>>>>>
                      When the job is already signaled, the s_fence is
                      freed.</div>
                    <div>>>>>>>>>>>>>>
                      Then it will have a null pointer in
                      amdgpu_device_gpu_recover.</div>
                    <div>>>>>>>>>>>>>
                      NAK, the s_fence is only set to NULL when the job
                      is destroyed.</div>
                    <div>>>>>>>>>>>>>
                      See drm_sched_job_cleanup().</div>
                    <div>>>>>>>>>>>> I
                      know it is set to NULL in drm_sched_job_cleanup.
                      But in one</div>
                    <div>>>>>>>>>>>>
                      case, when it enters
                      amdgpu_device_gpu_recover, it is</div>
                    <div>>>>>>>>>>>>
                      already in drm_sched_job_cleanup, and at this
                      time it will</div>
                    <div>>>>>>>>>>>> go
                      to free the</div>
                    <div>>>>>> job.</div>
                    <div>>>>>>>>>>>>
                      But amdgpu_device_gpu_recover is sometimes
                      faster. At</div>
                    <div>>>>>>>>>>>>
                      that time the job is not freed, but s_fence is
                      already NULL.</div>
                    <div>>>>>>>>>> No, that
                      case can't happen. See here:</div>
                    <div>>>>>>>>>></div>
                    <div>>>>>>>>>>>            
                      drm_sched_job_cleanup(s_job);</div>
                    <div>>>>>>>>>>></div>
                    <div>>>>>>>>>>>            
                      amdgpu_ring_priority_put(ring,
                      s_job->s_priority);</div>
                    <div>>>>>>>>>>>            
                      dma_fence_put(job->fence);</div>
                    <div>>>>>>>>>>>            
                      amdgpu_sync_free(&job->sync);</div>
                    <div>>>>>>>>>>>            
                      amdgpu_sync_free(&job->sched_sync);</div>
                    <div>>>>>>>>>>>            
                      kfree(job);</div>
                    <div>>>>>>>>>> The job
                      itself is freed up directly after freeing the
                      reference</div>
                    <div>>>>>>>>>> to the</div>
                    <div>>>>>> s_fence.</div>
                    <div>>>>>>>>>> So you are
                      just papering over a much bigger problem here.
                      This</div>
                    <div>>>>>>>>>> patch is a
                      clear NAK.</div>
                    <div>>>>>>>>>></div>
                    <div>>>>>>>>>> Regards,</div>
                    <div>>>>>>>>>> Christian.</div>
                    <div>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>
                      When you see a job without an s_fence then that
                      means the</div>
                    <div>>>>>>>>>>>>>
                      problem is somewhere else.</div>
                    <div>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>
                      Regards,</div>
                    <div>>>>>>>>>>>>>
                      Christian.</div>
                    <div>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>>
                      Signed-off-by: Emily Deng <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a></div>
                    <div>>>>>>>>>>>>>>
                      ---</div>
                    <div>>>>>>>>>>>>>>       
                      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-</div>
                    <div>>>>>>>>>>>>>>       
                      drivers/gpu/drm/scheduler/sched_main.c     | 11
                      ++++++---</div>
                    <div>>--</div>
                    <div>>>>>>>>>>>>>>       
                      2 files changed, 7 insertions(+), 6 deletions(-)</div>
                    <div>>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>>
                      diff --git
                      a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
                    <div>>>>>>>>>>>>>>
                      b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
                    <div>>>>>>>>>>>>>>
                      index e6ce949..5a8f08e 100644</div>
                    <div>>>>>>>>>>>>>>
                      --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
                    <div>>>>>>>>>>>>>>
                      +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
                    <div>>>>>>>>>>>>>>
                      @@ -4075,7 +4075,7 @@ int</div>
                    <div>>>> amdgpu_device_gpu_recover(struct</div>
                    <div>>>>>>>>>>>>>
                      amdgpu_device *adev,</div>
                    <div>>>>>>>>>>>>>>            
                      *</div>
                    <div>>>>>>>>>>>>>>            
                      * job->base holds a reference to parent fence</div>
                    <div>>>>>>>>>>>>>>            
                      */</div>
                    <div>>>>>>>>>>>>>>
                      -  if (job &&
                      job->base.s_fence->parent &&</div>
                    <div>>>>>>>>>>>>>>
                      +  if (job && job->base.s_fence
                      &&</div>
                    <div>>>>>>>>>>>>>>
                      + job->base.s_fence->parent</div>
                    <div>>>>>>>> &&</div>
                    <div>>>>>>>>>>>>>>               
dma_fence_is_signaled(job->base.s_fence->parent))</div>
                    <div>>>>>>>>>>>>>>                   
                      job_signaled = true;</div>
                    <div>>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>>
                      diff --git
                      a/drivers/gpu/drm/scheduler/sched_main.c</div>
                    <div>>>>>>>>>>>>>>
                      b/drivers/gpu/drm/scheduler/sched_main.c</div>
                    <div>>>>>>>>>>>>>>
                      index 31809ca..56cc10e 100644</div>
                    <div>>>>>>>>>>>>>>
                      --- a/drivers/gpu/drm/scheduler/sched_main.c</div>
                    <div>>>>>>>>>>>>>>
                      +++ b/drivers/gpu/drm/scheduler/sched_main.c</div>
                    <div>>>>>>>>>>>>>>
                      @@ -334,8 +334,8 @@ void</div>
                    <div>>drm_sched_increase_karma(struct</div>
                    <div>>>>>>>>>>>>>
                      drm_sched_job</div>
                    <div>>>>>>>>>>>>>>
                      *bad)</div>
                    <div>>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>>                           
                      spin_lock(&rq->lock);</div>
                    <div>>>>>>>>>>>>>>                           
                      list_for_each_entry_safe(entity,</div>
                    <div>>>>>>>>>>>>>>
                      tmp,</div>
                    <div>>>> &rq-</div>
                    <div>>>>>>>>> entities,</div>
                    <div>>>>>>>>>>>>>
                      list) {</div>
                    <div>>>>>>>>>>>>>>
                      -                          if
                      (bad->s_fence->scheduled.context</div>
                    <div>>>>>>>> ==</div>
                    <div>>>>>>>>>>>>>>
                      -                             
                      entity->fence_context) {</div>
                    <div>>>>>>>>>>>>>>
                      +                          if (bad->s_fence
                      &&</div>
                    <div>>>>>>>>>>>>>>
                      + (bad->s_fence-</div>
                    <div>>>>>>>>>>>>>>
                      scheduled.context ==</div>
                    <div>>>>>>>>>>>>>>
                      +                             
                      entity->fence_context)) {</div>
                    <div>>>>>>>>>>>>>>                                           
                      if</div>
                    <div>>>>>>>>>>>>>>
                      (atomic_read(&bad-</div>
                    <div>>>>>>>>> karma) ></div>
                    <div>>>>>>>>>>>>>>                                               
                      bad->sched-</div>
                    <div>>>>> hang_limit)</div>
                    <div>>>>>>>>>>>>>>                                                   
                      if</div>
                    <div>>>>>>>>>>>>>>
                      (entity-</div>
                    <div>>>>> guilty) @@ -376,7 +376,7 @@
                      void</div>
                    <div>>>>>>>>>>>>>>
                      drm_sched_stop(struct</div>
                    <div>>>>>>>> drm_gpu_scheduler</div>
                    <div>>>>>>>>>>>>>
                      *sched, struct drm_sched_job *bad)</div>
                    <div>>>>>>>>>>>>>>            
                      * This iteration is thread safe as sched thread</div>
                    <div>>>>>>>>>>>>>>
                      is</div>
                    <div>>>> stopped.</div>
                    <div>>>>>>>>>>>>>>            
                      */</div>
                    <div>>>>>>>>>>>>>>           
                      list_for_each_entry_safe_reverse(s_job, tmp,</div>
                    <div>>>>>>>>>>>>>>
                      &sched- ring_mirror_list, node) {</div>
                    <div>>>>>>>>>>>>>>
                      -          if (s_job->s_fence->parent
                      &&</div>
                    <div>>>>>>>>>>>>>>
                      +          if (s_job->s_fence &&
                      s_job->s_fence->parent &&</div>
                    <div>>>>>>>>>>>>>>                       
                      dma_fence_remove_callback(s_job-</div>
                    <div>>>>> s_fence-</div>
                    <div>>>>>>>>> parent,</div>
                    <div>>>>>>>>>>>>>>                                                 
                      &s_job->cb)) {</div>
                    <div>>>>>>>>>>>>>>                           
                      atomic_dec(&sched->hw_rq_count);</div>
                    <div>>>> @@ -</div>
                    <div>>>>>>>> 395,7</div>
                    <div>>>>>>>>>>>>
                      +395,8 @@ void</div>
                    <div>>>>>>>>>>>>>>
                      drm_sched_stop(struct drm_gpu_scheduler</div>
                    <div>>>>>>>>>>>>>
                      *sched, struct drm_sched_job *bad)</div>
                    <div>>>>>>>>>>>>>>                            
                      *</div>
                    <div>>>>>>>>>>>>>>                            
                      * Job is still alive so fence</div>
                    <div>>>>>>>>>>>>>>
                      refcount at</div>
                    <div>>>> least 1</div>
                    <div>>>>>>>>>>>>>>                            
                      */</div>
                    <div>>>>>>>>>>>>>>
                      -                 
                      dma_fence_wait(&s_job->s_fence->finished,</div>
                    <div>>>>>>>> false);</div>
                    <div>>>>>>>>>>>>>>
                      +                  if (s_job->s_fence)</div>
                    <div>>>>>>>>>>>>>>
                      +                         
                      dma_fence_wait(&s_job->s_fence-</div>
                    <div>>>>>>>>> finished,</div>
                    <div>>>>>>>>>>>>>
                      false);</div>
                    <div>>>>>>>>>>>>>>                           
                      /*</div>
                    <div>>>>>>>>>>>>>>                            
                      * We must keep bad job alive</div>
                    <div>>>>>>>>>>>>>>
                      for later</div>
                    <div>>>> use</div>
                    <div>>>>>>>> during @@</div>
                    <div>>>>>>>>>>>>>
                      -438,7</div>
                    <div>>>>>>>>>>>>>>
                      +439,7 @@ void drm_sched_start(struct
                      drm_gpu_scheduler</div>
                    <div>>>>>> *sched,</div>
                    <div>>>>>>>>>>>>>>
                      +bool</div>
                    <div>>>>>>>>>>>>>
                      full_recovery)</div>
                    <div>>>>>>>>>>>>>>            
                      * GPU recovers can't run in parallel.</div>
                    <div>>>>>>>>>>>>>>            
                      */</div>
                    <div>>>>>>>>>>>>>>           
                      list_for_each_entry_safe(s_job, tmp,</div>
                    <div>>>>>>>>>>>>>>
                      &sched->ring_mirror_list,</div>
                    <div>>>>>>>>>>>>>>
                      node)</div>
                    <div>>>>>>>>>>>>>
                      {</div>
                    <div>>>>>>>>>>>>>>
                      -          struct dma_fence *fence =
                      s_job->s_fence->parent;</div>
                    <div>>>>>>>>>>>>>>
                      +          struct dma_fence *fence =
                      s_job->s_fence ?</div>
                    <div>>>>>>>>>>>>>>
                      + s_job-</div>
                    <div>>>>>>>>> s_fence-</div>
                    <div>>>>>>>>>>>>>>
                      parent :</div>
                    <div>>>>>>>>>>>>>>
                      +NULL;</div>
                    <div>>>>>>>>>>>>>></div>
                    <div>>>>>>>>>>>>>>                   
                      atomic_inc(&sched->hw_rq_count);</div>
                    <div>>>>>>>>>>>>>></div>
                    <div><font face="Times New Roman"> </font></div>
                  </span></font> </blockquote>
              <br>
            </blockquote>
            <br>
          </blockquote>
          <br>
        </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>