<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Hi Emily,<br>
      <br>
You need to print which scheduler instance is freeing the jobs and
      which one is triggering the reset. The TID and PID are completely
      meaningless here, since we are called from different worker threads
      and the TID/PID can change on each call.<br>
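As an illustration of that suggestion, a debug line keyed by the scheduler instance could look like the sketch below. This is plain userspace C for clarity; `sched_model` and `format_cleanup_log` are made-up stand-ins, not the actual kernel structures (though `struct drm_gpu_scheduler` does carry a `name` field that could serve this purpose):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the scheduler instance; only the field the
 * log line needs. Not the real struct drm_gpu_scheduler. */
struct sched_model {
    const char *name;   /* ring name, e.g. "sdma0" */
};

/* Key the debug line by scheduler instance and job pointer instead of
 * TID/PID, which vary with whichever worker thread runs the callback. */
int format_cleanup_log(char *buf, size_t len,
                       const struct sched_model *sched,
                       const void *s_job)
{
    return snprintf(buf, len,
                    "drm_sched_cleanup_jobs: sched=%s(%p) s_job=%p",
                    sched->name, (const void *)sched, s_job);
}
```

With a line like this, two log entries from different worker threads still identify the same scheduler instance, which the TID/PID cannot do.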
      <br>
Apart from that, I will look into this a bit deeper when I have
      time.<br>
      <br>
      Regards,<br>
      Christian.<br>
      <br>
On 12.11.19 at 07:02, Deng, Emily wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:MN2PR12MB29750EDB35E27DF9CD63152C8F770@MN2PR12MB2975.namprd12.prod.outlook.com">
      
      <meta name="Generator" content="Microsoft Exchange Server">
      <!-- converted from rtf -->
      <style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
      <font size="2" face="Calibri"><span style="font-size:11pt;">
          <div>Hi Christian,</div>
<div>    I added the following print in function
            drm_sched_cleanup_jobs. The log shows that using
            cancel_delayed_work alone could not avoid freeing a job while the
            sched is in reset. But I don't know exactly where the driver
            goes wrong. Do you have any suggestion
            about this?</div>
          <div><font face="Times New Roman"> </font></div>
<div>+      
            printk("Emily:drm_sched_cleanup_jobs:begin,tid:%d,
            pid:%d\n", current->tgid, current->pid);</div>
          <div> </div>
          <div>        /*</div>
          <div>         * Don't destroy jobs while the timeout worker is
            running  OR thread</div>
          <div>         * is being parked and hence assumed to not touch
            ring_mirror_list</div>
          <div>         */</div>
<div>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT
            &&</div>
          <div>           
            !cancel_delayed_work(&sched->work_tdr))</div>
          <div>                return;</div>
<div>+       printk("Emily:drm_sched_cleanup_jobs,tid:%d,
            pid:%d\n", current->tgid, current->pid);</div>
          <div><font face="Times New Roman"> </font></div>
          <div><font face="Times New Roman"> </font></div>
          <div><font face="Calibri">Best wishes</font></div>
          <div><font face="Calibri">Emily Deng</font></div>
          <div><font face="Times New Roman"> </font></div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11380.695091]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11380.695104]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11380.695105]
            Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11380.695107]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11380.695107]
            Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.222954] [drm:amdgpu_job_timedout [amdgpu]]
            *ERROR* ring sdma0 timeout, signaled seq=78585, emitted
            seq=78587</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.224275] [drm:amdgpu_job_timedout [amdgpu]]
            *ERROR* Process information: process  pid 0 thread  pid 0,
            s_job:00000000fe75ab36,tid=15603, pid=15603</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225413] amdgpu 0000:00:08.0: <font color="#00B050">GPU reset begin</font>!</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225417]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225425]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225425]
            Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225428] Emily:<font color="#00B050">amdgpu_job_free_cb</font>,Process
            information: process  pid 0 thread  pid 0,
            s_job:00000000fe75ab36, tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225429]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225430]
            Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225473]
            Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225486]
            Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225489]
            Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262</div>
          <div>Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
            kernel: [11381.225494] Emily:amdgpu_job_free_cb,Process
            information: process  pid 0 thread  pid 0,
            s_job:00000000f086ec84, tid:2262, pid:2262</div>
          <div>>-----Original Message-----</div>
          <div>>From: Grodzovsky, Andrey
            <a class="moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com"><Andrey.Grodzovsky@amd.com></a></div>
          <div>>Sent: Tuesday, November 12, 2019 11:28 AM</div>
          <div>>To: Koenig, Christian
            <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; Deng, Emily</div>
          <div>><a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>;
            <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a></div>
          <div>>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
            issue for tdr</div>
          <div>></div>
<div>>Thinking more about this claim - we assume here that
            if cancel_delayed_work</div>
          <div>>returned true, it guarantees that the timeout work is not
            running. But it merely</div>
          <div>>means there was a pending timeout work which was
            removed from the</div>
          <div>>workqueue before its timer elapsed, so it didn't
            have a chance to be</div>
          <div>>dequeued and executed; it doesn't cover already
            executing work. So there is a</div>
          <div>>possibility that, while one timeout work is
            executing, another timeout work</div>
          <div>>already got enqueued (maybe through earlier cleanup
            jobs or through</div>
          <div>>drm_sched_fault), and if at this point another
            drm_sched_cleanup_jobs runs,</div>
          <div>>cancel_delayed_work(&sched->work_tdr) will
            return true even while there is a</div>
          <div>>timeout job in progress.</div>
          <div>>Unfortunately we cannot change cancel_delayed_work to</div>
<div>>cancel_delayed_work_sync to flush the timeout work, as the
            timeout work itself</div>
          <div>>waits for the scheduler thread to be parked again when
            calling park_thread.</div>
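Andrey's point boils down to the delayed-work state machine sketched below. This is a userspace toy model of the semantics only (the names `dwork_model` and `model_cancel_delayed_work` are made up, and this is not the kernel workqueue implementation): cancelling only succeeds against a *pending* item, and says nothing about a callback that is already mid-execution.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of delayed-work states; a sketch of the semantics only,
 * not the kernel workqueue implementation. */
enum work_state { WORK_IDLE, WORK_PENDING, WORK_RUNNING };

struct dwork_model {
    enum work_state state;
};

/* Like cancel_delayed_work(): removes a queued-but-not-yet-running
 * item and returns true; an already-executing callback is neither
 * waited for nor noticed - it keeps running. */
bool model_cancel_delayed_work(struct dwork_model *w)
{
    if (w->state == WORK_PENDING) {
        w->state = WORK_IDLE;
        return true;
    }
    return false;
}
```

So if a second timeout work was queued while the first is still executing, cancel_delayed_work() on the pending one returns true even though a timeout handler is mid-flight - exactly the window described above.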
          <div>></div>
          <div>>Andrey</div>
          <div>></div>
          <div>>________________________________________</div>
          <div>>From: amd-gfx
            <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org"><amd-gfx-bounces@lists.freedesktop.org></a> on behalf of</div>
          <div>>Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a></div>
          <div>>Sent: 08 November 2019 05:35:18</div>
          <div>>To: Deng, Emily; <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a></div>
          <div>>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
            issue for tdr</div>
          <div>></div>
          <div>>Hi Emily,</div>
          <div>></div>
          <div>>exactly that can't happen. See here:</div>
          <div>></div>
          <div>>>         /* Don't destroy jobs while the timeout
            worker is running */</div>
          <div>>>         if (sched->timeout !=
            MAX_SCHEDULE_TIMEOUT &&</div>
          <div>>>            
            !cancel_delayed_work(&sched->work_tdr))</div>
          <div>>>                 return NULL;</div>
          <div>></div>
<div>>We never free jobs while the timeout worker is
            running, to prevent exactly</div>
          <div>>that issue.</div>
          <div>></div>
          <div>>Regards,</div>
          <div>>Christian.</div>
          <div>></div>
<div>>On 08.11.19 at 11:32, Deng, Emily wrote:</div>
          <div>>> Hi Christian,</div>
<div>>>       The drm_sched_job_timedout ->
            amdgpu_job_timedout path calls</div>
          <div>>amdgpu_device_gpu_recover. I mean the main scheduler
            frees the jobs while</div>
          <div>>in amdgpu_device_gpu_recover, and before calling
            drm_sched_stop.</div>
          <div>>></div>
          <div>>> Best wishes</div>
          <div>>> Emily Deng</div>
          <div>>></div>
          <div>>></div>
          <div>>></div>
          <div>>>> -----Original Message-----</div>
          <div>>>> From: Koenig, Christian
            <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a></div>
          <div>>>> Sent: Friday, November 8, 2019 6:26 PM</div>
          <div>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>;
            <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a></div>
          <div>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the
            null pointer issue for tdr</div>
          <div>>>></div>
          <div>>>> Hi Emily,</div>
          <div>>>></div>
          <div>>>> well who is calling
            amdgpu_device_gpu_recover() in this case?</div>
          <div>>>></div>
          <div>>>> When it's not the scheduler we shouldn't
            have a guilty job in the first place.</div>
          <div>>>></div>
          <div>>>> Regards,</div>
          <div>>>> Christian.</div>
          <div>>>></div>
<div>>>> On 08.11.19 at 11:22, Deng, Emily wrote:</div>
<div>>>>> Hi Christian,</div>
          <div>>>>>        No, I am on the new branch and it
            also has the patch. Even if they</div>
          <div>>>>> are freed by</div>
          <div>>>> the main scheduler, how could we prevent the main
            scheduler from freeing jobs while</div>
          <div>>>> entering amdgpu_device_gpu_recover?</div>
          <div>>>>> Best wishes</div>
          <div>>>>> Emily Deng</div>
          <div>>>>></div>
          <div>>>>></div>
          <div>>>>></div>
          <div>>>>>> -----Original Message-----</div>
          <div>>>>>> From: Koenig, Christian
            <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a></div>
          <div>>>>>> Sent: Friday, November 8, 2019 6:15
            PM</div>
          <div>>>>>> To: Deng, Emily
            <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>; amd-</div>
          <div>><a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a></div>
          <div>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix
            the null pointer issue for tdr</div>
          <div>>>>>></div>
          <div>>>>>> Hi Emily,</div>
          <div>>>>>></div>
          <div>>>>>> in this case you are on an old code
            branch.</div>
          <div>>>>>></div>
          <div>>>>>> Jobs are freed now by the main
            scheduler thread and only if no</div>
          <div>>>>>> timeout handler is running.</div>
          <div>>>>>></div>
          <div>>>>>> See this patch here:</div>
          <div>>>>>>> commit
            5918045c4ed492fb5813f980dcf89a90fefd0a4e</div>
          <div>>>>>>> Author: Christian König
            <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com"><christian.koenig@amd.com></a></div>
          <div>>>>>>> Date:   Thu Apr 18 11:00:21 2019
            -0400</div>
          <div>>>>>>></div>
          <div>>>>>>>       drm/scheduler: rework job
            destruction</div>
          <div>>>>>> Regards,</div>
          <div>>>>>> Christian.</div>
          <div>>>>>></div>
<div>>>>>> On 08.11.19 at 11:11, Deng, Emily wrote:</div>
          <div>>>>>>> Hi Christian,</div>
          <div>>>>>>>         Please refer to follow
            log, when it enter to</div>
          <div>>>>>>> amdgpu_device_gpu_recover</div>
          <div>>>>>> function, the bad job
            000000005086879e is freeing in function</div>
          <div>>>>>> amdgpu_job_free_cb  at the same
            time, because of the hardware fence</div>
          <div>>>> signal.</div>
          <div>>>>>> But amdgpu_device_gpu_recover goes
            faster, at this case, the</div>
          <div>>>>>> s_fence is already freed, but job is
            not freed in time. Then this issue</div>
          <div>>occurs.</div>
          <div>>>>>>> [  449.792189]
            [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring</div>
          <div>>>> sdma0</div>
          <div>>>>>>> timeout, signaled seq=2481,
            emitted seq=2483 [  449.793202]</div>
          <div>>>>>>> [drm:amdgpu_job_timedout
            [amdgpu]] *ERROR* Process information:</div>
          <div>>>>>> process  pid 0 thread  pid 0,
            s_job:000000005086879e [  449.794163]</div>
          <div>>>>>> amdgpu</div>
          <div>>>>>> 0000:00:08.0: GPU reset begin!</div>
          <div>>>>>>> [  449.794175]
            Emily:amdgpu_job_free_cb,Process information:</div>
          <div>>>>>>> process pid 0 thread  pid 0,
            s_job:000000005086879e [  449.794221]</div>
          <div>>>>>>> Emily:amdgpu_job_free_cb,Process
            information: process  pid 0</div>
          <div>>>>>>> thread pid 0,
            s_job:0000000066eb74ab [  449.794222]</div>
          <div>>>>>>> Emily:amdgpu_job_free_cb,Process
            information: process  pid 0</div>
          <div>>>>>>> thread pid 0,
            s_job:00000000d4438ad9 [  449.794255]</div>
          <div>>>>>>> Emily:amdgpu_job_free_cb,Process
            information: process  pid 0</div>
          <div>>>>>>> thread pid 0,
            s_job:00000000b6d69c65 [  449.794257]</div>
          <div>>>>>>> Emily:amdgpu_job_free_cb,Process
            information: process  pid 0</div>
          <div>>>>>>> thread pid 0,</div>
          <div>>>>>> s_job:00000000ea85e922 [ 
            449.794287]</div>
          <div>>>>>> Emily:amdgpu_job_free_cb,Process</div>
          <div>>>>>> information: process  pid 0 thread 
            pid 0, s_job:00000000ed3a5ac6 [</div>
          <div>>>>>> 449.794366] BUG: unable to handle
            kernel NULL pointer dereference</div>
          <div>>>>>> at</div>
          <div>>>>>> 00000000000000c0 [  449.800818] PGD
            0 P4D 0 [  449.801040] Oops:</div>
          <div>>>>>> 0000 [#1] SMP PTI</div>
          <div>>>>>>> [  449.801338] CPU: 3 PID: 55
            Comm: kworker/3:1 Tainted: G           OE</div>
          <div>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu</div>
          <div>>>>>>> [  449.802157] Hardware name:
            QEMU Standard PC (i440FX + PIIX,</div>
          <div>>>>>>> 1996), BIOS
            Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  449.802944]</div>
          <div>>>>>>> Workqueue: events
            drm_sched_job_timedout [amd_sched] [</div>
          <div>>>>>>> 449.803488]</div>
          <div>>>> RIP:</div>
          <div>>>>>>
            0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]</div>
          <div>>>>>>> [  449.804020] Code: dd ff ff 49
            39 c5 48 89 55 a8 0f 85 56 ff ff</div>
          <div>>>>>>> ff</div>
          <div>>>>>>> 45 85 e4 0f</div>
          <div>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0
            0f 84 60 01 00 00 48 8b 40 10</div>
          <div>>>>>> <48> 8b</div>
          <div>>>> 98</div>
          <div>>>>>> c0 00         00 00 48 85 db 0f 84
            4c 01 00 00 48 8b 43 48 a8 01</div>
          <div>>>>>>> [  449.805593] RSP:
            0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [</div>
          <div>>>>>>> 449.806032] RAX:
            0000000000000000 RBX: 0000000000000000 RCX:</div>
          <div>>>>>>> 0000000000000000 [  449.806625]
            RDX: ffffb4c7c08f5ac0 RSI:</div>
          <div>>>>>>> 0000000fffffffe0 RDI:
            0000000000000246 [  449.807224] RBP:</div>
          <div>>>>>>> ffffb4c7c08f7de0 R08:
            00000068b9d54000 R09: 0000000000000000 [</div>
          <div>>>>>>> 449.807818] R10:
            0000000000000000 R11: 0000000000000148 R12:</div>
          <div>>>>>>> 0000000000000000 [  449.808411]
            R13: ffffb4c7c08f7da0 R14:</div>
          <div>>>>>>> ffff8d82b8525d40 R15:
            ffff8d82b8525d40 [  449.809004] FS:</div>
          <div>>>>>>> 0000000000000000(0000)
            GS:ffff8d82bfd80000(0000)</div>
          <div>>>>>>> knlGS:0000000000000000 [ 
            449.809674] CS:  0010 DS: 0000 ES: 0000</div>
          <div>>CR0:</div>
          <div>>>>>>> 0000000080050033 [  449.810153]
            CR2: 00000000000000c0 CR3:</div>
          <div>>>>>>> 000000003cc0a001 CR4:
            00000000003606e0 [  449.810747] DR0:</div>
          <div>>>>>> 0000000000000000 DR1:
            0000000000000000 DR2: 0000000000000000 [</div>
          <div>>>>>> 449.811344] DR3: 0000000000000000
            DR6: 00000000fffe0ff0 DR7:</div>
          <div>>>>>> 0000000000000400 [  449.811937] Call
            Trace:</div>
          <div>>>>>>> [  449.812206] 
            amdgpu_job_timedout+0x114/0x140 [amdgpu] [</div>
          <div>>>>>>> 449.812635] 
            drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
          <div>>>>>>> 449.813139]  ?
            amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [</div>
          <div>>>>>>> 449.813609]  ?
            drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
          <div>>>>>>> 449.814077] 
            process_one_work+0x1fd/0x3f0 [  449.814417]</div>
          <div>>>>>>> worker_thread+0x34/0x410 [ 
            449.814728]  kthread+0x121/0x140 [</div>
          <div>>>>>>> 449.815004]  ?
            process_one_work+0x3f0/0x3f0 [  449.815374]  ?</div>
          <div>>>>>>>
            kthread_create_worker_on_cpu+0x70/0x70</div>
          <div>>>>>>> [  449.815799] 
            ret_from_fork+0x35/0x40</div>
          <div>>>>>>></div>
          <div>>>>>>>> -----Original Message-----</div>
          <div>>>>>>>> From: Koenig, Christian
            <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a></div>
          <div>>>>>>>> Sent: Friday, November 8,
            2019 5:43 PM</div>
          <div>>>>>>>> To: Deng, Emily
            <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>; amd-</div>
          <div>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a></div>
          <div>>>>>>>> Subject: Re: [PATCH]
            drm/amdgpu: Fix the null pointer issue for</div>
          <div>>>>>>>> tdr</div>
          <div>>>>>>>></div>
<div>>>>>>>> On 08.11.19 at 10:39, Deng, Emily wrote:</div>
          <div>>>>>>>>> Sorry, please take your
            time.</div>
          <div>>>>>>>> Have you seen my other
            response a bit below?</div>
          <div>>>>>>>></div>
          <div>>>>>>>> I can't follow how it would
            be possible for job->s_fence to be</div>
          <div>>>>>>>> NULL without the job also
            being freed.</div>
          <div>>>>>>>></div>
          <div>>>>>>>> So it looks like this patch
            is just papering over some bigger issues.</div>
          <div>>>>>>>></div>
          <div>>>>>>>> Regards,</div>
          <div>>>>>>>> Christian.</div>
          <div>>>>>>>></div>
          <div>>>>>>>>> Best wishes</div>
          <div>>>>>>>>> Emily Deng</div>
          <div>>>>>>>>></div>
          <div>>>>>>>>></div>
          <div>>>>>>>>></div>
          <div>>>>>>>>>> -----Original
            Message-----</div>
          <div>>>>>>>>>> From: Koenig,
            Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a></div>
          <div>>>>>>>>>> Sent: Friday,
            November 8, 2019 5:08 PM</div>
          <div>>>>>>>>>> To: Deng, Emily
            <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>; amd-</div>
          <div>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a></div>
          <div>>>>>>>>>> Subject: Re: [PATCH]
            drm/amdgpu: Fix the null pointer issue for</div>
          <div>>>>>>>>>> tdr</div>
          <div>>>>>>>>>></div>
<div>>>>>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:</div>
          <div>>>>>>>>>>> Ping.....</div>
          <div>>>>>>>>>> You need to give me
            at least enough time to wake up :)</div>
          <div>>>>>>>>>></div>
          <div>>>>>>>>>>> Best wishes</div>
          <div>>>>>>>>>>> Emily Deng</div>
          <div>>>>>>>>>>></div>
          <div>>>>>>>>>>></div>
          <div>>>>>>>>>>></div>
          <div>>>>>>>>>>>>
            -----Original Message-----</div>
          <div>>>>>>>>>>>> From:
            amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org"><amd-gfx-bounces@lists.freedesktop.org></a> On</div>
          <div>>>> Behalf</div>
          <div>>>>>>>>>>>> Of Deng,
            Emily</div>
          <div>>>>>>>>>>>> Sent:
            Friday, November 8, 2019 10:56 AM</div>
          <div>>>>>>>>>>>> To: Koenig,
            Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; amd-</div>
          <div>>>>>>>>>>>>
            <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a></div>
          <div>>>>>>>>>>>> Subject: RE:
            [PATCH] drm/amdgpu: Fix the null pointer issue</div>
          <div>>>>>>>>>>>> for tdr</div>
          <div>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>>
            -----Original Message-----</div>
          <div>>>>>>>>>>>>> From:
            Christian König <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com"><ckoenig.leichtzumerken@gmail.com></a></div>
          <div>>>>>>>>>>>>> Sent:
            Thursday, November 7, 2019 7:28 PM</div>
          <div>>>>>>>>>>>>> To:
            Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>;</div>
          <div>>>>>>>>>>>>>
            <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a></div>
          <div>>>>>>>>>>>>> Subject:
            Re: [PATCH] drm/amdgpu: Fix the null pointer issue</div>
          <div>>>>>>>>>>>>> for tdr</div>
          <div>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>> Am
            07.11.19 um 11:25 schrieb Emily Deng:</div>
          <div>>>>>>>>>>>>>> When
            the job is already signaled, the s_fence is freed.</div>
<div>>>>>>>>>>>>>> Then
            it will have a null pointer in amdgpu_device_gpu_recover.</div>
          <div>>>>>>>>>>>>> NAK, the
            s_fence is only set to NULL when the job is destroyed.</div>
          <div>>>>>>>>>>>>> See
            drm_sched_job_cleanup().</div>
<div>>>>>>>>>>>> I know it is
            set to NULL in drm_sched_job_cleanup. But in one</div>
          <div>>>>>>>>>>>> case, when
            it enters amdgpu_device_gpu_recover, it is</div>
          <div>>>>>>>>>>>> already in
            drm_sched_job_cleanup, and at this time it will</div>
          <div>>>>>>>>>>>> go on to free the</div>
          <div>>>>>> job.</div>
          <div>>>>>>>>>>>> But
            amdgpu_device_gpu_recover is sometimes faster. At</div>
          <div>>>>>>>>>>>> that time,
            the job is not yet freed, but s_fence is already NULL.</div>
          <div>>>>>>>>>> No, that case can't
            happen. See here:</div>
          <div>>>>>>>>>></div>
          <div>>>>>>>>>>>            
            drm_sched_job_cleanup(s_job);</div>
          <div>>>>>>>>>>></div>
          <div>>>>>>>>>>>            
            amdgpu_ring_priority_put(ring, s_job->s_priority);</div>
          <div>>>>>>>>>>>            
            dma_fence_put(job->fence);</div>
          <div>>>>>>>>>>>            
            amdgpu_sync_free(&job->sync);</div>
          <div>>>>>>>>>>>            
            amdgpu_sync_free(&job->sched_sync);</div>
          <div>>>>>>>>>>>            
            kfree(job);</div>
          <div>>>>>>>>>> The job itself is
            freed up directly after freeing the reference</div>
          <div>>>>>>>>>> to the</div>
          <div>>>>>> s_fence.</div>
          <div>>>>>>>>>> So you are just
            papering over a much bigger problem here. This</div>
          <div>>>>>>>>>> patch is a clear
            NAK.</div>
          <div>>>>>>>>>></div>
          <div>>>>>>>>>> Regards,</div>
          <div>>>>>>>>>> Christian.</div>
          <div>>>>>>>>>></div>
          <div>>>>>>>>>>>>> When you
            see a job without an s_fence then that means the</div>
          <div>>>>>>>>>>>>> problem
            is somewhere else.</div>
          <div>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>> Regards,</div>
          <div>>>>>>>>>>>>>
            Christian.</div>
          <div>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>>>
            Signed-off-by: Emily Deng <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a></div>
          <div>>>>>>>>>>>>>> ---</div>
          <div>>>>>>>>>>>>>>       
            drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-</div>
          <div>>>>>>>>>>>>>>       
            drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++---</div>
          <div>>--</div>
          <div>>>>>>>>>>>>>>       
            2 files changed, 7 insertions(+), 6 deletions(-)</div>
          <div>>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>>> diff
            --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>>>>>>>>>>>>>>
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>>>>>>>>>>>>>>
            index e6ce949..5a8f08e 100644</div>
          <div>>>>>>>>>>>>>> ---
            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>>>>>>>>>>>>>> +++
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
          <div>>>>>>>>>>>>>> @@
            -4075,7 +4075,7 @@ int</div>
          <div>>>> amdgpu_device_gpu_recover(struct</div>
          <div>>>>>>>>>>>>>
            amdgpu_device *adev,</div>
          <div>>>>>>>>>>>>>>            
            *</div>
          <div>>>>>>>>>>>>>>            
            * job->base holds a reference to parent fence</div>
          <div>>>>>>>>>>>>>>            
            */</div>
          <div>>>>>>>>>>>>>> - 
            if (job && job->base.s_fence->parent
            &&</div>
          <div>>>>>>>>>>>>>> + 
            if (job && job->base.s_fence &&</div>
          <div>>>>>>>>>>>>>> +
            job->base.s_fence->parent</div>
          <div>>>>>>>> &&</div>
          <div>>>>>>>>>>>>>>               
            dma_fence_is_signaled(job->base.s_fence->parent))</div>
          <div>>>>>>>>>>>>>>                   
            job_signaled = true;</div>
          <div>>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>>> diff
            --git a/drivers/gpu/drm/scheduler/sched_main.c</div>
          <div>>>>>>>>>>>>>>
            b/drivers/gpu/drm/scheduler/sched_main.c</div>
          <div>>>>>>>>>>>>>>
            index 31809ca..56cc10e 100644</div>
          <div>>>>>>>>>>>>>> ---
            a/drivers/gpu/drm/scheduler/sched_main.c</div>
          <div>>>>>>>>>>>>>> +++
            b/drivers/gpu/drm/scheduler/sched_main.c</div>
          <div>>>>>>>>>>>>>> @@
            -334,8 +334,8 @@ void</div>
          <div>>drm_sched_increase_karma(struct</div>
          <div>>>>>>>>>>>>>
            drm_sched_job</div>
          <div>>>>>>>>>>>>>>
            *bad)</div>
          <div>>>>>>>>>>>>>></div>
          <div>>>>>>>>>>>>>>                           
            spin_lock(&rq->lock);</div>
          <div>>             list_for_each_entry_safe(entity, tmp, &rq->entities, list) {</div>
          <div>> -                 if (bad->s_fence->scheduled.context ==</div>
          <div>> -                     entity->fence_context) {</div>
          <div>> +                 if (bad->s_fence &&</div>
          <div>> +                     (bad->s_fence->scheduled.context ==</div>
          <div>> +                     entity->fence_context)) {</div>
          <div>>                         if (atomic_read(&bad->karma) ></div>
          <div>>                             bad->sched->hang_limit)</div>
          <div>>                                 if (entity->guilty)</div>
          <div>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)</div>
          <div>>          * This iteration is thread safe as sched thread is stopped.</div>
          <div>>          */</div>
          <div>>         list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {</div>
          <div>> -               if (s_job->s_fence->parent &&</div>
          <div>> +               if (s_job->s_fence && s_job->s_fence->parent &&</div>
          <div>>                     dma_fence_remove_callback(s_job->s_fence->parent,</div>
          <div>>                                               &s_job->cb)) {</div>
          <div>>                         atomic_dec(&sched->hw_rq_count);</div>
          <div>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)</div>
          <div>>                          *</div>
          <div>>                          * Job is still alive so fence refcount at least 1</div>
          <div>>                          */</div>
          <div>> -                       dma_fence_wait(&s_job->s_fence->finished, false);</div>
          <div>> +                       if (s_job->s_fence)</div>
          <div>> +                               dma_fence_wait(&s_job->s_fence->finished, false);</div>
          <div>>                         /*</div>
          <div>>                          * We must keep bad job alive for later use during</div>
          <div>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)</div>
          <div>>          * GPU recovers can't run in parallel.</div>
          <div>>          */</div>
          <div>>         list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {</div>
          <div>> -               struct dma_fence *fence = s_job->s_fence->parent;</div>
          <div>> +               struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;</div>
          <div>></div>
          <div>>                 atomic_inc(&sched->hw_rq_count);</div>
          <div>> _______________________________________________</div>
          <div>> amd-gfx mailing list</div>
          <div>> <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a></div>
          <div>> <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></div>
          <div><font face="Times New Roman"> </font></div>
        </span></font>
    </blockquote>
    <br>
  </body>
</html>