<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>This why I asked for a trace with timer enabled, but since there
is a finite number of places we touch the timer Emily can just put
prints there. Also, I wonder if this temp fix helps her with the
issue or not.</p>
<p>Andrey<br>
</p>
<div class="moz-cite-prefix">On 11/13/19 2:36 AM, Christian König
wrote:<br>
</div>
<blockquote type="cite" cite="mid:33ffe2f1-32b6-a238-3752-cee67cd9e141@gmail.com">
<div class="moz-cite-prefix">The question is where do we rearm the
timer for this problem to occur?<br>
<br>
Regards,<br>
Christian. <br>
<br>
Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:<br>
</div>
<blockquote type="cite" cite="mid:9269d447-ed32-81f7-ab43-cb16139096e2@amd.com">
<p>I was able to reproduce the crash by using the attached
simulate_crash.patch - waiting on guilty job to signal in
reset work and artificially rearming the timeout timer just
before the check for
!cancel_delayed_work(&sched->work_tdr) in
drm_sched_cleanup_jobs - crash log attached in crash.log. This
I think confirms my theory i described earlier in this thread.</p>
<p>basic_fix.patch handles this by testing whether another timer
already armed ob this scheduler or is there a timeout work in
execution right now (see documentation for work_busy) -
obviously this is not a full solution as this will not
protect from races if for example there is immediate work
scheduling such as in <font size="2" face="Calibri"><span style="font-size:11pt;">drm_sched_fault - so we probably
need to account for this by making drm_sched_cleanup_jobs
(at least in the part where it iterates ring mirror list
and frees jobs) and GPU reset really mutually exclusive
and not like now.</span></font></p>
<p><font size="2" face="Calibri"><span style="font-size:11pt;">Andrey<br>
</span></font></p>
<p><br>
</p>
<div class="moz-cite-prefix">On 11/11/19 4:11 PM, Christian
König wrote:<br>
</div>
<blockquote type="cite" cite="mid:2f035f22-4057-dd9e-27ef-0f5612113e29@amd.com">
<div class="moz-cite-prefix">Hi Emily,<br>
<br>
you need to print which scheduler instance is freeing the
jobs and which one is triggering the reset. The TID and PID
is completely meaningless here since we are called from
different worker threads and the TID/PID can change on each
call.<br>
<br>
Apart from that I will look into this a bit deeper when I
have time.<br>
<br>
Regards,<br>
Christian.<br>
<br>
Am 12.11.19 um 07:02 schrieb Deng, Emily:<br>
</div>
<blockquote type="cite" cite="mid:MN2PR12MB29750EDB35E27DF9CD63152C8F770@MN2PR12MB2975.namprd12.prod.outlook.com">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from rtf -->
<style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
<font size="2" face="Calibri"><span style="font-size:11pt;">
<div>Hi Christian,</div>
<div> I add the follow print in function
drm_sched_cleanup_jobs. From the log it shows that
only use cancel_delayed_work could not avoid to free
job when the sched is in reset. But don’t know exactly
where it is wrong about the driver. Do you have any
suggestion about this?</div>
<div><font face="Times New Roman"> </font></div>
<div>+
printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
pid:%lu\n", current->tgid, current->pid);</div>
<div> </div>
<div> /*</div>
<div> * Don't destroy jobs while the timeout
worker is running OR thread</div>
<div> * is being parked and hence assumed to not
touch ring_mirror_list</div>
<div> */</div>
<div> if ((sched->timeout !=
MAX_SCHEDULE_TIMEOUT &&</div>
<div>
!cancel_delayed_work(&sched->work_tdr)))</div>
<div> return;</div>
<div>+
printk("Emily:drm_sched_cleanup_jobs,tid:%lu,
pid:%lu\n", current->tgid, current->pid);</div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Times New Roman"> </font></div>
<div><font face="Calibri">Best wishes</font></div>
<div><font face="Calibri">Emily Deng</font></div>
<div><font face="Times New Roman"> </font></div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11380.695091]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11380.695104]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262,
pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11380.695107]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262,
pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.222954] [drm:amdgpu_job_timedout [amdgpu]]
*ERROR* ring sdma0 timeout, signaled seq=78585,
emitted seq=78587</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.224275] [drm:amdgpu_job_timedout [amdgpu]]
*ERROR* Process information: process pid 0 thread
pid 0, s_job:00000000fe75ab36,tid=15603, pid=15603</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225413] amdgpu 0000:00:08.0: <font color="#00B050">GPU reset begin</font>!</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225417]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225425]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262,
pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225428] Emily:<font color="#00B050">amdgpu_job_free_cb</font>,Process
information: process pid 0 thread pid 0,
s_job:00000000fe75ab36, tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225429]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262,
pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225473]
Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225486]
Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262,
pid:2262</div>
<div>Nov 12 12:58:20
ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
[11381.225494] Emily:amdgpu_job_free_cb,Process
information: process pid 0 thread pid 0,
s_job:00000000f086ec84, tid:2262, pid:2262</div>
<div>>-----Original Message-----</div>
<div>>From: Grodzovsky, Andrey <a class="moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true"><Andrey.Grodzovsky@amd.com></a></div>
<div>>Sent: Tuesday, November 12, 2019 11:28 AM</div>
<div>>To: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
Deng, Emily</div>
<div>><a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>>Subject: Re: [PATCH] drm/amdgpu: Fix the null
pointer issue for tdr</div>
<div>></div>
<div>>Thinking more about this claim - we assume here
that if cancel_delayed_work</div>
<div>>returned true it guarantees that timeout work
is not running but, it merely</div>
<div>>means there was a pending timeout work which
was removed from the</div>
<div>>workqueue before it's timer elapsed and so it
didn't have a chance to be</div>
<div>>dequeued and executed, it doesn't cover already
executing work. So there is a</div>
<div>>possibility where while timeout work started
executing another timeout work</div>
<div>>already got enqueued (maybe through earlier
cleanup jobs or through</div>
<div>>drm_sched_fault) and if at this point another
drm_sched_cleanup_jobs runs</div>
<div>>cancel_delayed_work(&sched->work_tdr)
will return true even while there is a</div>
<div>>timeout job in progress.</div>
<div>>Unfortunately we cannot change
cancel_delayed_work to</div>
<div>>cancel_delayed_work_sync to flush the timeout
work as timeout work itself</div>
<div>>waits for schedule thread to be parked again
when calling park_thread.</div>
<div>></div>
<div>>Andrey</div>
<div>></div>
<div>>________________________________________</div>
<div>>From: amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true"><amd-gfx-bounces@lists.freedesktop.org></a>
on behalf of</div>
<div>>Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
<div>>Sent: 08 November 2019 05:35:18</div>
<div>>To: Deng, Emily; <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>>Subject: Re: [PATCH] drm/amdgpu: Fix the null
pointer issue for tdr</div>
<div>></div>
<div>>Hi Emily,</div>
<div>></div>
<div>>exactly that can't happen. See here:</div>
<div>></div>
<div>>> /* Don't destroy jobs while the
timeout worker is running */</div>
<div>>> if (sched->timeout !=
MAX_SCHEDULE_TIMEOUT &&</div>
<div>>>
!cancel_delayed_work(&sched->work_tdr))</div>
<div>>> return NULL;</div>
<div>></div>
<div>>We never free jobs while the timeout working is
running to prevent exactly</div>
<div>>that issue.</div>
<div>></div>
<div>>Regards,</div>
<div>>Christian.</div>
<div>></div>
<div>>Am 08.11.19 um 11:32 schrieb Deng, Emily:</div>
<div>>> Hi Christian,</div>
<div>>> The drm_sched_job_timedout->
amdgpu_job_timedout call</div>
<div>>amdgpu_device_gpu_recover. I mean the main
scheduler free the jobs while</div>
<div>>in amdgpu_device_gpu_recover, and before
calling drm_sched_stop.</div>
<div>>></div>
<div>>> Best wishes</div>
<div>>> Emily Deng</div>
<div>>></div>
<div>>></div>
<div>>></div>
<div>>>> -----Original Message-----</div>
<div>>>> From: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
<div>>>> Sent: Friday, November 8, 2019 6:26 PM</div>
<div>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>>>> Subject: Re: [PATCH] drm/amdgpu: Fix
the null pointer issue for tdr</div>
<div>>>></div>
<div>>>> Hi Emily,</div>
<div>>>></div>
<div>>>> well who is calling
amdgpu_device_gpu_recover() in this case?</div>
<div>>>></div>
<div>>>> When it's not the scheduler we
shouldn't have a guilty job in the first place.</div>
<div>>>></div>
<div>>>> Regards,</div>
<div>>>> Christian.</div>
<div>>>></div>
<div>>>> Am 08.11.19 um 11:22 schrieb Deng,
Emily:</div>
<div>>>>> Hi Chrisitan,</div>
<div>>>>> No, I am with the new
branch and also has the patch. Even it</div>
<div>>>>> are freed by</div>
<div>>>> main scheduler, how we could avoid
main scheduler to free jobs while</div>
<div>>>> enter to function
amdgpu_device_gpu_recover?</div>
<div>>>>> Best wishes</div>
<div>>>>> Emily Deng</div>
<div>>>>></div>
<div>>>>></div>
<div>>>>></div>
<div>>>>>> -----Original Message-----</div>
<div>>>>>> From: Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
<div>>>>>> Sent: Friday, November 8, 2019
6:15 PM</div>
<div>>>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
amd-</div>
<div>><a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
<div>>>>>> Subject: Re: [PATCH]
drm/amdgpu: Fix the null pointer issue for tdr</div>
<div>>>>>></div>
<div>>>>>> Hi Emily,</div>
<div>>>>>></div>
<div>>>>>> in this case you are on an old
code branch.</div>
<div>>>>>></div>
<div>>>>>> Jobs are freed now by the main
scheduler thread and only if no</div>
<div>>>>>> timeout handler is running.</div>
<div>>>>>></div>
<div>>>>>> See this patch here:</div>
<div>>>>>>> commit
5918045c4ed492fb5813f980dcf89a90fefd0a4e</div>
<div>>>>>>> Author: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com" moz-do-not-send="true"><christian.koenig@amd.com></a></div>
<div>>>>>>> Date: Thu Apr 18
11:00:21 2019 -0400</div>
<div>>>>>>></div>
<div>>>>>>> drm/scheduler:
rework job destruction</div>
<div>>>>>> Regards,</div>
<div>>>>>> Christian.</div>
<div>>>>>></div>
<div>>>>>> Am 08.11.19 um 11:11 schrieb
Deng, Emily:</div>
<div>>>>>>> Hi Christian,</div>
<div>>>>>>> Please refer to
follow log, when it enter to</div>
<div>>>>>>> amdgpu_device_gpu_recover</div>
<div>>>>>> function, the bad job
000000005086879e is freeing in function</div>
<div>>>>>> amdgpu_job_free_cb at the
same time, because of the hardware fence</div>
<div>>>> signal.</div>
<div>>>>>> But amdgpu_device_gpu_recover
goes faster, at this case, the</div>
<div>>>>>> s_fence is already freed, but
job is not freed in time. Then this issue</div>
<div>>occurs.</div>
<div>>>>>>> [ 449.792189]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring</div>
<div>>>> sdma0</div>
<div>>>>>>> timeout, signaled
seq=2481, emitted seq=2483 [ 449.793202]</div>
<div>>>>>>> [drm:amdgpu_job_timedout
[amdgpu]] *ERROR* Process information:</div>
<div>>>>>> process pid 0 thread pid 0,
s_job:000000005086879e [ 449.794163]</div>
<div>>>>>> amdgpu</div>
<div>>>>>> 0000:00:08.0: GPU reset begin!</div>
<div>>>>>>> [ 449.794175]
Emily:amdgpu_job_free_cb,Process information:</div>
<div>>>>>>> process pid 0 thread pid
0, s_job:000000005086879e [ 449.794221]</div>
<div>>>>>>>
Emily:amdgpu_job_free_cb,Process information: process
pid 0</div>
<div>>>>>>> thread pid 0,
s_job:0000000066eb74ab [ 449.794222]</div>
<div>>>>>>>
Emily:amdgpu_job_free_cb,Process information: process
pid 0</div>
<div>>>>>>> thread pid 0,
s_job:00000000d4438ad9 [ 449.794255]</div>
<div>>>>>>>
Emily:amdgpu_job_free_cb,Process information: process
pid 0</div>
<div>>>>>>> thread pid 0,
s_job:00000000b6d69c65 [ 449.794257]</div>
<div>>>>>>>
Emily:amdgpu_job_free_cb,Process information: process
pid 0</div>
<div>>>>>>> thread pid 0,</div>
<div>>>>>> s_job:00000000ea85e922 [
449.794287]</div>
<div>>>>>>
Emily:amdgpu_job_free_cb,Process</div>
<div>>>>>> information: process pid 0
thread pid 0, s_job:00000000ed3a5ac6 [</div>
<div>>>>>> 449.794366] BUG: unable to
handle kernel NULL pointer dereference</div>
<div>>>>>> at</div>
<div>>>>>> 00000000000000c0 [
449.800818] PGD 0 P4D 0 [ 449.801040] Oops:</div>
<div>>>>>> 0000 [#1] SMP PTI</div>
<div>>>>>>> [ 449.801338] CPU: 3 PID:
55 Comm: kworker/3:1 Tainted: G OE</div>
<div>>>>>> 4.18.0-15-generic
#16~18.04.1-Ubuntu</div>
<div>>>>>>> [ 449.802157] Hardware
name: QEMU Standard PC (i440FX + PIIX,</div>
<div>>>>>>> 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 449.802944]</div>
<div>>>>>>> Workqueue: events
drm_sched_job_timedout [amd_sched] [</div>
<div>>>>>>> 449.803488]</div>
<div>>>> RIP:</div>
<div>>>>>>
0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]</div>
<div>>>>>>> [ 449.804020] Code: dd ff
ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff</div>
<div>>>>>>> ff</div>
<div>>>>>>> 45 85 e4 0f</div>
<div>>>>>> 85 a1 00 00 00 48 8b 45 b0 48
85 c0 0f 84 60 01 00 00 48 8b 40 10</div>
<div>>>>>> <48> 8b</div>
<div>>>> 98</div>
<div>>>>>> c0 00 00 00 48 85 db
0f 84 4c 01 00 00 48 8b 43 48 a8 01</div>
<div>>>>>>> [ 449.805593] RSP:
0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [</div>
<div>>>>>>> 449.806032] RAX:
0000000000000000 RBX: 0000000000000000 RCX:</div>
<div>>>>>>> 0000000000000000 [
449.806625] RDX: ffffb4c7c08f5ac0 RSI:</div>
<div>>>>>>> 0000000fffffffe0 RDI:
0000000000000246 [ 449.807224] RBP:</div>
<div>>>>>>> ffffb4c7c08f7de0 R08:
00000068b9d54000 R09: 0000000000000000 [</div>
<div>>>>>>> 449.807818] R10:
0000000000000000 R11: 0000000000000148 R12:</div>
<div>>>>>>> 0000000000000000 [
449.808411] R13: ffffb4c7c08f7da0 R14:</div>
<div>>>>>>> ffff8d82b8525d40 R15:
ffff8d82b8525d40 [ 449.809004] FS:</div>
<div>>>>>>> 0000000000000000(0000)
GS:ffff8d82bfd80000(0000)</div>
<div>>>>>>> knlGS:0000000000000000 [
449.809674] CS: 0010 DS: 0000 ES: 0000</div>
<div>>CR0:</div>
<div>>>>>>> 0000000080050033 [
449.810153] CR2: 00000000000000c0 CR3:</div>
<div>>>>>>> 000000003cc0a001 CR4:
00000000003606e0 [ 449.810747] DR0:</div>
<div>>>>>> 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000 [</div>
<div>>>>>> 449.811344] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7:</div>
<div>>>>>> 0000000000000400 [
449.811937] Call Trace:</div>
<div>>>>>>> [ 449.812206]
amdgpu_job_timedout+0x114/0x140 [amdgpu] [</div>
<div>>>>>>> 449.812635]
drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
<div>>>>>>> 449.813139] ?
amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [</div>
<div>>>>>>> 449.813609] ?
drm_sched_job_timedout+0x44/0x90 [amd_sched] [</div>
<div>>>>>>> 449.814077]
process_one_work+0x1fd/0x3f0 [ 449.814417]</div>
<div>>>>>>> worker_thread+0x34/0x410
[ 449.814728] kthread+0x121/0x140 [</div>
<div>>>>>>> 449.815004] ?
process_one_work+0x3f0/0x3f0 [ 449.815374] ?</div>
<div>>>>>>>
kthread_create_worker_on_cpu+0x70/0x70</div>
<div>>>>>>> [ 449.815799]
ret_from_fork+0x35/0x40</div>
<div>>>>>>></div>
<div>>>>>>>> -----Original
Message-----</div>
<div>>>>>>>> From: Koenig,
Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
<div>>>>>>>> Sent: Friday, November
8, 2019 5:43 PM</div>
<div>>>>>>>> To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
amd-</div>
<div>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
<div>>>>>>>> Subject: Re: [PATCH]
drm/amdgpu: Fix the null pointer issue for</div>
<div>>>>>>>> tdr</div>
<div>>>>>>>></div>
<div>>>>>>>> Am 08.11.19 um 10:39
schrieb Deng, Emily:</div>
<div>>>>>>>>> Sorry, please take
your time.</div>
<div>>>>>>>> Have you seen my other
response a bit below?</div>
<div>>>>>>>></div>
<div>>>>>>>> I can't follow how it
would be possible for job->s_fence to be</div>
<div>>>>>>>> NULL without the job
also being freed.</div>
<div>>>>>>>></div>
<div>>>>>>>> So it looks like this
patch is just papering over some bigger issues.</div>
<div>>>>>>>></div>
<div>>>>>>>> Regards,</div>
<div>>>>>>>> Christian.</div>
<div>>>>>>>></div>
<div>>>>>>>>> Best wishes</div>
<div>>>>>>>>> Emily Deng</div>
<div>>>>>>>>></div>
<div>>>>>>>>></div>
<div>>>>>>>>></div>
<div>>>>>>>>>> -----Original
Message-----</div>
<div>>>>>>>>>> From: Koenig,
Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a></div>
<div>>>>>>>>>> Sent: Friday,
November 8, 2019 5:08 PM</div>
<div>>>>>>>>>> To: Deng,
Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;
amd-</div>
<div>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
<div>>>>>>>>>> Subject: Re:
[PATCH] drm/amdgpu: Fix the null pointer issue for</div>
<div>>>>>>>>>> tdr</div>
<div>>>>>>>>>></div>
<div>>>>>>>>>> Am 08.11.19 um
09:52 schrieb Deng, Emily:</div>
<div>>>>>>>>>>> Ping.....</div>
<div>>>>>>>>>> You need to
give me at least enough time to wake up :)</div>
<div>>>>>>>>>></div>
<div>>>>>>>>>>> Best
wishes</div>
<div>>>>>>>>>>> Emily Deng</div>
<div>>>>>>>>>>></div>
<div>>>>>>>>>>></div>
<div>>>>>>>>>>></div>
<div>>>>>>>>>>>>
-----Original Message-----</div>
<div>>>>>>>>>>>> From:
amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true"><amd-gfx-bounces@lists.freedesktop.org></a>
On</div>
<div>>>> Behalf</div>
<div>>>>>>>>>>>> Of
Deng, Emily</div>
<div>>>>>>>>>>>> Sent:
Friday, November 8, 2019 10:56 AM</div>
<div>>>>>>>>>>>> To:
Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
amd-</div>
<div>>>>>>>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org" moz-do-not-send="true">gfx@lists.freedesktop.org</a></div>
<div>>>>>>>>>>>>
Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer
issue</div>
<div>>>>>>>>>>>> for
tdr</div>
<div>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>
-----Original Message-----</div>
<div>>>>>>>>>>>>>
From: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true"><ckoenig.leichtzumerken@gmail.com></a></div>
<div>>>>>>>>>>>>>
Sent: Thursday, November 7, 2019 7:28 PM</div>
<div>>>>>>>>>>>>>
To: Deng, Emily <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a>;</div>
<div>>>>>>>>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>>>>>>>>>>>>>
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
issue</div>
<div>>>>>>>>>>>>>
for tdr</div>
<div>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>> Am
07.11.19 um 11:25 schrieb Emily Deng:</div>
<div>>>>>>>>>>>>>>
When the job is already signaled, the s_fence is
freed.</div>
<div>>>>>>>>>>>>>>
Then it will has null pointer in
amdgpu_device_gpu_recover.</div>
<div>>>>>>>>>>>>>
NAK, the s_fence is only set to NULL when the job is
destroyed.</div>
<div>>>>>>>>>>>>>
See drm_sched_job_cleanup().</div>
<div>>>>>>>>>>>> I know
it is set to NULL in drm_sched_job_cleanup. But in one</div>
<div>>>>>>>>>>>> case,
when it enter into the amdgpu_device_gpu_recover, it</div>
<div>>>>>>>>>>>>
already in drm_sched_job_cleanup, and at this time, it
will</div>
<div>>>>>>>>>>>> go to
free</div>
<div>>>>>> job.</div>
<div>>>>>>>>>>>> But
the amdgpu_device_gpu_recover sometimes is faster. At</div>
<div>>>>>>>>>>>> that
time, job is not freed, but s_fence is already NULL.</div>
<div>>>>>>>>>> No, that case
can't happen. See here:</div>
<div>>>>>>>>>></div>
<div>>>>>>>>>>>
drm_sched_job_cleanup(s_job);</div>
<div>>>>>>>>>>></div>
<div>>>>>>>>>>>
amdgpu_ring_priority_put(ring, s_job->s_priority);</div>
<div>>>>>>>>>>>
dma_fence_put(job->fence);</div>
<div>>>>>>>>>>>
amdgpu_sync_free(&job->sync);</div>
<div>>>>>>>>>>>
amdgpu_sync_free(&job->sched_sync);</div>
<div>>>>>>>>>>>
kfree(job);</div>
<div>>>>>>>>>> The job itself
is freed up directly after freeing the reference</div>
<div>>>>>>>>>> to the</div>
<div>>>>>> s_fence.</div>
<div>>>>>>>>>> So you are
just papering over a much bigger problem here. This</div>
<div>>>>>>>>>> patch is a
clear NAK.</div>
<div>>>>>>>>>></div>
<div>>>>>>>>>> Regards,</div>
<div>>>>>>>>>> Christian.</div>
<div>>>>>>>>>></div>
<div>>>>>>>>>>>>>
When you see a job without an s_fence then that means
the</div>
<div>>>>>>>>>>>>>
problem is somewhere else.</div>
<div>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>
Regards,</div>
<div>>>>>>>>>>>>>
Christian.</div>
<div>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>>
Signed-off-by: Emily Deng <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com" moz-do-not-send="true"><Emily.Deng@amd.com></a></div>
<div>>>>>>>>>>>>>>
---</div>
<div>>>>>>>>>>>>>>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-</div>
<div>>>>>>>>>>>>>>
drivers/gpu/drm/scheduler/sched_main.c | 11
++++++---</div>
<div>>--</div>
<div>>>>>>>>>>>>>>
2 files changed, 7 insertions(+), 6 deletions(-)</div>
<div>>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>>
diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>>>>>>>>>>>>>>
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>>>>>>>>>>>>>>
index e6ce949..5a8f08e 100644</div>
<div>>>>>>>>>>>>>>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>>>>>>>>>>>>>>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c</div>
<div>>>>>>>>>>>>>>
@@ -4075,7 +4075,7 @@ int</div>
<div>>>> amdgpu_device_gpu_recover(struct</div>
<div>>>>>>>>>>>>>
amdgpu_device *adev,</div>
<div>>>>>>>>>>>>>>
*</div>
<div>>>>>>>>>>>>>>
* job->base holds a reference to parent fence</div>
<div>>>>>>>>>>>>>>
*/</div>
<div>>>>>>>>>>>>>>
- if (job && job->base.s_fence->parent
&&</div>
<div>>>>>>>>>>>>>>
+ if (job && job->base.s_fence &&</div>
<div>>>>>>>>>>>>>>
+ job->base.s_fence->parent</div>
<div>>>>>>>> &&</div>
<div>>>>>>>>>>>>>>
dma_fence_is_signaled(job->base.s_fence->parent))</div>
<div>>>>>>>>>>>>>>
job_signaled = true;</div>
<div>>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>>
diff --git a/drivers/gpu/drm/scheduler/sched_main.c</div>
<div>>>>>>>>>>>>>>
b/drivers/gpu/drm/scheduler/sched_main.c</div>
<div>>>>>>>>>>>>>>
index 31809ca..56cc10e 100644</div>
<div>>>>>>>>>>>>>>
--- a/drivers/gpu/drm/scheduler/sched_main.c</div>
<div>>>>>>>>>>>>>>
+++ b/drivers/gpu/drm/scheduler/sched_main.c</div>
<div>>>>>>>>>>>>>>
@@ -334,8 +334,8 @@ void</div>
<div>>drm_sched_increase_karma(struct</div>
<div>>>>>>>>>>>>>
drm_sched_job</div>
<div>>>>>>>>>>>>>>
*bad)</div>
<div>>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>>
spin_lock(&rq->lock);</div>
<div>>>>>>>>>>>>>>
list_for_each_entry_safe(entity,</div>
<div>>>>>>>>>>>>>>
tmp,</div>
<div>>>> &rq-</div>
<div>>>>>>>>> entities,</div>
<div>>>>>>>>>>>>>
list) {</div>
<div>>>>>>>>>>>>>>
- if
(bad->s_fence->scheduled.context</div>
<div>>>>>>>> ==</div>
<div>>>>>>>>>>>>>>
-
entity->fence_context) {</div>
<div>>>>>>>>>>>>>>
+ if (bad->s_fence
&&</div>
<div>>>>>>>>>>>>>>
+ (bad->s_fence-</div>
<div>>>>>>>>>>>>>>
scheduled.context ==</div>
<div>>>>>>>>>>>>>>
+
entity->fence_context)) {</div>
<div>>>>>>>>>>>>>>
if</div>
<div>>>>>>>>>>>>>>
(atomic_read(&bad-</div>
<div>>>>>>>>> karma) ></div>
<div>>>>>>>>>>>>>>
bad->sched-</div>
<div>>>>> hang_limit)</div>
<div>>>>>>>>>>>>>>
if</div>
<div>>>>>>>>>>>>>>
(entity-</div>
<div>>>>> guilty) @@ -376,7 +376,7 @@ void</div>
<div>>>>>>>>>>>>>>
drm_sched_stop(struct</div>
<div>>>>>>>> drm_gpu_scheduler</div>
<div>>>>>>>>>>>>>
*sched, struct drm_sched_job *bad)</div>
<div>>>>>>>>>>>>>>
* This iteration is thread safe as sched thread</div>
<div>>>>>>>>>>>>>>
is</div>
<div>>>> stopped.</div>
<div>>>>>>>>>>>>>>
*/</div>
<div>>>>>>>>>>>>>>
list_for_each_entry_safe_reverse(s_job, tmp,</div>
<div>>>>>>>>>>>>>>
&sched- ring_mirror_list, node) {</div>
<div>>>>>>>>>>>>>>
- if (s_job->s_fence->parent &&</div>
<div>>>>>>>>>>>>>>
+ if (s_job->s_fence &&
s_job->s_fence->parent &&</div>
<div>>>>>>>>>>>>>>
dma_fence_remove_callback(s_job-</div>
<div>>>>> s_fence-</div>
<div>>>>>>>>> parent,</div>
<div>>>>>>>>>>>>>>
&s_job->cb)) {</div>
<div>>>>>>>>>>>>>>
atomic_dec(&sched->hw_rq_count);</div>
<div>>>> @@ -</div>
<div>>>>>>>> 395,7</div>
<div>>>>>>>>>>>> +395,8
@@ void</div>
<div>>>>>>>>>>>>>>
drm_sched_stop(struct drm_gpu_scheduler</div>
<div>>>>>>>>>>>>>
*sched, struct drm_sched_job *bad)</div>
<div>>>>>>>>>>>>>>
*</div>
<div>>>>>>>>>>>>>>
* Job is still alive so fence</div>
<div>>>>>>>>>>>>>>
refcount at</div>
<div>>>> least 1</div>
<div>>>>>>>>>>>>>>
*/</div>
<div>>>>>>>>>>>>>>
-
dma_fence_wait(&s_job->s_fence->finished,</div>
<div>>>>>>>> false);</div>
<div>>>>>>>>>>>>>>
+ if (s_job->s_fence)</div>
<div>>>>>>>>>>>>>>
+
dma_fence_wait(&s_job->s_fence-</div>
<div>>>>>>>>> finished,</div>
<div>>>>>>>>>>>>>
false);</div>
<div>>>>>>>>>>>>>>
/*</div>
<div>>>>>>>>>>>>>>
* We must keep bad job alive</div>
<div>>>>>>>>>>>>>>
for later</div>
<div>>>> use</div>
<div>>>>>>>> during @@</div>
<div>>>>>>>>>>>>>
-438,7</div>
<div>>>>>>>>>>>>>>
+439,7 @@ void drm_sched_start(struct
drm_gpu_scheduler</div>
<div>>>>>> *sched,</div>
<div>>>>>>>>>>>>>>
+bool</div>
<div>>>>>>>>>>>>>
full_recovery)</div>
<div>>>>>>>>>>>>>>
* GPU recovers can't run in parallel.</div>
<div>>>>>>>>>>>>>>
*/</div>
<div>>>>>>>>>>>>>>
list_for_each_entry_safe(s_job, tmp,</div>
<div>>>>>>>>>>>>>>
&sched->ring_mirror_list,</div>
<div>>>>>>>>>>>>>>
node)</div>
<div>>>>>>>>>>>>> {</div>
<div>>>>>>>>>>>>>>
- struct dma_fence *fence =
s_job->s_fence->parent;</div>
<div>>>>>>>>>>>>>>
+ struct dma_fence *fence = s_job->s_fence
?</div>
<div>>>>>>>>>>>>>>
+ s_job-</div>
<div>>>>>>>>> s_fence-</div>
<div>>>>>>>>>>>>>>
parent :</div>
<div>>>>>>>>>>>>>>
+NULL;</div>
<div>>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>>>
atomic_inc(&sched->hw_rq_count);</div>
<div>>>>>>>>>>>>>></div>
<div>>>>>>>>>>>>
_______________________________________________</div>
<div>>>>>>>>>>>>
amd-gfx mailing list</div>
<div>>>>>>>>>>>> <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>>>>>>>>>>>> <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">
https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></div>
<div>></div>
<div>>_______________________________________________</div>
<div>>amd-gfx mailing list</div>
<div>><a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a></div>
<div>><a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></div>
<div><font face="Times New Roman"> </font></div>
</span></font> </blockquote>
<br>
</blockquote>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
amd-gfx mailing list
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
<a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></pre>
</blockquote>
<br>
</blockquote>
</body>
</html>