[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Mon Nov 18 16:16:11 UTC 2019

Hi Andrey,

the only thing which doesn't looks so good is the switch to 
list_empty_careful in drm_sched_cleanup_jobs.

We either take the lock here or we don't, but please not that extra 
checking.

Christian.

Am 18.11.19 um 15:07 schrieb Andrey Grodzovsky:
> Thanks Emily.
>
> Christan - ping for review.
>
> Andrey
>
> On 11/14/19 11:39 PM, Deng, Emily wrote:
>> Hi Andrey,
>>       Currently, I am busying with another issue, maybe will try next 
>> week.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>> Sent: Friday, November 15, 2019 6:14 AM
>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> Attached.
>>>
>>> Emily - can you give it a try ?
>>>
>>> Andrey
>>>
>>> On 11/14/19 3:12 AM, Christian König wrote:
>>>>> What about instead of peeking at the job to actually remove it from
>>>>> ring_mirror_list right there,
>>>> Also an interesting idea. We would need to protect the mirror list
>>>> with a lock again, but that should be the lesser evil.
>>>>
>>>> Maybe prototype that and see if it works or not.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>>>> Another more fundamental question: Could we get rid of the timeout
>>>>>> job at all?
>>>>>
>>>>> There are other stuff there besides picking the first unfinished job
>>>>> which is common for all the drivers - such as freeing guilty signaled
>>>>> job and rearming the timeout work timer.
>>>>>
>>>>>
>>>>>> I mean we used to give this as parameter to the scheduler callback
>>>>>> because we had the timeout worker in the job, but that is no longer
>>>>>> the case.
>>>>>>
>>>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>>>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>>                                         struct drm_sched_job, 
>>>>>>> node);
>>>>>> Why don't we just remove that here and only get the first job after
>>>>>> we have stopped the scheduler?
>>>>>
>>>>> Should be ok since we have the extra check for __kthread_should_park
>>>>> in drm_sched_cleanup_jobs which will protect us in this case from a
>>>>> wakeup of sched thread and execution of in drm_sched_cleanup_jobs
>>>>> after we already parked it. The problem here is we need the
>>>>> drm_sched_job to access the private data for each client driver (see
>>>>> amdgpu_job_timedout for example). What about instead of peeking at
>>>>> the job to actually remove it from ring_mirror_list right there, go
>>>>> ahead with it through the reset routine, if it's signaled in the
>>>>> meanwhile that great - release it, otherwise put it back into
>>>>> ring_mirror_list in drm_sched_resubmit_jobs.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>>> This why I asked for a trace with timer enabled, but since there is
>>>>>>> a finite number of places we touch the timer Emily can just put
>>>>>>> prints there. Also, I wonder if this temp fix helps her with the
>>>>>>> issue or not.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>>>> The question is where do we rearm the timer for this problem to
>>>>>>>> occur?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>>>> simulate_crash.patch - waiting on guilty job to signal in reset
>>>>>>>>> work and artificially rearming the timeout timer just before the
>>>>>>>>> check for !cancel_delayed_work(&sched->work_tdr)  in
>>>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. This I
>>>>>>>>> think confirms my theory i described earlier in this thread.
>>>>>>>>>
>>>>>>>>> basic_fix.patch handles this by testing whether another timer
>>>>>>>>> already armed ob this scheduler or is there a timeout work in
>>>>>>>>> execution right now (see documentation for work_busy) - obviously
>>>>>>>>> this is not a full solution as this will not protect from races
>>>>>>>>> if for example there is immediate work scheduling such as in
>>>>>>>>> drm_sched_fault -  so we probably need to account for this by
>>>>>>>>> making drm_sched_cleanup_jobs (at least in the part where it
>>>>>>>>> iterates ring mirror list and frees jobs) and GPU reset really
>>>>>>>>> mutually exclusive and not like now.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>>>> Hi Emily,
>>>>>>>>>>
>>>>>>>>>> you need to print which scheduler instance is freeing the jobs
>>>>>>>>>> and which one is triggering the reset. The TID and PID is
>>>>>>>>>> completely meaningless here since we are called from different
>>>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>>>
>>>>>>>>>> Apart from that I will look into this a bit deeper when I have
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>      I add the follow print in function drm_sched_cleanup_jobs.
>>>>>>>>>>>  From the log it shows that only use cancel_delayed_work could
>>>>>>>>>>> not avoid to free job when the sched is in reset. But don’t
>>>>>>>>>>> know exactly where it is wrong about the driver. Do you have
>>>>>>>>>>> any suggestion about this?
>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
>>>>>>>>>>> pid:%lu\n", current->tgid, current->pid);
>>>>>>>>>>>          /*
>>>>>>>>>>>           * Don't destroy jobs while the timeout worker is
>>>>>>>>>>> running  OR thread
>>>>>>>>>>>           * is being parked and hence assumed to not touch
>>>>>>>>>>> ring_mirror_list
>>>>>>>>>>>           */
>>>>>>>>>>>           if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>>>                  return;
>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
>>>>>>>>>>> current->tgid, current->pid);
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>> ring
>>>>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587 Nov 12
>>>>>>>>>>> 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>>>>> Process information: process pid 0 thread pid 0,
>>>>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603 Nov 12 12:58:20
>>>>>>>>>>> ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000fe75ab36, tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253,
>>>>>>>>>>> pid:2253
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000f086ec84, tid:2262,
>>>>>>>>>>> pid:2262
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>>>>> Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>> Thinking more about this claim - we assume here that if
>>>>>>>>>>> cancel_delayed_work
>>>>>>>>>>>> returned true it guarantees that timeout work is not running
>>>>>>>>>>> but, it merely
>>>>>>>>>>>> means there was a pending timeout work which was removed
>>> from
>>>>>>>>>>>> the workqueue before it's timer elapsed and so it didn't have
>>>>>>>>>>>> a
>>>>>>>>>>> chance to be
>>>>>>>>>>>> dequeued and executed, it doesn't cover already executing
>>>>>>>>>>> work. So there is a
>>>>>>>>>>>> possibility where while timeout work started executing another
>>>>>>>>>>> timeout work
>>>>>>>>>>>> already got enqueued (maybe through earlier cleanup jobs or
>>>>>>>>>>> through
>>>>>>>>>>>> drm_sched_fault) and if at this point another
>>>>>>>>>>> drm_sched_cleanup_jobs runs
>>>>>>>>>>>> cancel_delayed_work(&sched->work_tdr) will return true even
>>>>>>>>>>> while there is a
>>>>>>>>>>>> timeout job in progress.
>>>>>>>>>>>> Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>>>>> cancel_delayed_work_sync to flush the timeout work as timeout
>>>>>>>>>>> work itself
>>>>>>>>>>>> waits for schedule thread  to be parked again when calling
>>>>>>>>>>> park_thread.
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>> behalf of
>>>>>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>> Sent: 08 November 2019 05:35:18
>>>>>>>>>>>> To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>
>>>>>>>>>>>> exactly that can't happen. See here:
>>>>>>>>>>>>
>>>>>>>>>>>>>           /* Don't destroy jobs while the timeout worker is
>>>>>>>>>>> running */
>>>>>>>>>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>>>>>>                   return NULL;
>>>>>>>>>>>> We never free jobs while the timeout working is running to
>>>>>>>>>>> prevent exactly
>>>>>>>>>>>> that issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>         The drm_sched_job_timedout-> amdgpu_job_timedout call
>>>>>>>>>>>> amdgpu_device_gpu_recover. I mean the main scheduler free the
>>>>>>>>>>> jobs while
>>>>>>>>>>>> in amdgpu_device_gpu_recover, and before calling
>>> drm_sched_stop.
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> well who is calling amdgpu_device_gpu_recover() in this 
>>>>>>>>>>>>>> case?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When it's not the scheduler we shouldn't have a guilty job
>>>>>>>>>>> in the first place.
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>>>>>>>>>> Hi Chrisitan,
>>>>>>>>>>>>>>>          No, I am with the new branch and also has the
>>>>>>>>>>> patch. Even it
>>>>>>>>>>>>>>> are freed by
>>>>>>>>>>>>>> main scheduler, how we could avoid main scheduler to free
>>>>>>>>>>> jobs while
>>>>>>>>>>>>>> enter to function amdgpu_device_gpu_recover?
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> in this case you are on an old code branch.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jobs are freed now by the main scheduler thread and only
>>>>>>>>>>> if no
>>>>>>>>>>>>>>>> timeout handler is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See this patch here:
>>>>>>>>>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>           Please refer to follow log, when it enter to
>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover
>>>>>>>>>>>>>>>> function, the bad job 000000005086879e is freeing in
>>>>>>>>>>> function
>>>>>>>>>>>>>>>> amdgpu_job_free_cb at the same time, because of the
>>>>>>>>>>> hardware fence
>>>>>>>>>>>>>> signal.
>>>>>>>>>>>>>>>> But amdgpu_device_gpu_recover goes faster, at this case,
>>>>>>>>>>>>>>>> the s_fence is already freed, but job is not freed in 
>>>>>>>>>>>>>>>> time.
>>>>>>>>>>> Then this issue
>>>>>>>>>>>> occurs.
>>>>>>>>>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>>>>>> *ERROR* ring
>>>>>>>>>>>>>> sdma0
>>>>>>>>>>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [
>>>>>>>>>>>>>>>>> 449.793202] [drm:amdgpu_job_timedout [amdgpu]]
>>> *ERROR*
>>>>>>>>>>>>>>>>> Process
>>>>>>>>>>> information:
>>>>>>>>>>>>>>>> process  pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>>>>> 449.794163]
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>> process pid 0 thread  pid 0, s_job:000000005086879e [
>>>>>>>>>>> 449.794221]
>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>> thread pid 0,
>>>>>>>>>>>>>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>> information: process pid 0 thread pid 0,
>>>>>>>>>>> s_job:00000000ed3a5ac6 [
>>>>>>>>>>>>>>>> 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>>>>>> dereference
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0
>>> [  449.801040]
>>>>>>>>>>> Oops:
>>>>>>>>>>>>>>>> 0000 [#1] SMP PTI
>>>>>>>>>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted:
>>>>>>>>>>> G OE
>>>>>>>>>>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX
>>> +
>>>>>>>>>>> PIIX,
>>>>>>>>>>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [
>>>>>>>>>>>>>>>>> 449.802944]
>>>>>>>>>>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched]
>>> [
>>>>>>>>>>>>>>>>> 449.803488]
>>>>>>>>>>>>>> RIP:
>>>>>>>>>>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85
>>>>>>>>>>> 56 ff ff
>>>>>>>>>>>>>>>>> ff
>>>>>>>>>>>>>>>>> 45 85 e4 0f
>>>>>>>>>>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48
>>>>>>>>>>> 8b 40 10
>>>>>>>>>>>>>>>> <48> 8b
>>>>>>>>>>>>>> 98
>>>>>>>>>>>>>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43
>>>>>>>>>>> 48 a8 01
>>>>>>>>>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS:
>>>>>>>>>>> 00010286 [
>>>>>>>>>>>>>>>>> 449.806032] RAX: 0000000000000000 RBX:
>>> 0000000000000000
>>>>>>>>>>> RCX:
>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0
>>> RSI:
>>>>>>>>>>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224]
>>> RBP:
>>>>>>>>>>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09:
>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>> 449.807818] R10: 0000000000000000 R11:
>>> 0000000000000148
>>>>>>>>>>> R12:
>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0
>>> R14:
>>>>>>>>>>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS:
>>>>>>>>>>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>>>>>>>>>>> knlGS:0000000000000000 [ 449.809674] CS:  0010 DS: 0000
>>>>>>>>>>> ES: 0000
>>>>>>>>>>>> CR0:
>>>>>>>>>>>>>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0
>>> CR3:
>>>>>>>>>>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747]
>>> DR0:
>>>>>>>>>>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>> 449.811344] DR3: 0000000000000000 DR6:
>>> 00000000fffe0ff0 DR7:
>>>>>>>>>>>>>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>>>>>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140
>>> [amdgpu]
>>>>>>>>>>>>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>>>> [amd_sched] [ 449.813139]  ?
>>>>>>>>>>>>>>>>> amdgpu_cgs_destroy_device+0x10/0x10
>>>>>>>>>>> [amdgpu] [
>>>>>>>>>>>>>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>> [amd_sched] [
>>>>>>>>>>>>>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>>>>>>>>>>>>>>>>> worker_thread+0x34/0x410 [ 449.814728]
>>>>>>>>>>> kthread+0x121/0x140 [
>>>>>>>>>>>>>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [
>>>>>>>>>>> 449.815374]  ?
>>>>>>>>>>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>>>>>>>>>> [  449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can't follow how it would be possible for
>>>>>>>>>>> job->s_fence to be
>>>>>>>>>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So it looks like this patch is just papering over some
>>>>>>>>>>> bigger issues.
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>>>>>>>>>> You need to give me at least enough time to wake up
>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>> From: amd-gfx
>>>>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>>>>>>>> Behalf
>>>>>>>>>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>;
>>>>>>>>>>>>>>>>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>> From: Christian König
>>>>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is
>>>>>>>>>>> freed.
>>>>>>>>>>>>>>>>>>>>>>>> Then it will has null pointer in
>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job
>>>>>>>>>>> is destroyed.
>>>>>>>>>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup.
>>>>>>>>>>> But in one
>>>>>>>>>>>>>>>>>>>>>> case, when it enter into the
>>>>>>>>>>> amdgpu_device_gpu_recover, it
>>>>>>>>>>>>>>>>>>>>>> already in drm_sched_job_cleanup, and at this time,
>>>>>>>>>>> it will
>>>>>>>>>>>>>>>>>>>>>> go to free
>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is
>>>>>>>>>>> faster. At
>>>>>>>>>>>>>>>>>>>>>> that time, job is not freed, but s_fence is already
>>>>>>>>>>> NULL.
>>>>>>>>>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>>>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>> s_fence.
>>>>>>>>>>>>>>>>>>>> So you are just papering over a much bigger problem
>>>>>>>>>>> here. This
>>>>>>>>>>>>>>>>>>>> patch is a clear NAK.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> When you see a job without an s_fence then that
>>>>>>>>>>> means the
>>>>>>>>>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> |  2
>>>>>>>>>>>>>>>>>>>>>>>> +- drivers/gpu/drm/scheduler/sched_main.c |
>>>>>>>>>>>>>>>>>>>>>>>> 11
>>>>>>>>>>> ++++++---
>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6
>>>>>>>>>>> deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>>>>>>>>>> ---
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>> +++
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>>>>>>>>>>>>>> amdgpu_device_gpu_recover(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>>              * job->base holds a reference to
>>>>>>>>>>> parent fence
>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>> -  if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>> +  if (job && job->base.s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>> + job->base.s_fence->parent
>>>>>>>>>>>>>>>>>> &&
>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence-
>>>> parent))
>>>>>>>>>>>>>>>>>>>>>>>> job_signaled = true;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>>>>>>>>>>>> drm_sched_increase_karma(struct
>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job
>>>>>>>>>>>>>>>>>>>>>>>> *bad)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(entity, tmp,
>>>>>>>>>>>>>> &rq-
>>>>>>>>>>>>>>>>>>> entities,
>>>>>>>>>>>>>>>>>>>>>>> list) {
>>>>>>>>>>>>>>>>>>>>>>>> - if
>>>>>>>>>>> (bad->s_fence->scheduled.context
>>>>>>>>>>>>>>>>>> ==
>>>>>>>>>>>>>>>>>>>>>>>> - entity->fence_context) {
>>>>>>>>>>>>>>>>>>>>>>>> +                          if (bad->s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>> + (bad->s_fence-
>>>>>>>>>>>>>>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>>> + entity->fence_context)) {
>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>> (atomic_read(&bad-
>>>>>>>>>>>>>>>>>>> karma) >
>>>>>>>>>>>>>>>>>>>>>>>> bad->sched-
>>>>>>>>>>>>>>> hang_limit)
>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>> (entity-
>>>>>>>>>>>>>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct
>>>>>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>> * This iteration is thread safe as
>>>>>>>>>>> sched thread
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> stopped.
>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>>> -          if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>> +          if (s_job->s_fence &&
>>>>>>>>>>> s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_remove_callback(s_job-
>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>> parent,
>>>>>>>>>>>>>>>>>>>>>>>> &s_job->cb)) {
>>>>>>>>>>>>>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>> @@ -
>>>>>>>>>>>>>>>>>> 395,7
>>>>>>>>>>>>>>>>>>>>>> +395,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>> * Job is still alive so fence refcount at
>>>>>>>>>>>>>> least 1
>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>> + if (s_job->s_fence)
>>>>>>>>>>>>>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>>>>>>>>>>>>>>>>>>> finished,
>>>>>>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>>>>>>>>>> * We must keep bad job alive for later
>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>> during @@
>>>>>>>>>>>>>>>>>>>>>>> -438,7
>>>>>>>>>>>>>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct
>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>> *sched,
>>>>>>>>>>>>>>>>>>>>>>>> +bool
>>>>>>>>>>>>>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>>>>>>>>>              */
>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>> - struct dma_fence *fence =
>>>>>>>>>>> s_job->s_fence->parent;
>>>>>>>>>>>>>>>>>>>>>>>> + struct dma_fence *fence =
>>>>>>>>>>> s_job->s_fence ?
>>>>>>>>>>>>>>>>>>>>>>>> + s_job-
>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>>> parent :
>>>>>>>>>>>>>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>>>
>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>> amd-gfx mailing list amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx