[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Deng, Emily Emily.Deng at amd.com
Fri Nov 15 04:39:44 UTC 2019


Hi Andrey,
     Currently I am busy with another issue; I will try it next week.

Best wishes
Emily Deng



>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>Sent: Friday, November 15, 2019 6:14 AM
>To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Attached.
>
>Emily - can you give it a try?
>
>Andrey
>
>On 11/14/19 3:12 AM, Christian König wrote:
>>> What about, instead of peeking at the job, actually removing it from
>>> ring_mirror_list right there,
>> Also an interesting idea. We would need to protect the mirror list
>> with a lock again, but that should be the lesser evil.
>>
>> Maybe prototype that and see if it works or not.
>>
>> Regards,
>> Christian.
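
A minimal sketch of the locking suggested above, assuming a hypothetical
new spinlock sched->job_list_lock guarding ring_mirror_list (no such
field exists in the code under discussion):

    /* Sketch only: job_list_lock is an assumed new member of
     * struct drm_gpu_scheduler, initialized in drm_sched_init(). */
    static void drm_sched_job_timedout(struct work_struct *work)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(work, struct drm_gpu_scheduler,
                                 work_tdr.work);
            struct drm_sched_job *job;

            spin_lock(&sched->job_list_lock);
            job = list_first_entry_or_null(&sched->ring_mirror_list,
                                           struct drm_sched_job, node);
            if (job)
                    /* Unlink the bad job so a concurrent
                     * drm_sched_cleanup_jobs() cannot free it while the
                     * reset handler is still using it. */
                    list_del_init(&job->node);
            spin_unlock(&sched->job_list_lock);

            if (job)
                    job->sched->ops->timedout_job(job);
    }

The price is that every other path touching ring_mirror_list
(drm_sched_cleanup_jobs, drm_sched_stop, drm_sched_resubmit_jobs) would
have to take the same lock, which is the "lesser evil" referred to above.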
>>
>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>
>>>
>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>> Another more fundamental question: Could we get rid of the timeout
>>>> job altogether?
>>>
>>>
>>> There is other stuff there besides picking the first unfinished job
>>> that is common to all the drivers - such as freeing the guilty job
>>> once it has signaled and rearming the timeout work timer.
>>>
>>>
>>>>
>>>> I mean we used to give this as a parameter to the scheduler callback
>>>> because we had the timeout worker in the job, but that is no longer
>>>> the case.
>>>>
>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>         job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>                                        struct drm_sched_job, node);
>>>>
>>>> Why don't we just remove that here and only get the first job after
>>>> we have stopped the scheduler?
>>>
>>>
>>> Should be ok, since we have the extra check for __kthread_should_park
>>> in drm_sched_cleanup_jobs which will protect us in this case from a
>>> wakeup of the sched thread and execution of drm_sched_cleanup_jobs
>>> after we already parked it. The problem here is that we need the
>>> drm_sched_job to access the private data of each client driver (see
>>> amdgpu_job_timedout for example). What about, instead of peeking at
>>> the job, actually removing it from ring_mirror_list right there and
>>> going ahead with it through the reset routine; if it signals in the
>>> meanwhile, great - release it, otherwise put it back into
>>> ring_mirror_list in drm_sched_resubmit_jobs (see the sketch below).
>>>
>>> Andrey
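
A rough sketch of the flow Andrey proposes, not a tested patch: the bad
job is unlinked from ring_mirror_list in drm_sched_job_timedout() (as in
the previous sketch) and is either released or relinked once the reset
routine is done with it. The extra "bad" parameter is assumed for
illustration; the real drm_sched_resubmit_jobs() takes only the
scheduler:

    /* Sketch only: the "bad" parameter is hypothetical. */
    void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched,
                                 struct drm_sched_job *bad)
    {
            if (bad) {
                    if (dma_fence_is_signaled(&bad->s_fence->finished))
                            /* Signaled while we were resetting: release it. */
                            sched->ops->free_job(bad);
                    else
                            /* Still unfinished: put it back so it gets
                             * resubmitted along with the other jobs. */
                            list_add(&bad->node, &sched->ring_mirror_list);
            }

            /* ... existing loop resubmitting ring_mirror_list entries ... */
    }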
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>
>>>>> This is why I asked for a trace with the timer enabled, but since
>>>>> there is a finite number of places where we touch the timer, Emily
>>>>> can just put prints there. Also, I wonder whether this temp fix
>>>>> helps her with the issue or not.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>> The question is where do we rearm the timer for this problem to
>>>>>> occur?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>> simulate_crash.patch - waiting on the guilty job to signal in the
>>>>>>> reset work and artificially rearming the timeout timer just before
>>>>>>> the check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. This, I
>>>>>>> think, confirms the theory I described earlier in this thread.
>>>>>>>
>>>>>>> basic_fix.patch handles this by testing whether another timer is
>>>>>>> already armed on this scheduler or whether a timeout work is
>>>>>>> executing right now (see the documentation for work_busy, and the
>>>>>>> sketch below) - obviously this is not a full solution, as it will
>>>>>>> not protect from races if, for example, work is scheduled
>>>>>>> immediately such as in drm_sched_fault - so we probably need to
>>>>>>> account for this by making drm_sched_cleanup_jobs (at least the
>>>>>>> part where it iterates the ring mirror list and frees jobs) and
>>>>>>> GPU reset really mutually exclusive, not like now.
>>>>>>>
>>>>>>> Andrey
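
The attachments are not reproduced in the archive, but from the
description the guard in basic_fix.patch would look roughly like this
(a sketch, not the actual patch):

    #include <linux/workqueue.h>

    static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
    {
            /* work_busy() reports WORK_BUSY_PENDING and/or
             * WORK_BUSY_RUNNING, so unlike cancel_delayed_work() this
             * also covers a timeout handler that is already executing. */
            if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
                work_busy(&sched->work_tdr.work))
                    return;

            /* ... iterate ring_mirror_list and free signaled jobs ... */
    }

As noted above, this still races against immediate re-queuing such as
drm_sched_fault(), so it is only a stopgap.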
>>>>>>>
>>>>>>>
>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>> Hi Emily,
>>>>>>>>
>>>>>>>> you need to print which scheduler instance is freeing the jobs
>>>>>>>> and which one is triggering the reset. The TID and PID are
>>>>>>>> completely meaningless here, since we are called from different
>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>
>>>>>>>> Apart from that I will look into this a bit deeper when I have
>>>>>>>> time.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>>>>> Hi Christian,
>>>>>>>>>     I added the following prints in drm_sched_cleanup_jobs.
>>>>>>>>> The log shows that using only cancel_delayed_work cannot avoid
>>>>>>>>> freeing a job while the sched is in reset. But I don't know
>>>>>>>>> exactly where the driver goes wrong. Do you have any suggestions
>>>>>>>>> about this?
>>>>>>>>> +       printk("Emily:drm_sched_cleanup_jobs:begin,tid:%d, pid:%d\n",
>>>>>>>>> +              current->tgid, current->pid);
>>>>>>>>>         /*
>>>>>>>>>          * Don't destroy jobs while the timeout worker is running OR
>>>>>>>>>          * thread is being parked and hence assumed to not touch
>>>>>>>>>          * ring_mirror_list
>>>>>>>>>          */
>>>>>>>>>         if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>             !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>                 return;
>>>>>>>>> +       printk("Emily:drm_sched_cleanup_jobs,tid:%d, pid:%d\n",
>>>>>>>>> +              current->tgid, current->pid);
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>>> Process information: process pid 0 thread pid 0,
>>>>>>>>> s_job:00000000fe75ab36, tid=15603, pid=15603
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000fe75ab36, tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253,
>>>>>>>>> pid:2253
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000f086ec84, tid:2262,
>>>>>>>>> pid:2262
>>>>>>>>> >-----Original Message-----
>>>>>>>>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>> >Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>> >To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>>>> ><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>> for tdr
>>>>>>>>> >
>>>>>>>>> >Thinking more about this claim - we assume here that if
>>>>>>>>> >cancel_delayed_work returns true it guarantees that the timeout
>>>>>>>>> >work is not running but, it merely means there was a pending
>>>>>>>>> >timeout work which was removed from the workqueue before its
>>>>>>>>> >timer elapsed, and so it didn't have a chance to be dequeued and
>>>>>>>>> >executed; it doesn't cover already executing work. So there is a
>>>>>>>>> >possibility where, while one timeout work has started executing,
>>>>>>>>> >another timeout work has already been enqueued (maybe through
>>>>>>>>> >earlier cleanup jobs or through drm_sched_fault), and if at this
>>>>>>>>> >point another drm_sched_cleanup_jobs runs,
>>>>>>>>> >cancel_delayed_work(&sched->work_tdr) will return true even
>>>>>>>>> >while there is a timeout job in progress.
>>>>>>>>> >Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>> >cancel_delayed_work_sync to flush the timeout work, as the
>>>>>>>>> >timeout work itself waits for the sched thread to be parked
>>>>>>>>> >again when calling park_thread.
>>>>>>>>> >
>>>>>>>>> >Andrey
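
An illustration of the window described above (commentary, not code from
the driver):

    /*
     * timeout work A is dequeued and starts drm_sched_job_timedout()
     *   -> a second timeout work B is queued meanwhile (the timer was
     *      re-armed by an earlier cleanup, or drm_sched_fault() fired)
     *   -> the scheduler thread runs drm_sched_cleanup_jobs():
     *        cancel_delayed_work(&sched->work_tdr) == true
     *        (it cancelled the *pending* work B; it says nothing about
     *         the *running* work A)
     *   -> cleanup frees jobs while work A is still dereferencing them
     *
     * cancel_delayed_work_sync() would wait for A to finish, but A parks
     * the scheduler thread, and the scheduler thread is the one calling
     * drm_sched_cleanup_jobs(), so flushing from there would deadlock.
     */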
>>>>>>>>> >
>>>>>>>>> >________________________________________
>>>>>>>>> >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>> behalf of
>>>>>>>>> >Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> >Sent: 08 November 2019 05:35:18
>>>>>>>>> >To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>> for tdr
>>>>>>>>> >
>>>>>>>>> >Hi Emily,
>>>>>>>>> >
>>>>>>>>> >exactly that can't happen. See here:
>>>>>>>>> >
>>>>>>>>> >>         /* Don't destroy jobs while the timeout worker is running */
>>>>>>>>> >>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>> >>             !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>> >>                 return NULL;
>>>>>>>>> >
>>>>>>>>> >We never free jobs while the timeout worker is running, to
>>>>>>>>> >prevent exactly that issue.
>>>>>>>>> >
>>>>>>>>> >Regards,
>>>>>>>>> >Christian.
>>>>>>>>> >
>>>>>>>>> >Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>>>>> >> Hi Christian,
>>>>>>>>> >>       The drm_sched_job_timedout -> amdgpu_job_timedout path
>>>>>>>>> >> calls amdgpu_device_gpu_recover. I mean the main scheduler
>>>>>>>>> >> frees the jobs while we are in amdgpu_device_gpu_recover,
>>>>>>>>> >> before calling drm_sched_stop.
>>>>>>>>> >>
>>>>>>>>> >> Best wishes
>>>>>>>>> >> Emily Deng
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>> -----Original Message-----
>>>>>>>>> >>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> >>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>> >>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>> for tdr
>>>>>>>>> >>>
>>>>>>>>> >>> Hi Emily,
>>>>>>>>> >>>
>>>>>>>>> >>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>>>> >>>
>>>>>>>>> >>> When it's not the scheduler we shouldn't have a guilty job
>>>>>>>>> in the first place.
>>>>>>>>> >>>
>>>>>>>>> >>> Regards,
>>>>>>>>> >>> Christian.
>>>>>>>>> >>>
>>>>>>>>> >>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>>>> >>>> Hi Christian,
>>>>>>>>> >>>>        No, I am on the new branch and also have the patch.
>>>>>>>>> >>>> Even if the jobs are freed by the main scheduler, how can we
>>>>>>>>> >>>> avoid the main scheduler freeing jobs while we have entered
>>>>>>>>> >>>> the function amdgpu_device_gpu_recover?
>>>>>>>>> >>>> Best wishes
>>>>>>>>> >>>> Emily Deng
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>>> -----Original Message-----
>>>>>>>>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> >>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>> >>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>> >gfx at lists.freedesktop.org
>>>>>>>>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>> issue for tdr
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Hi Emily,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> in this case you are on an old code branch.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Jobs are freed now by the main scheduler thread, and only
>>>>>>>>> >>>>> if no timeout handler is running.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> See this patch here:
>>>>>>>>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>> >>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>> >>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>       drm/scheduler: rework job destruction
>>>>>>>>> >>>>> Regards,
>>>>>>>>> >>>>> Christian.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>> >>>>>> Hi Christian,
>>>>>>>>> >>>>>>         Please refer to the following log. When it enters
>>>>>>>>> >>>>>> the amdgpu_device_gpu_recover function, the bad job
>>>>>>>>> >>>>>> 000000005086879e is being freed in amdgpu_job_free_cb at
>>>>>>>>> >>>>>> the same time, because the hardware fence signaled. But
>>>>>>>>> >>>>>> amdgpu_device_gpu_recover goes faster; in this case the
>>>>>>>>> >>>>>> s_fence is already freed, but the job is not freed in
>>>>>>>>> >>>>>> time, and then this issue occurs.
>>>>>>>>> >>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>>> >>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
>>>>>>>>> >>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>> >>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
>>>>>>>>> >>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:0000000066eb74ab
>>>>>>>>> >>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000d4438ad9
>>>>>>>>> >>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000b6d69c65
>>>>>>>>> >>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ea85e922
>>>>>>>>> >>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
>>>>>>>>> >>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>>> >>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>>>> >>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>>> >>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>> >>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>> >>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>>> >>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>> >>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>>> >>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>>> >>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>>> >>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>>> >>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>>> >>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>>> >>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>>> >>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>>> >>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> >>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>>> >>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> >>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> >>>>>> [  449.811937] Call Trace:
>>>>>>>>> >>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>>> >>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> >>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>>> >>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> >>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>>>> >>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>>>> >>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>>>> >>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>>>> >>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>> >>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>> -----Original Message-----
>>>>>>>>> >>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>> >>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>> >>> gfx at lists.freedesktop.org
>>>>>>>>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>> issue for
>>>>>>>>> >>>>>>> tdr
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>> >>>>>>>> Sorry, please take your time.
>>>>>>>>> >>>>>>> Have you seen my other response a bit below?
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> I can't follow how it would be possible for job->s_fence
>>>>>>>>> >>>>>>> to be NULL without the job also being freed.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> So it looks like this patch is just papering over some
>>>>>>>>> >>>>>>> bigger issues.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Regards,
>>>>>>>>> >>>>>>> Christian.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>>> Best wishes
>>>>>>>>> >>>>>>>> Emily Deng
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>> -----Original Message-----
>>>>>>>>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>> >>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>> >>>>> gfx at lists.freedesktop.org
>>>>>>>>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>> issue for
>>>>>>>>> >>>>>>>>> tdr
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>> >>>>>>>>>> Ping.....
>>>>>>>>> >>>>>>>>> You need to give me at least enough time to wake up
>>>>>>>>> >>>>>>>>> :)
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>>> Best wishes
>>>>>>>>> >>>>>>>>>> Emily Deng
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> -----Original Message-----
>>>>>>>>> >>>>>>>>>>> From: amd-gfx
>>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>>> >>> Behalf
>>>>>>>>> >>>>>>>>>>> Of Deng, Emily
>>>>>>>>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>;
>>>>>>>>> >>>>>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>>>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>> pointer issue
>>>>>>>>> >>>>>>>>>>> for tdr
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> -----Original Message-----
>>>>>>>>> >>>>>>>>>>>> From: Christian König
>>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>> >>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>> pointer issue
>>>>>>>>> >>>>>>>>>>>> for tdr
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is
>>>>>>>>> >>>>>>>>>>>>> freed. Then there is a null pointer dereference in
>>>>>>>>> >>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is
>>>>>>>>> >>>>>>>>>>>> destroyed. See drm_sched_job_cleanup().
>>>>>>>>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup.
>>>>>>>>> >>>>>>>>>>> But in one case, when it enters
>>>>>>>>> >>>>>>>>>>> amdgpu_device_gpu_recover, it is already in
>>>>>>>>> >>>>>>>>>>> drm_sched_job_cleanup, and at this time it will go on
>>>>>>>>> >>>>>>>>>>> to free the job. But amdgpu_device_gpu_recover is
>>>>>>>>> >>>>>>>>>>> sometimes faster. At that time the job is not freed
>>>>>>>>> >>>>>>>>>>> yet, but the s_fence is already NULL.
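
A timeline of the race Emily describes, as an illustration only (the
function and field names follow the code quoted in this thread):

    /*
     * scheduler thread (free path)        timeout worker (reset path)
     * ----------------------------        ---------------------------
     * HW fence signals
     * drm_sched_cleanup_jobs()
     *   amdgpu_job_free_cb()
     *     drm_sched_job_cleanup()
     *       -> job->s_fence = NULL        amdgpu_job_timedout()
     *     ... kfree(job) not                amdgpu_device_gpu_recover()
     *         reached yet ...                 job->base.s_fence->parent
     *                                         -> NULL pointer dereference
     */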
>>>>>>>>> >>>>>>>>> No, that case can't happen. See here:
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>> >>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>> >>>>>>>>>> kfree(job);
>>>>>>>>> >>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>> >>>>>>>>> reference to the s_fence. So you are just papering over
>>>>>>>>> >>>>>>>>> a much bigger problem here. This patch is a clear NAK.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Regards,
>>>>>>>>> >>>>>>>>> Christian.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> When you see a job without an s_fence then that
>>>>>>>>> >>>>>>>>>>>> means the problem is somewhere else.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Regards,
>>>>>>>>> >>>>>>>>>>>> Christian.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>> >>>>>>>>>>>>> ---
>>>>>>>>> >>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>> >>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>> >>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>> >>>>>>>>>>>>>  	 *
>>>>>>>>> >>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>> >>>>>>>>>>>>>  	 */
>>>>>>>>> >>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>> >>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>> >>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>> >>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>> >>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>> >>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>> >>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>> >>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>> >>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>> >>>>>>>>>>>>>  					if (atomic_read(&bad->karma) > bad->sched->hang_limit)
>>>>>>>>> >>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>> >>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>> >>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>> >>>>>>>>>>>>>  	 */
>>>>>>>>> >>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>> >>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>> >>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>> >>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>> >>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>> >>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>> >>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>> >>>>>>>>>>>>>  			 *
>>>>>>>>> >>>>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>> >>>>>>>>>>>>>  			 */
>>>>>>>>> >>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>> >>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>> >>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>>  			/*
>>>>>>>>> >>>>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>> >>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>> >>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>> >>>>>>>>>>>>>  	 */
>>>>>>>>> >>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>> >>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>> >>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);