[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Thu Nov 14 22:14:20 UTC 2019

Attached.

Emily - can you give it a try?

Andrey

On 11/14/19 3:12 AM, Christian König wrote:
>> What about instead of peeking at the job to actually remove it from 
>> ring_mirror_list right there,
> Also an interesting idea. We would need to protect the mirror list 
> with a lock again, but that should be the lesser evil.
>
> Maybe prototype that and see if it works or not.
>
> Regards,
> Christian.
>
> On 13.11.19 at 17:00, Andrey Grodzovsky wrote:
>>
>>
>> On 11/13/19 9:20 AM, Christian König wrote:
>>> Another more fundamental question: Could we get rid of the timeout 
>>> job at all?
>>
>>
>> There is other stuff there, besides picking the first unfinished job,
>> that is common to all the drivers - such as freeing the guilty signaled
>> job and rearming the timeout work timer.
>>
>>
>>>
>>> I mean we used to give this as parameter to the scheduler callback 
>>> because we had the timeout worker in the job, but that is no longer 
>>> the case.
>>>
>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>         job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>                                        struct drm_sched_job, node);
>>>
>>> Why don't we just remove that here and only get the first job after 
>>> we have stopped the scheduler?
>>
>>
>> Should be OK, since we have the extra check for __kthread_should_park
>> in drm_sched_cleanup_jobs, which will protect us in this case from a
>> wakeup of the sched thread and execution of drm_sched_cleanup_jobs
>> after we have already parked it. The problem here is that we need the
>> drm_sched_job to access the private data for each client driver (see
>> amdgpu_job_timedout for example). What about, instead of peeking at
>> the job, actually removing it from ring_mirror_list right there and
>> going ahead with it through the reset routine? If it's signaled in the
>> meantime, great - release it; otherwise put it back into
>> ring_mirror_list in drm_sched_resubmit_jobs.
>>
>> Andrey
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 13.11.19 at 15:12, Andrey Grodzovsky wrote:
>>>>
>>>> This is why I asked for a trace with timer enabled, but since there is
>>>> a finite number of places where we touch the timer, Emily can just put
>>>> prints there. Also, I wonder whether this temp fix helps her with the
>>>> issue or not.
>>>>
>>>> Andrey
>>>>
>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>> The question is where do we rearm the timer for this problem to 
>>>>> occur?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 12.11.19 at 20:21, Andrey Grodzovsky wrote:
>>>>>>
>>>>>> I was able to reproduce the crash using the attached
>>>>>> simulate_crash.patch - waiting on the guilty job to signal in the reset
>>>>>> work and artificially rearming the timeout timer just before the
>>>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. I think this
>>>>>> confirms the theory I described earlier in this thread.
>>>>>>
>>>>>> basic_fix.patch handles this by testing whether another timer is
>>>>>> already armed on this scheduler or whether a timeout work is
>>>>>> executing right now (see the documentation for work_busy).
>>>>>> Obviously this is not a full solution, as it will not protect
>>>>>> from races if, for example, there is immediate work scheduling such
>>>>>> as in drm_sched_fault. So we probably need to account for this
>>>>>> by making drm_sched_cleanup_jobs (at least the part where it
>>>>>> iterates the ring mirror list and frees jobs) and GPU reset really
>>>>>> mutually exclusive, unlike now.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>> Hi Emily,
>>>>>>>
>>>>>>> you need to print which scheduler instance is freeing the jobs
>>>>>>> and which one is triggering the reset. The TID and PID are
>>>>>>> completely meaningless here since we are called from different
>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>
>>>>>>> Apart from that I will look into this a bit deeper when I have 
>>>>>>> time.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 12.11.19 at 07:02, Deng, Emily wrote:
>>>>>>>> Hi Christian,
>>>>>>>>     I added the following prints in function drm_sched_cleanup_jobs.
>>>>>>>> The log shows that using cancel_delayed_work alone does not
>>>>>>>> prevent jobs from being freed while the sched is in reset. But I
>>>>>>>> don't know exactly what is wrong in the driver. Do you have
>>>>>>>> any suggestions about this?
>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu, pid:%lu\n", current->tgid, current->pid);
>>>>>>>>         /*
>>>>>>>>          * Don't destroy jobs while the timeout worker is running OR thread
>>>>>>>>          * is being parked and hence assumed to not touch ring_mirror_list
>>>>>>>>          */
>>>>>>>>         if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>              !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>                 return;
>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n", current->tgid, current->pid);
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring 
>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* 
>>>>>>>> Process information: process pid 0 thread pid 0, 
>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information: 
>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000fe75ab36, tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, 
>>>>>>>> pid:2253
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: 
>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information: 
>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000f086ec84, tid:2262, 
>>>>>>>> pid:2262
>>>>>>>> >-----Original Message-----
>>>>>>>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>> >Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>> >To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>>> ><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue 
>>>>>>>> for tdr
>>>>>>>> >
>>>>>>>> >Thinking more about this claim - we assume here that if 
>>>>>>>> cancel_delayed_work
>>>>>>>> >returned true it guarantees that timeout work is not running 
>>>>>>>> but, it merely
>>>>>>>> >means there was a pending timeout work which was removed from the
>>>>>>>> >workqueue before its timer elapsed and so it didn't have a 
>>>>>>>> chance to be
>>>>>>>> >dequeued and executed, it doesn't cover already executing 
>>>>>>>> work. So there is a
>>>>>>>> >possibility where while timeout work started executing another 
>>>>>>>> timeout work
>>>>>>>> >already got enqueued (maybe through earlier cleanup jobs or 
>>>>>>>> through
>>>>>>>> >drm_sched_fault) and if at this point another 
>>>>>>>> drm_sched_cleanup_jobs runs
>>>>>>>> >cancel_delayed_work(&sched->work_tdr) will return true even 
>>>>>>>> while there is a
>>>>>>>> >timeout job in progress.
>>>>>>>> >Unfortunately we cannot change cancel_delayed_work to
>>>>>>>> >cancel_delayed_work_sync to flush the timeout work as timeout 
>>>>>>>> work itself
>>>>>>>> >waits for schedule thread  to be parked again when calling 
>>>>>>>> park_thread.
>>>>>>>> >
>>>>>>>> >Andrey
>>>>>>>> >
>>>>>>>> >________________________________________
>>>>>>>> >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on 
>>>>>>>> behalf of
>>>>>>>> >Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> >Sent: 08 November 2019 05:35:18
>>>>>>>> >To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue 
>>>>>>>> for tdr
>>>>>>>> >
>>>>>>>> >Hi Emily,
>>>>>>>> >
>>>>>>>> >exactly that can't happen. See here:
>>>>>>>> >
>>>>>>>> >>         /* Don't destroy jobs while the timeout worker is 
>>>>>>>> running */
>>>>>>>> >>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>> >> !cancel_delayed_work(&sched->work_tdr))
>>>>>>>> >>                 return NULL;
>>>>>>>> >
>>>>>>>> >We never free jobs while the timeout worker is running to 
>>>>>>>> prevent exactly
>>>>>>>> >that issue.
>>>>>>>> >
>>>>>>>> >Regards,
>>>>>>>> >Christian.
>>>>>>>> >
>>>>>>>> >On 08.11.19 at 11:32, Deng, Emily wrote:
>>>>>>>> >> Hi Christian,
>>>>>>>> >>       drm_sched_job_timedout -> amdgpu_job_timedout calls
>>>>>>>> >> amdgpu_device_gpu_recover. I mean that the main scheduler frees the
>>>>>>>> >> jobs while we are in amdgpu_device_gpu_recover, before
>>>>>>>> >> drm_sched_stop is called.
>>>>>>>> >>
>>>>>>>> >> Best wishes
>>>>>>>> >> Emily Deng
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>> -----Original Message-----
>>>>>>>> >>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> >>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>> >>> To: Deng, Emily <Emily.Deng at amd.com>; 
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue 
>>>>>>>> for tdr
>>>>>>>> >>>
>>>>>>>> >>> Hi Emily,
>>>>>>>> >>>
>>>>>>>> >>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>>> >>>
>>>>>>>> >>> When it's not the scheduler we shouldn't have a guilty job 
>>>>>>>> in the first place.
>>>>>>>> >>>
>>>>>>>> >>> Regards,
>>>>>>>> >>> Christian.
>>>>>>>> >>>
>>>>>>>> >>> On 08.11.19 at 11:22, Deng, Emily wrote:
>>>>>>>> >>>> Hi Christian,
>>>>>>>> >>>>        No, I am on the new branch and also have the patch. Even if
>>>>>>>> >>>> they are freed by the main scheduler, how can we prevent the main
>>>>>>>> >>>> scheduler from freeing jobs while we are entering
>>>>>>>> >>>> amdgpu_device_gpu_recover?
>>>>>>>> >>>> Best wishes
>>>>>>>> >>>> Emily Deng
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>> -----Original Message-----
>>>>>>>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> >>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>> >>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>> >gfx at lists.freedesktop.org
>>>>>>>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>> issue for tdr
>>>>>>>> >>>>>
>>>>>>>> >>>>> Hi Emily,
>>>>>>>> >>>>>
>>>>>>>> >>>>> in this case you are on an old code branch.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Jobs are freed now by the main scheduler thread and only 
>>>>>>>> if no
>>>>>>>> >>>>> timeout handler is running.
>>>>>>>> >>>>>
>>>>>>>> >>>>> See this patch here:
>>>>>>>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>> >>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>> >>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>       drm/scheduler: rework job destruction
>>>>>>>> >>>>> Regards,
>>>>>>>> >>>>> Christian.
>>>>>>>> >>>>>
>>>>>>>> >>>>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>>>>>>> >>>>>> Hi Christian,
>>>>>>>> >>>>>>         Please refer to the following log. When we enter the
>>>>>>>> >>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e
>>>>>>>> >>>>>> is being freed in amdgpu_job_free_cb at the same time, because the
>>>>>>>> >>>>>> hardware fence has signaled. But amdgpu_device_gpu_recover goes
>>>>>>>> >>>>>> faster: in this case the s_fence is already freed, but the job is
>>>>>>>> >>>>>> not freed in time. Then this issue occurs.
>>>>>>>> >>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>> >>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>> >>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>> >>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread  pid 0, s_job:000000005086879e
>>>>>>>> >>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>>> >>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>>> >>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>>> >>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>>> >>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread  pid 0, s_job:00000000ed3a5ac6
>>>>>>>> >>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>> >>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>>> >>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>> >>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>> >>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>> >>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>> >>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>> >>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>> >>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>> >>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>> >>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>> >>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>> >>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>> >>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>> >>>>>> [  449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>> >>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> >>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>> >>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> >>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>> >>>>>> [  449.811937] Call Trace:
>>>>>>>> >>>>>> [  449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>> >>>>>> [  449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>> >>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>> >>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>> >>>>>> [  449.814077] process_one_work+0x1fd/0x3f0
>>>>>>>> >>>>>> [  449.814417] worker_thread+0x34/0x410
>>>>>>>> >>>>>> [  449.814728] kthread+0x121/0x140
>>>>>>>> >>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>>> >>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>> >>>>>> [  449.815799] ret_from_fork+0x35/0x40
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>> -----Original Message-----
>>>>>>>> >>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>> >>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>> >>> gfx at lists.freedesktop.org
>>>>>>>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>> issue for
>>>>>>>> >>>>>>> tdr
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>>>>> >>>>>>>> Sorry, please take your time.
>>>>>>>> >>>>>>> Have you seen my other response a bit below?
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>> >>>>>>> NULL without the job also being freed.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> So it looks like this patch is just papering over some bigger
>>>>>>>> >>>>>>> issues.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Regards,
>>>>>>>> >>>>>>> Christian.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>>> Best wishes
>>>>>>>> >>>>>>>> Emily Deng
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>>> -----Original Message-----
>>>>>>>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>> >>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>> >>>>> gfx at lists.freedesktop.org
>>>>>>>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>> issue for
>>>>>>>> >>>>>>>>> tdr
>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>>> >>>>>>>>>> Ping.....
>>>>>>>> >>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>>>> Best wishes
>>>>>>>> >>>>>>>>>> Emily Deng
>>>>>>>> >>>>>>>>>>
>>>>>>>> >>>>>>>>>>
>>>>>>>> >>>>>>>>>>
>>>>>>>> >>>>>>>>>>> -----Original Message-----
>>>>>>>> >>>>>>>>>>> From: amd-gfx 
>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>> >>> Behalf
>>>>>>>> >>>>>>>>>>> Of Deng, Emily
>>>>>>>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>>> >>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null 
>>>>>>>> pointer issue
>>>>>>>> >>>>>>>>>>> for tdr
>>>>>>>> >>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>> -----Original Message-----
>>>>>>>> >>>>>>>>>>>> From: Christian König 
>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>> >>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null 
>>>>>>>> pointer issue
>>>>>>>> >>>>>>>>>>>> for tdr
>>>>>>>> >>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>> >>>>>>>>>>>>> Then there is a null pointer dereference in
>>>>>>>> >>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job 
>>>>>>>> is destroyed.
>>>>>>>> >>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>> >>>>>>>>>>> case, when we enter amdgpu_device_gpu_recover, the job is
>>>>>>>> >>>>>>>>>>> already in drm_sched_job_cleanup and is about to be freed.
>>>>>>>> >>>>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At
>>>>>>>> >>>>>>>>>>> that time, the job is not freed yet, but s_fence is already
>>>>>>>> >>>>>>>>>>> NULL.
>>>>>>>> >>>>>>>>> No, that case can't happen. See here:
>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>> >>>>>>>>>>
>>>>>>>> >>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>> >>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>> >>>>>>>>>> kfree(job);
>>>>>>>> >>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>> >>>>>>>>> to the s_fence.
>>>>>>>> >>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>>> >>>>>>>>> patch is a clear NAK.
>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>>> Regards,
>>>>>>>> >>>>>>>>> Christian.
>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>> >>>>>>>>>>>> problem is somewhere else.
>>>>>>>> >>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>> Regards,
>>>>>>>> >>>>>>>>>>>> Christian.
>>>>>>>> >>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>> >>>>>>>>>>>>> ---
>>>>>>>> >>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>> >>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>> >>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>> >>>>>>>>>>>>>           *
>>>>>>>> >>>>>>>>>>>>>           * job->base holds a reference to parent fence
>>>>>>>> >>>>>>>>>>>>>           */
>>>>>>>> >>>>>>>>>>>>> -        if (job && job->base.s_fence->parent &&
>>>>>>>> >>>>>>>>>>>>> +        if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>> >>>>>>>>>>>>>              dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>> >>>>>>>>>>>>>                  job_signaled = true;
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>>                          spin_lock(&rq->lock);
>>>>>>>> >>>>>>>>>>>>>                          list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>> >>>>>>>>>>>>> -                                if (bad->s_fence->scheduled.context ==
>>>>>>>> >>>>>>>>>>>>> -                                    entity->fence_context) {
>>>>>>>> >>>>>>>>>>>>> +                                if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>> >>>>>>>>>>>>> +                                    entity->fence_context)) {
>>>>>>>> >>>>>>>>>>>>>                                          if (atomic_read(&bad->karma) >
>>>>>>>> >>>>>>>>>>>>>                                              bad->sched->hang_limit)
>>>>>>>> >>>>>>>>>>>>>                                                  if (entity->guilty)
>>>>>>>> >>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>> >>>>>>>>>>>>>           * This iteration is thread safe as sched thread is stopped.
>>>>>>>> >>>>>>>>>>>>>           */
>>>>>>>> >>>>>>>>>>>>>          list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> >>>>>>>>>>>>> -                if (s_job->s_fence->parent &&
>>>>>>>> >>>>>>>>>>>>> +                if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>> >>>>>>>>>>>>>                      dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>> >>>>>>>>>>>>>                                                &s_job->cb)) {
>>>>>>>> >>>>>>>>>>>>>                          atomic_dec(&sched->hw_rq_count);
>>>>>>>> >>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>> >>>>>>>>>>>>>                           *
>>>>>>>> >>>>>>>>>>>>>                           * Job is still alive so fence refcount at least 1
>>>>>>>> >>>>>>>>>>>>>                           */
>>>>>>>> >>>>>>>>>>>>> -                        dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>> >>>>>>>>>>>>> +                        if (s_job->s_fence)
>>>>>>>> >>>>>>>>>>>>> +                                dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>>                          /*
>>>>>>>> >>>>>>>>>>>>>                           * We must keep bad job alive for later use during
>>>>>>>> >>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>> >>>>>>>>>>>>>           * GPU recovers can't run in parallel.
>>>>>>>> >>>>>>>>>>>>>           */
>>>>>>>> >>>>>>>>>>>>>          list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> >>>>>>>>>>>>> -                struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>> >>>>>>>>>>>>> +                struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>>                  atomic_inc(&sched->hw_rq_count);
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>>>>
>>>>>>>> >>>>>>>>>>> _______________________________________________
>>>>>>>> >>>>>>>>>>> amd-gfx mailing list
>>>>>>>> >>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> >>>>>>>>>>> 
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx 
>>>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>>>>>> >
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
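
For readers following the thread, here is a minimal userspace sketch of the idea discussed above: the timeout handler removes the first job from the mirror list under a lock instead of only peeking at it, so the cleanup path cannot free it during the reset, and afterwards the job is either released (if it signaled in the meantime) or put back. This is not the attached patch and not kernel code. The types `struct job` and `struct sched`, the function names, and the `freed` flag are all hypothetical stand-ins, and a C11 `atomic_flag` spinlock stands in for whatever lock would protect ring_mirror_list.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drm_sched_job. */
struct job {
    struct job *next;
    bool signaled; /* stands in for "hardware fence has signaled" */
    bool freed;    /* set instead of calling free(), so state stays inspectable */
};

/* Hypothetical stand-in for struct drm_gpu_scheduler. */
struct sched {
    atomic_flag list_lock; /* spinlock protecting the mirror list */
    struct job *head;      /* oldest job first */
};

static void sched_init(struct sched *s)
{
    atomic_flag_clear(&s->list_lock);
    s->head = NULL;
}

static void lock_list(struct sched *s)
{
    while (atomic_flag_test_and_set(&s->list_lock))
        ; /* spin until the lock is free */
}

static void unlock_list(struct sched *s)
{
    atomic_flag_clear(&s->list_lock);
}

/* Timeout path: detach the oldest job so the cleanup path can no longer see it. */
static struct job *timeout_take_first(struct sched *s)
{
    lock_list(s);
    struct job *j = s->head;
    if (j) {
        s->head = j->next;
        j->next = NULL;
    }
    unlock_list(s);
    return j;
}

/* After the reset: release the job if it signaled, otherwise put it back at the head. */
static void timeout_finish(struct sched *s, struct job *j)
{
    if (!j)
        return;
    if (j->signaled) {
        j->freed = true;
        return;
    }
    lock_list(s);
    j->next = s->head;
    s->head = j;
    unlock_list(s);
}

/* Cleanup path: free signaled jobs from the front of the list; returns how many. */
static int cleanup_jobs(struct sched *s)
{
    int n = 0;
    lock_list(s);
    while (s->head && s->head->signaled) {
        struct job *j = s->head;
        s->head = j->next;
        j->freed = true;
        n++;
    }
    unlock_list(s);
    return n;
}
```

In this sketch a job that has been taken by `timeout_take_first` is invisible to `cleanup_jobs` until `timeout_finish` runs, which is the mutual exclusion the thread is looking for. Whether the same ordering holds in the real scheduler depends on details (parking of the scheduler thread, drm_sched_resubmit_jobs) that the sketch does not model.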
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-scheduler-Avoid-accessing-freed-bad-job.patch
Type: text/x-patch
Size: 3225 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191114/a0428fd3/attachment-0001.bin>