[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Wed Nov 13 16:00:19 UTC 2019
On 11/13/19 9:20 AM, Christian König wrote:
> Another more fundamental question: Could we get rid of the timeout job
> at all?
There are other stuff there besides picking the first unfinished job
which is common for all the drivers - such as freeing guilty signaled
job and rearming the timeout work timer.
>
> I mean we used to give this as parameter to the scheduler callback
> because we had the timeout worker in the job, but that is no longer
> the case.
>
> E.g. in drm_sched_job_timedout() we do the following:
>> job = list_first_entry_or_null(&sched->ring_mirror_list,
>> struct drm_sched_job, node);
>
> Why don't we just remove that here and only get the first job after we
> have stopped the scheduler?
Should be ok since we have the extra check for __kthread_should_park in
drm_sched_cleanup_jobs which will protect us in this case from a wakeup
of sched thread and execution of in drm_sched_cleanup_jobs after we
already parked it. The problem here is we need the drm_sched_job to
access the private data for each client driver (see amdgpu_job_timedout
for example). What about instead of peeking at the job to actually
remove it from ring_mirror_list right there, go ahead with it through
the reset routine, if it's signaled in the meanwhile that great -
release it, otherwise put it back into ring_mirror_list in
drm_sched_resubmit_jobs.
Andrey
>
> Regards,
> Christian.
>
> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>
>> This why I asked for a trace with timer enabled, but since there is a
>> finite number of places we touch the timer Emily can just put prints
>> there. Also, I wonder if this temp fix helps her with the issue or not.
>>
>> Andrey
>>
>> On 11/13/19 2:36 AM, Christian König wrote:
>>> The question is where do we rearm the timer for this problem to occur?
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>
>>>> I was able to reproduce the crash by using the attached
>>>> simulate_crash.patch - waiting on guilty job to signal in reset
>>>> work and artificially rearming the timeout timer just before the
>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. This I
>>>> think confirms my theory i described earlier in this thread.
>>>>
>>>> basic_fix.patch handles this by testing whether another timer
>>>> already armed ob this scheduler or is there a timeout work in
>>>> execution right now (see documentation for work_busy) - obviously
>>>> this is not a full solution as this will not protect from races if
>>>> for example there is immediate work scheduling such as in
>>>> drm_sched_fault - so we probably need to account for this by
>>>> making drm_sched_cleanup_jobs (at least in the part where it
>>>> iterates ring mirror list and frees jobs) and GPU reset really
>>>> mutually exclusive and not like now.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>> Hi Emily,
>>>>>
>>>>> you need to print which scheduler instance is freeing the jobs and
>>>>> which one is triggering the reset. The TID and PID is completely
>>>>> meaningless here since we are called from different worker threads
>>>>> and the TID/PID can change on each call.
>>>>>
>>>>> Apart from that I will look into this a bit deeper when I have time.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>> Hi Christian,
>>>>>> I add the follow print in function drm_sched_cleanup_jobs.
>>>>>> From the log it shows that only use cancel_delayed_work could not
>>>>>> avoid to free job when the sched is in reset. But don’t know
>>>>>> exactly where it is wrong about the driver. Do you have any
>>>>>> suggestion about this?
>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu, pid:%lu\n",
>>>>>> current->tgid, current->pid);
>>>>>> /*
>>>>>> * Don't destroy jobs while the timeout worker is
>>>>>> running OR thread
>>>>>> * is being parked and hence assumed to not touch
>>>>>> ring_mirror_list
>>>>>> */
>>>>>> if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>> !cancel_delayed_work(&sched->work_tdr)))
>>>>>> return;
>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
>>>>>> current->tgid, current->pid);
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>>>>>> information: process pid 0 thread pid 0,
>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information:
>>>>>> process pid 0 thread pid 0, s_job:00000000fe75ab36, tid:2262,
>>>>>> pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information:
>>>>>> process pid 0 thread pid 0, s_job:00000000f086ec84, tid:2262,
>>>>>> pid:2262
>>>>>> >-----Original Message-----
>>>>>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>> >Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>> >To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>> ><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>> >
>>>>>> >Thinking more about this claim - we assume here that if cancel_delayed_work
>>>>>> >returned true it guarantees that timeout work is not running but, it merely
>>>>>> >means there was a pending timeout work which was removed from the
>>>>>> >workqueue before it's timer elapsed and so it didn't have a chance to be
>>>>>> >dequeued and executed, it doesn't cover already executing work. So there is a
>>>>>> >possibility where while timeout work started executing another timeout work
>>>>>> >already got enqueued (maybe through earlier cleanup jobs or through
>>>>>> >drm_sched_fault) and if at this point another drm_sched_cleanup_jobs runs
>>>>>> >cancel_delayed_work(&sched->work_tdr) will return true even while there is a
>>>>>> >timeout job in progress.
>>>>>> >Unfortunately we cannot change cancel_delayed_work to
>>>>>> >cancel_delayed_work_sync to flush the timeout work as timeout work itself
>>>>>> >waits for schedule thread to be parked again when calling park_thread.
>>>>>> >
>>>>>> >Andrey
>>>>>> >
>>>>>> >________________________________________
>>>>>> >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>>>>> >Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> >Sent: 08 November 2019 05:35:18
>>>>>> >To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>> >
>>>>>> >Hi Emily,
>>>>>> >
>>>>>> >exactly that can't happen. See here:
>>>>>> >
>>>>>> >> /* Don't destroy jobs while the timeout worker is running */
>>>>>> >> if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>> >> !cancel_delayed_work(&sched->work_tdr))
>>>>>> >> return NULL;
>>>>>> >
>>>>>> >We never free jobs while the timeout working is running to prevent exactly
>>>>>> >that issue.
>>>>>> >
>>>>>> >Regards,
>>>>>> >Christian.
>>>>>> >
>>>>>> >Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>> >> Hi Christian,
>>>>>> >> The drm_sched_job_timedout-> amdgpu_job_timedout call
>>>>>> >amdgpu_device_gpu_recover. I mean the main scheduler free the jobs while
>>>>>> >in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>>>>>> >>
>>>>>> >> Best wishes
>>>>>> >> Emily Deng
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>> -----Original Message-----
>>>>>> >>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> >>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>> >>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>> >>>
>>>>>> >>> Hi Emily,
>>>>>> >>>
>>>>>> >>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>> >>>
>>>>>> >>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> Christian.
>>>>>> >>>
>>>>>> >>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>> >>>> Hi Chrisitan,
>>>>>> >>>> No, I am with the new branch and also has the patch. Even it
>>>>>> >>>> are freed by
>>>>>> >>> main scheduler, how we could avoid main scheduler to free jobs while
>>>>>> >>> enter to function amdgpu_device_gpu_recover?
>>>>>> >>>> Best wishes
>>>>>> >>>> Emily Deng
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>> -----Original Message-----
>>>>>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> >>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>> >>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>> >gfx at lists.freedesktop.org
>>>>>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>> >>>>>
>>>>>> >>>>> Hi Emily,
>>>>>> >>>>>
>>>>>> >>>>> in this case you are on an old code branch.
>>>>>> >>>>>
>>>>>> >>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>> >>>>> timeout handler is running.
>>>>>> >>>>>
>>>>>> >>>>> See this patch here:
>>>>>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>> >>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>> >>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>>>> >>>>>>
>>>>>> >>>>>> drm/scheduler: rework job destruction
>>>>>> >>>>> Regards,
>>>>>> >>>>> Christian.
>>>>>> >>>>>
>>>>>> >>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>> >>>>>> Hi Christian,
>>>>>> >>>>>> Please refer to follow log, when it enter to
>>>>>> >>>>>> amdgpu_device_gpu_recover
>>>>>> >>>>> function, the bad job 000000005086879e is freeing in function
>>>>>> >>>>> amdgpu_job_free_cb at the same time, because of the hardware fence
>>>>>> >>> signal.
>>>>>> >>>>> But amdgpu_device_gpu_recover goes faster, at this case, the
>>>>>> >>>>> s_fence is already freed, but job is not freed in time. Then this issue
>>>>>> >occurs.
>>>>>> >>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>>> >>> sdma0
>>>>>> >>>>>> timeout, signaled seq=2481, emitted seq=2483 [ 449.793202]
>>>>>> >>>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>>>>>> >>>>> process pid 0 thread pid 0, s_job:000000005086879e [ 449.794163]
>>>>>> >>>>> amdgpu
>>>>>> >>>>> 0000:00:08.0: GPU reset begin!
>>>>>> >>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information:
>>>>>> >>>>>> process pid 0 thread pid 0, s_job:000000005086879e [ 449.794221]
>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>>>>>> >>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>>>>>> >>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>>>>>> >>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>>>>>> >>>>>> thread pid 0,
>>>>>> >>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>> >>>>> Emily:amdgpu_job_free_cb,Process
>>>>>> >>>>> information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6 [
>>>>>> >>>>> 449.794366] BUG: unable to handle kernel NULL pointer dereference
>>>>>> >>>>> at
>>>>>> >>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0 [ 449.801040] Oops:
>>>>>> >>>>> 0000 [#1] SMP PTI
>>>>>> >>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE
>>>>>> >>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>> >>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>>> >>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 449.802944]
>>>>>> >>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [
>>>>>> >>>>>> 449.803488]
>>>>>> >>> RIP:
>>>>>> >>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>> >>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff
>>>>>> >>>>>> ff
>>>>>> >>>>>> 45 85 e4 0f
>>>>>> >>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10
>>>>>> >>>>> <48> 8b
>>>>>> >>> 98
>>>>>> >>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>> >>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>>>>>> >>>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>>>> >>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>>>>>> >>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224] RBP:
>>>>>> >>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>>>>>> >>>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>>>>>> >>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0 R14:
>>>>>> >>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS:
>>>>>> >>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>> >>>>>> knlGS:0000000000000000 [ 449.809674] CS: 0010 DS: 0000 ES: 0000
>>>>>> >CR0:
>>>>>> >>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0 CR3:
>>>>>> >>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747] DR0:
>>>>>> >>>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
>>>>>> >>>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>> >>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>> >>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>>>>>> >>>>>> 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>> >>>>>> 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>>>>>> >>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>> >>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>>>>>> >>>>>> worker_thread+0x34/0x410 [ 449.814728] kthread+0x121/0x140 [
>>>>>> >>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [ 449.815374] ?
>>>>>> >>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>> >>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>>>>>> >>>>>>
>>>>>> >>>>>>> -----Original Message-----
>>>>>> >>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>> >>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>> >>> gfx at lists.freedesktop.org
>>>>>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>> >>>>>>> tdr
>>>>>> >>>>>>>
>>>>>> >>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>> >>>>>>>> Sorry, please take your time.
>>>>>> >>>>>>> Have you seen my other response a bit below?
>>>>>> >>>>>>>
>>>>>> >>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>> >>>>>>> NULL without the job also being freed.
>>>>>> >>>>>>>
>>>>>> >>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> Christian.
>>>>>> >>>>>>>
>>>>>> >>>>>>>> Best wishes
>>>>>> >>>>>>>> Emily Deng
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>> -----Original Message-----
>>>>>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>> >>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>> >>>>> gfx at lists.freedesktop.org
>>>>>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>> >>>>>>>>> tdr
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>> >>>>>>>>>> Ping.....
>>>>>> >>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>>> Best wishes
>>>>>> >>>>>>>>>> Emily Deng
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>> -----Original Message-----
>>>>>> >>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>> >>> Behalf
>>>>>> >>>>>>>>>>> Of Deng, Emily
>>>>>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>> >>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>> >>>>>>>>>>> for tdr
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>>> -----Original Message-----
>>>>>> >>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>> >>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>> >>>>>>>>>>>> for tdr
>>>>>> >>>>>>>>>>>>
>>>>>> >>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>> >>>>>>>>>>>>> Then it will has null pointer in amdgpu_device_gpu_recover.
>>>>>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>> >>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>> >>>>>>>>>>> case, when it enter into the amdgpu_device_gpu_recover, it
>>>>>> >>>>>>>>>>> already in drm_sched_job_cleanup, and at this time, it will
>>>>>> >>>>>>>>>>> go to free
>>>>>> >>>>> job.
>>>>>> >>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At
>>>>>> >>>>>>>>>>> that time, job is not freed, but s_fence is already NULL.
>>>>>> >>>>>>>>> No, that case can't happen. See here:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>> >>>>>>>>>> dma_fence_put(job->fence);
>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>> >>>>>>>>>> kfree(job);
>>>>>> >>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>> >>>>>>>>> to the
>>>>>> >>>>> s_fence.
>>>>>> >>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>> >>>>>>>>> patch is a clear NAK.
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Regards,
>>>>>> >>>>>>>>> Christian.
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>> >>>>>>>>>>>> problem is somewhere else.
>>>>>> >>>>>>>>>>>>
>>>>>> >>>>>>>>>>>> Regards,
>>>>>> >>>>>>>>>>>> Christian.
>>>>>> >>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>> >>>>>>>>>>>>> ---
>>>>>> >>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>> >>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++---
>>>>>> >--
>>>>>> >>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>> >>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> >>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>>>>>> >>> amdgpu_device_gpu_recover(struct
>>>>>> >>>>>>>>>>>> amdgpu_device *adev,
>>>>>> >>>>>>>>>>>>> *
>>>>>> >>>>>>>>>>>>> * job->base holds a reference to parent fence
>>>>>> >>>>>>>>>>>>> */
>>>>>> >>>>>>>>>>>>> - if (job && job->base.s_fence->parent &&
>>>>>> >>>>>>>>>>>>> + if (job && job->base.s_fence &&
>>>>>> >>>>>>>>>>>>> + job->base.s_fence->parent
>>>>>> >>>>>>> &&
>>>>>> >>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>> >>>>>>>>>>>>> job_signaled = true;
>>>>>> >>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> >>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>>>>>> >drm_sched_increase_karma(struct
>>>>>> >>>>>>>>>>>> drm_sched_job
>>>>>> >>>>>>>>>>>>> *bad)
>>>>>> >>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>> spin_lock(&rq->lock);
>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe(entity,
>>>>>> >>>>>>>>>>>>> tmp,
>>>>>> >>> &rq-
>>>>>> >>>>>>>> entities,
>>>>>> >>>>>>>>>>>> list) {
>>>>>> >>>>>>>>>>>>> - if (bad->s_fence->scheduled.context
>>>>>> >>>>>>> ==
>>>>>> >>>>>>>>>>>>> - entity->fence_context) {
>>>>>> >>>>>>>>>>>>> + if (bad->s_fence &&
>>>>>> >>>>>>>>>>>>> + (bad->s_fence-
>>>>>> >>>>>>>>>>>>> scheduled.context ==
>>>>>> >>>>>>>>>>>>> + entity->fence_context)) {
>>>>>> >>>>>>>>>>>>> if
>>>>>> >>>>>>>>>>>>> (atomic_read(&bad-
>>>>>> >>>>>>>> karma) >
>>>>>> >>>>>>>>>>>>> bad->sched-
>>>>>> >>>> hang_limit)
>>>>>> >>>>>>>>>>>>> if
>>>>>> >>>>>>>>>>>>> (entity-
>>>>>> >>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>> >>>>>>>>>>>>> drm_sched_stop(struct
>>>>>> >>>>>>> drm_gpu_scheduler
>>>>>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>> >>>>>>>>>>>>> * This iteration is thread safe as sched thread
>>>>>> >>>>>>>>>>>>> is
>>>>>> >>> stopped.
>>>>>> >>>>>>>>>>>>> */
>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>> >>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>>>>>> >>>>>>>>>>>>> - if (s_job->s_fence->parent &&
>>>>>> >>>>>>>>>>>>> + if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>> >>>>>>>>>>>>> dma_fence_remove_callback(s_job-
>>>>>> >>>> s_fence-
>>>>>> >>>>>>>> parent,
>>>>>> >>>>>>>>>>>>> &s_job->cb)) {
>>>>>> >>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>>>>>> >>> @@ -
>>>>>> >>>>>>> 395,7
>>>>>> >>>>>>>>>>> +395,8 @@ void
>>>>>> >>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>> >>>>>>>>>>>>> *
>>>>>> >>>>>>>>>>>>> * Job is still alive so fence
>>>>>> >>>>>>>>>>>>> refcount at
>>>>>> >>> least 1
>>>>>> >>>>>>>>>>>>> */
>>>>>> >>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>>>>>> >>>>>>> false);
>>>>>> >>>>>>>>>>>>> + if (s_job->s_fence)
>>>>>> >>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>>>>>> >>>>>>>> finished,
>>>>>> >>>>>>>>>>>> false);
>>>>>> >>>>>>>>>>>>> /*
>>>>>> >>>>>>>>>>>>> * We must keep bad job alive
>>>>>> >>>>>>>>>>>>> for later
>>>>>> >>> use
>>>>>> >>>>>>> during @@
>>>>>> >>>>>>>>>>>> -438,7
>>>>>> >>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>>>>>> >>>>> *sched,
>>>>>> >>>>>>>>>>>>> +bool
>>>>>> >>>>>>>>>>>> full_recovery)
>>>>>> >>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>> >>>>>>>>>>>>> */
>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>> >>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>> >>>>>>>>>>>>> node)
>>>>>> >>>>>>>>>>>> {
>>>>>> >>>>>>>>>>>>> - struct dma_fence *fence = s_job->s_fence->parent;
>>>>>> >>>>>>>>>>>>> + struct dma_fence *fence = s_job->s_fence ?
>>>>>> >>>>>>>>>>>>> + s_job-
>>>>>> >>>>>>>> s_fence-
>>>>>> >>>>>>>>>>>>> parent :
>>>>>> >>>>>>>>>>>>> +NULL;
>>>>>> >>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>> >>>>>>>>>>>>>
>>>>>> >>>>>>>>>>> _______________________________________________
>>>>>> >>>>>>>>>>> amd-gfx mailing list
>>>>>> >>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>>>> >
>>>>>> >_______________________________________________
>>>>>> >amd-gfx mailing list
>>>>>> >amd-gfx at lists.freedesktop.org
>>>>>> >https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191113/c2faacfa/attachment-0001.html>
More information about the amd-gfx
mailing list