[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Mon Nov 11 18:06:21 UTC 2019

Note that kthread_park waits for kthread->parked to be signaled before 
proceeding - so in the scenario you described it meas main thread is 
running (not parked and so kthread->parked is not signaled) and so 
kthread_park will not proceed until the sched thread finish current loop 
(including removing any signaled jobs from ring_mirror_list) and is back 
to 
wait_event_interruptible->drm_sched_blocked->kthread_parkme->complete(&self->parked) 
to park itself - so looks to me it should be OK.

Andrey

On 11/11/19 2:19 AM, Deng, Emily wrote:
> Hi Andrey,
>      I don’t think your patch will help for this. As it will may call kthread_should_park in drm_sched_cleanup_jobs first, and then call kcl_kthread_park. And then it still has a race between the 2 threads.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> Sent: Saturday, November 9, 2019 3:01 AM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>>
>> On 11/8/19 5:35 AM, Koenig, Christian wrote:
>>> Hi Emily,
>>>
>>> exactly that can't happen. See here:
>>>
>>>>           /* Don't destroy jobs while the timeout worker is running */
>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>               !cancel_delayed_work(&sched->work_tdr))
>>>>                   return NULL;
>>> We never free jobs while the timeout working is running to prevent
>>> exactly that issue.
>>
>> I don't think this protects us if drm_sched_cleanup_jobs is called for scheduler
>> which didn't experience a timeout, in amdgpu_device_gpu_recover we access
>> sched->ring_mirror_list for all the schedulers on a device so this condition
>> above won't protect us. What in fact could help maybe is my recent patch
>> 541c521 drm/sched: Avoid job cleanup if sched thread is parked. because we
>> do park each of the scheduler threads during tdr job before trying to access
>> sched->ring_mirror_list.
>>
>> Emily - did you see this problem with that patch in place ? I only pushed it
>> yesterday.
>>
>> Andrey
>>
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>> Hi Christian,
>>>>         The drm_sched_job_timedout-> amdgpu_job_timedout call
>> amdgpu_device_gpu_recover. I mean the main scheduler free the jobs while
>> in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Hi Emily,
>>>>>
>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>
>>>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>> Hi Chrisitan,
>>>>>>          No, I am with the new branch and also has the patch. Even
>>>>>> it are freed by
>>>>> main scheduler, how we could avoid main scheduler to free jobs while
>>>>> enter to function amdgpu_device_gpu_recover?
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>> tdr
>>>>>>>
>>>>>>> Hi Emily,
>>>>>>>
>>>>>>> in this case you are on an old code branch.
>>>>>>>
>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>> timeout handler is running.
>>>>>>>
>>>>>>> See this patch here:
>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>
>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>> Hi Christian,
>>>>>>>>           Please refer to follow log, when it enter to
>>>>>>>> amdgpu_device_gpu_recover
>>>>>>> function, the bad job 000000005086879e is freeing in function
>>>>>>> amdgpu_job_free_cb  at the same time, because of the hardware
>>>>>>> fence
>>>>> signal.
>>>>>>> But amdgpu_device_gpu_recover goes faster, at this case, the
>>>>>>> s_fence is already freed, but job is not freed in time. Then this issue
>> occurs.
>>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>> sdma0
>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [  449.793202]
>>>>>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>>>>>>> process  pid 0 thread  pid 0, s_job:000000005086879e [
>>>>>>> 449.794163] amdgpu
>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>> process pid 0 thread  pid 0, s_job:000000005086879e [
>>>>>>>> 449.794221] Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>> pid 0 thread pid 0, s_job:0000000066eb74ab [  449.794222]
>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0
>>>>>>>> thread pid 0, s_job:00000000d4438ad9 [  449.794255]
>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0
>>>>>>>> thread pid 0, s_job:00000000b6d69c65 [  449.794257]
>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0
>>>>>>>> thread pid 0,
>>>>>>> s_job:00000000ea85e922 [  449.794287]
>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>> information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
>>>>>>> [ 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>> dereference at
>>>>>>> 00000000000000c0 [  449.800818] PGD 0 P4D 0 [  449.801040] Oops:
>>>>>>> 0000 [#1] SMP PTI
>>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE
>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  449.802944]
>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [
>>>>>>>> 449.803488]
>>>>> RIP:
>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff
>>>>>>>> ff
>>>>>>>> 45 85 e4 0f
>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10
>>>>>>> <48> 8b
>>>>> 98
>>>>>>> c0 00         00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>>>>>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>>>>>> 0000000000000000 [  449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [  449.807224] RBP:
>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>>>>>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>>>>>>>> 0000000000000000 [  449.808411] R13: ffffb4c7c08f7da0 R14:
>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [  449.809004] FS:
>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>> knlGS:0000000000000000 [  449.809674] CS:  0010 DS: 0000 ES: 0000
>> CR0:
>>>>>>>> 0000000080050033 [  449.810153] CR2: 00000000000000c0 CR3:
>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [  449.810747] DR0:
>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [
>>>>>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>>> 0000000000000400 [  449.811937] Call Trace:
>>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>>>>>>>> 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>>>> 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>>>>>>>> 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>>>> 449.814077]  process_one_work+0x1fd/0x3f0 [  449.814417]
>>>>>>>> worker_thread+0x34/0x410 [  449.814728]  kthread+0x121/0x140 [
>>>>>>>> 449.815004]  ? process_one_work+0x3f0/0x3f0 [  449.815374]  ?
>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>> gfx at lists.freedesktop.org
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>> tdr
>>>>>>>>>
>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>> Sorry, please take your time.
>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>
>>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>
>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>
>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>> Ping.....
>>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>>
>>>>>>>>>>>> Best wishes
>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On
>>>>> Behalf
>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>> Then it will has null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in
>>>>>>>>>>>>> one case, when it enter into the amdgpu_device_gpu_recover,
>>>>>>>>>>>>> it already in drm_sched_job_cleanup, and at this time, it
>>>>>>>>>>>>> will go to free
>>>>>>> job.
>>>>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At
>>>>>>>>>>>>> that time, job is not freed, but s_fence is already NULL.
>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>
>>>>>>>>>>>>               drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>
>>>>>>>>>>>>               amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>               dma_fence_put(job->fence);
>>>>>>>>>>>>               amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>               amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>               kfree(job);
>>>>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>>>> reference to the
>>>>>>> s_fence.
>>>>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>>>>>> patch is a clear NAK.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>          drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-
>> ----
>>>>>>>>>>>>>>>          2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>>>>> amdgpu_device_gpu_recover(struct
>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>          	 *
>>>>>>>>>>>>>>>          	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>          	 */
>>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>> +	if (job && job->base.s_fence &&
>>>>>>>>>>>>>>> +job->base.s_fence->parent
>>>>>>>>> &&
>>>>>>>>>>>>>>>          	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>          		job_signaled = true;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>> drm_sched_increase_karma(struct
>>>>>>>>>>>>>> drm_sched_job
>>>>>>>>>>>>>>> *bad)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          			spin_lock(&rq->lock);
>>>>>>>>>>>>>>>          			list_for_each_entry_safe(entity, tmp,
>>>>> &rq-
>>>>>>>>>> entities,
>>>>>>>>>>>>>> list) {
>>>>>>>>>>>>>>> -				if (bad->s_fence-
>>> scheduled.context
>>>>>>>>> ==
>>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>>> +				if (bad->s_fence && (bad-
>>> s_fence-
>>>>>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>>          					if (atomic_read(&bad-
>>>>>>>>>> karma) >
>>>>>>>>>>>>>>>          					    bad->sched-
>>>>>> hang_limit)
>>>>>>>>>>>>>>>          						if (entity-
>>>>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>>>>>> drm_sched_stop(struct
>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>          	 * This iteration is thread safe as sched thread
>>>>>>>>>>>>>>> is
>>>>> stopped.
>>>>>>>>>>>>>>>          	 */
>>>>>>>>>>>>>>>          	list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent
>> &&
>>>>>>>>>>>>>>>          		    dma_fence_remove_callback(s_job-
>>>>>> s_fence-
>>>>>>>>>> parent,
>>>>>>>>>>>>>>>          					      &s_job->cb)) {
>>>>>>>>>>>>>>>          			atomic_dec(&sched->hw_rq_count);
>>>>> @@ -
>>>>>>>>> 395,7
>>>>>>>>>>>>> +395,8 @@ void
>>>>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>          			 *
>>>>>>>>>>>>>>>          			 * Job is still alive so fence refcount at
>>>>> least 1
>>>>>>>>>>>>>>>          			 */
>>>>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence-
>>> finished,
>>>>>>>>> false);
>>>>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>>>>> +				dma_fence_wait(&s_job-
>>> s_fence-
>>>>>>>>>> finished,
>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>          			/*
>>>>>>>>>>>>>>>          			 * We must keep bad job alive for later
>>>>> use
>>>>>>>>> during @@
>>>>>>>>>>>>>> -438,7
>>>>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>>>>>>> *sched,
>>>>>>>>>>>>>>> +bool
>>>>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>>>>          	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>          	 */
>>>>>>>>>>>>>>>          	list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence-
>>> parent;
>>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ?
>> s_job-
>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>> parent :
>>>>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx