[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Koenig, Christian Christian.Koenig at amd.com
Fri Nov 8 10:14:41 UTC 2019


Hi Emily,

In this case you are on an old code branch.

Jobs are now freed by the main scheduler thread, and only if no timeout
handler is running.

See this patch here:
> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
> Author: Christian König <christian.koenig at amd.com>
> Date:   Thu Apr 18 11:00:21 2019 -0400
>
>     drm/scheduler: rework job destruction
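
That means free_job only runs from the scheduler thread, and only after the
timeout worker has been cancelled. Roughly, the cleanup logic that commit
introduces looks like this (a from-memory sketch, not the verbatim code; the
real function also takes job_list_lock and re-arms the timeout afterwards):

    static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
    {
            /* don't destroy jobs while the timeout worker may still use them */
            if (!cancel_delayed_work(&sched->work_tdr))
                    return;

            while (!list_empty(&sched->ring_mirror_list)) {
                    struct drm_sched_job *job;

                    job = list_first_entry(&sched->ring_mirror_list,
                                           struct drm_sched_job, node);
                    if (!dma_fence_is_signaled(&job->s_fence->finished))
                            break;

                    /* remove the signaled job from the mirror list, then free it */
                    list_del_init(&job->node);
                    sched->ops->free_job(job);
            }
    }

So a job the timeout handler could still be looking at is never freed
concurrently.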

Regards,
Christian.

On 08.11.19 at 11:11, Deng, Emily wrote:
> Hi Christian,
>       Please refer to the following log. When execution enters the
> amdgpu_device_gpu_recover function, the bad job 000000005086879e is being
> freed in amdgpu_job_free_cb at the same time, because its hardware fence
> has signaled. But amdgpu_device_gpu_recover runs faster: at that point the
> s_fence has already been freed, while the job itself has not been freed
> yet. That is when this issue occurs.
>
> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:0000000066eb74ab
> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000d4438ad9
> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000b6d69c65
> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ea85e922
> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
> [  449.800818] PGD 0 P4D 0
> [  449.801040] Oops: 0000 [#1] SMP PTI
> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE     4.18.0-15-generic #16~18.04.1-Ubuntu
> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  449.811937] Call Trace:
> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
> [  449.814077]  process_one_work+0x1fd/0x3f0
> [  449.814417]  worker_thread+0x34/0x410
> [  449.814728]  kthread+0x121/0x140
> [  449.815004]  ? process_one_work+0x3f0/0x3f0
> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  449.815799]  ret_from_fork+0x35/0x40
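>
> In timeline form, what I believe happens is roughly this (my reading of the
> log above, illustrative only):
>
>     /* scheduler free path                timeout worker                    */
>     /* hardware fence signals             drm_sched_job_timedout() fires    */
>     /* amdgpu_job_free_cb() starts        amdgpu_device_gpu_recover() runs  */
>     /* drm_sched_job_cleanup():           dereferences                      */
>     /*   job->s_fence = NULL              job->base.s_fence->parent -> oops */
>     /* kfree(job) not reached yet                                           */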
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Friday, November 8, 2019 5:43 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>> Sorry, please take your time.
>> Have you seen my other response a bit below?
>>
>> I can't follow how it would be possible for job->s_fence to be NULL without
>> the job also being freed.
>>
>> So it looks like this patch is just papering over some bigger issues.
>>
>> Regards,
>> Christian.
>>
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>> Ping.....
>>>> You need to give me at least enough time to wake up :)
>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>>>> Deng, Emily
>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>> gfx at lists.freedesktop.org
>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>> tdr
>>>>>>>
>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>> When the job is already signaled, the s_fence is freed. Then there
>>>>>>>> will be a NULL pointer dereference in amdgpu_device_gpu_recover.
>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>> See drm_sched_job_cleanup().
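>>>>>>>
>>>>>>> From memory the relevant part is essentially this (a sketch, not the
>>>>>>> verbatim code):
>>>>>>>
>>>>>>>     void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>>>>>     {
>>>>>>>             dma_fence_put(&job->s_fence->finished);
>>>>>>>             job->s_fence = NULL;  /* the only place s_fence becomes NULL */
>>>>>>>     }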
>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But there is one
>>>>>> case where, by the time execution enters amdgpu_device_gpu_recover,
>>>>>> the job has already gone through drm_sched_job_cleanup and is about
>>>>>> to be freed. Sometimes amdgpu_device_gpu_recover is faster: at that
>>>>>> point the job is not freed yet, but s_fence is already NULL.
>>>> No, that case can't happen. See here:
>>>>
>>>>>           drm_sched_job_cleanup(s_job);   /* s_fence is set to NULL here */
>>>>>
>>>>>           amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>           dma_fence_put(job->fence);
>>>>>           amdgpu_sync_free(&job->sync);
>>>>>           amdgpu_sync_free(&job->sched_sync);
>>>>>           kfree(job);                     /* ...and the job is freed right after */
>>>> The job itself is freed up directly after freeing the reference to the s_fence.
>>>>
>>>> So you are just papering over a much bigger problem here. This patch
>>>> is a clear NAK.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>>> When you see a job without an s_fence, that means the problem is
>>>>>>> somewhere else.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>      2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>      	 *
>>>>>>>>      	 * job->base holds a reference to parent fence
>>>>>>>>      	 */
>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>      	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>      		job_signaled = true;
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>      			spin_lock(&rq->lock);
>>>>>>>>      			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>> -				    entity->fence_context) {
>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>      					if (atomic_read(&bad->karma) >
>>>>>>>>      					    bad->sched->hang_limit)
>>>>>>>>      						if (entity->guilty)
>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>      	 */
>>>>>>>>      	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>      		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>      					      &s_job->cb)) {
>>>>>>>>      			atomic_dec(&sched->hw_rq_count);
>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      			 *
>>>>>>>>      			 * Job is still alive so fence refcount at least 1
>>>>>>>>      			 */
>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>> +			if (s_job->s_fence)
>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>      			/*
>>>>>>>>      			 * We must keep bad job alive for later use during
>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>      	 * GPU recovers can't run in parallel.
>>>>>>>>      	 */
>>>>>>>>      	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>
>>>>>>>>      		atomic_inc(&sched->hw_rq_count);
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


