[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Koenig, Christian
Christian.Koenig at amd.com
Fri Nov 8 10:26:19 UTC 2019
Hi Emily,
well, who is calling amdgpu_device_gpu_recover() in this case?

When it's not the scheduler, we shouldn't have a guilty job in the first
place.
Regards,
Christian.
On 08.11.19 at 11:22, Deng, Emily wrote:
> Hi Christian,
> No, I am on the new branch and it already has that patch. Even if jobs are freed by the main scheduler, how can we prevent the main scheduler from freeing jobs while we are inside amdgpu_device_gpu_recover?
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Friday, November 8, 2019 6:15 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Emily,
>>
>> in this case you are on an old code branch.
>>
>> Jobs are freed now by the main scheduler thread and only if no timeout
>> handler is running.
>>
>> See this patch here:
>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>> Author: Christian König <christian.koenig at amd.com>
>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>
>>> drm/scheduler: rework job destruction
>> Regards,
>> Christian.
>>
>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>> Hi Christian,
>>> Please refer to the following log: when it enters the
>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is being
>>> freed in amdgpu_job_free_cb at the same time, because the hardware fence
>>> has signaled. But amdgpu_device_gpu_recover gets there faster; in this
>>> case the s_fence is already freed, while the job itself is not freed yet.
>>> Then this issue occurs.
>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>> [ 449.800818] PGD 0 P4D 0
>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE    4.18.0-15-generic #16~18.04.1-Ubuntu
>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>> [ 449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>> [ 449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 449.811937] Call Trace:
>>> [ 449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>> [ 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>> [ 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.814077]  process_one_work+0x1fd/0x3f0
>>> [ 449.814417]  worker_thread+0x34/0x410
>>> [ 449.814728]  kthread+0x121/0x140
>>> [ 449.815004]  ? process_one_work+0x3f0/0x3f0
>>> [ 449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>> [ 449.815799]  ret_from_fork+0x35/0x40
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>> Sorry, please take your time.
>>>> Have you seen my other response a bit below?
>>>>
>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>> without the job also being freed.
>>>>
>>>> So it looks like this patch is just papering over some bigger issues.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>> Ping.....
>>>>>> You need to give me at least enough time to wake up :)
>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf
>>>>>>>> Of Deng, Emily
>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>> tdr
>>>>>>>>>
>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>> When the job has already signaled, the s_fence is freed. This
>>>>>>>>>> then leads to a NULL pointer dereference in
>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>> case, by the time it enters amdgpu_device_gpu_recover, the job has
>>>>>>>> already gone through drm_sched_job_cleanup and is about to be freed.
>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At that point
>>>>>>>> the job is not freed yet, but s_fence is already NULL.
>>>>>> No, that case can't happen. See here:
>>>>>>
>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>
>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>> dma_fence_put(job->fence);
>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>> kfree(job);
>>>>>> The job itself is freed up directly after freeing the reference to the
>> s_fence.
>>>>>> So you are just papering over a much bigger problem here. This
>>>>>> patch is a clear NAK.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>> problem is somewhere else.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>  	 *
>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>  	 */
>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>
>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>  	 */
>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>  			 *
>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>  			 */
>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>
>>>>>>>>>>  			/*
>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>  	 */
>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>
>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx