[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Deng, Emily
Emily.Deng at amd.com
Fri Nov 8 10:22:36 UTC 2019
Hi Chrisitan,
No, I am with the new branch and also has the patch. Even it are freed by main scheduler, how we could avoid main scheduler to free jobs while enter to function amdgpu_device_gpu_recover?
Best wishes
Emily Deng
>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig at amd.com>
>Sent: Friday, November 8, 2019 6:15 PM
>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Hi Emily,
>
>in this case you are on an old code branch.
>
>Jobs are freed now by the main scheduler thread and only if no timeout
>handler is running.
>
>See this patch here:
>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>> Author: Christian König <christian.koenig at amd.com>
>> Date: Thu Apr 18 11:00:21 2019 -0400
>>
>> drm/scheduler: rework job destruction
>
>Regards,
>Christian.
>
>Am 08.11.19 um 11:11 schrieb Deng, Emily:
>> Hi Christian,
>> Please refer to follow log, when it enter to amdgpu_device_gpu_recover
>function, the bad job 000000005086879e is freeing in function
>amdgpu_job_free_cb at the same time, because of the hardware fence signal.
>But amdgpu_device_gpu_recover goes faster, at this case, the s_fence is
>already freed, but job is not freed in time. Then this issue occurs.
>>
>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
>> timeout, signaled seq=2481, emitted seq=2483 [ 449.793202]
>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>process pid 0 thread pid 0, s_job:000000005086879e [ 449.794163] amdgpu
>0000:00:08.0: GPU reset begin!
>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process
>> pid 0 thread pid 0, s_job:000000005086879e [ 449.794221]
>> Emily:amdgpu_job_free_cb,Process information: process pid 0 thread
>> pid 0, s_job:0000000066eb74ab [ 449.794222]
>> Emily:amdgpu_job_free_cb,Process information: process pid 0 thread
>> pid 0, s_job:00000000d4438ad9 [ 449.794255]
>> Emily:amdgpu_job_free_cb,Process information: process pid 0 thread
>> pid 0, s_job:00000000b6d69c65 [ 449.794257]
>> Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0,
>s_job:00000000ea85e922 [ 449.794287] Emily:amdgpu_job_free_cb,Process
>information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>[ 449.794366] BUG: unable to handle kernel NULL pointer dereference at
>00000000000000c0 [ 449.800818] PGD 0 P4D 0 [ 449.801040] Oops: 0000
>[#1] SMP PTI
>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE
>4.18.0-15-generic #16~18.04.1-Ubuntu
>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 449.802944] Workqueue: events
>> drm_sched_job_timedout [amd_sched] [ 449.803488] RIP:
>0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f
>85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98
>c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224] RBP:
>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0 R14:
>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS:
>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>> knlGS:0000000000000000 [ 449.809674] CS: 0010 DS: 0000 ES: 0000 CR0:
>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0 CR3:
>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747] DR0:
>0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>[ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>0000000000000400 [ 449.811937] Call Trace:
>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>> 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>> 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>> 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>> worker_thread+0x34/0x410 [ 449.814728] kthread+0x121/0x140 [
>> 449.815004] ? process_one_work+0x3f0/0x3f0 [ 449.815374] ?
>> kthread_create_worker_on_cpu+0x70/0x70
>> [ 449.815799] ret_from_fork+0x35/0x40
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Friday, November 8, 2019 5:43 PM
>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>> Sorry, please take your time.
>>> Have you seen my other response a bit below?
>>>
>>> I can't follow how it would be possible for job->s_fence to be NULL
>>> without the job also being freed.
>>>
>>> So it looks like this patch is just papering over some bigger issues.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>gfx at lists.freedesktop.org
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>> Ping.....
>>>>> You need to give me at least enough time to wake up :)
>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf
>>>>>>> Of Deng, Emily
>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>> gfx at lists.freedesktop.org
>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>> tdr
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>> When the job is already signaled, the s_fence is freed. Then it
>>>>>>>>> will has null pointer in amdgpu_device_gpu_recover.
>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>> See drm_sched_job_cleanup().
>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>> case, when it enter into the amdgpu_device_gpu_recover, it
>>>>>>> already in drm_sched_job_cleanup, and at this time, it will go to free
>job.
>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At that
>>>>>>> time, job is not freed, but s_fence is already NULL.
>>>>> No, that case can't happen. See here:
>>>>>
>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>
>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>> dma_fence_put(job->fence);
>>>>>> amdgpu_sync_free(&job->sync);
>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>> kfree(job);
>>>>> The job itself is freed up directly after freeing the reference to the
>s_fence.
>>>>>
>>>>> So you are just papering over a much bigger problem here. This
>>>>> patch is a clear NAK.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>> problem is somewhere else.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>> *
>>>>>>>>> * job->base holds a reference to parent fence
>>>>>>>>> */
>>>>>>>>> - if (job && job->base.s_fence->parent &&
>>>>>>>>> + if (job && job->base.s_fence && job->base.s_fence->parent
>>> &&
>>>>>>>>> dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>> job_signaled = true;
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct
>>>>>>>> drm_sched_job
>>>>>>>>> *bad)
>>>>>>>>>
>>>>>>>>> spin_lock(&rq->lock);
>>>>>>>>> list_for_each_entry_safe(entity, tmp, &rq-
>>>> entities,
>>>>>>>> list) {
>>>>>>>>> - if (bad->s_fence->scheduled.context
>>> ==
>>>>>>>>> - entity->fence_context) {
>>>>>>>>> + if (bad->s_fence && (bad->s_fence-
>>>>>>>>> scheduled.context ==
>>>>>>>>> + entity->fence_context)) {
>>>>>>>>> if (atomic_read(&bad-
>>>> karma) >
>>>>>>>>> bad->sched->hang_limit)
>>>>>>>>> if (entity->guilty)
>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct
>>> drm_gpu_scheduler
>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>> * This iteration is thread safe as sched thread is stopped.
>>>>>>>>> */
>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp, &sched-
>>>>>>>>> ring_mirror_list, node) {
>>>>>>>>> - if (s_job->s_fence->parent &&
>>>>>>>>> + if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>> dma_fence_remove_callback(s_job->s_fence-
>>>> parent,
>>>>>>>>> &s_job->cb)) {
>>>>>>>>> atomic_dec(&sched->hw_rq_count); @@ -
>>> 395,7
>>>>>>> +395,8 @@ void
>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>> *
>>>>>>>>> * Job is still alive so fence refcount at least 1
>>>>>>>>> */
>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>>> false);
>>>>>>>>> + if (s_job->s_fence)
>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>>>> finished,
>>>>>>>> false);
>>>>>>>>> /*
>>>>>>>>> * We must keep bad job alive for later use
>>> during @@
>>>>>>>> -438,7
>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>*sched,
>>>>>>>>> +bool
>>>>>>>> full_recovery)
>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>> */
>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>> node)
>>>>>>>> {
>>>>>>>>> - struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>> + struct dma_fence *fence = s_job->s_fence ? s_job-
>>>> s_fence-
>>>>>>>>> parent :
>>>>>>>>> +NULL;
>>>>>>>>>
>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list