[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Koenig, Christian
Christian.Koenig at amd.com
Fri Nov 8 10:26:19 UTC 2019
Hi Emily,
well, who is calling amdgpu_device_gpu_recover() in this case?

When it's not the scheduler, we shouldn't have a guilty job in the first
place.
Regards,
Christian.
On 08.11.19 at 11:22, Deng, Emily wrote:
> Hi Christian,
> No, I am on the new branch and it already has that patch. Even if jobs are freed by the main scheduler, how can we prevent the main scheduler from freeing jobs while we are inside amdgpu_device_gpu_recover?
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Friday, November 8, 2019 6:15 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Emily,
>>
>> in this case you are on an old code branch.
>>
>> Jobs are freed now by the main scheduler thread and only if no timeout
>> handler is running.
>>
>> See this patch here:
>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>> Author: Christian König <christian.koenig at amd.com>
>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>
>>> drm/scheduler: rework job destruction
>> Regards,
>> Christian.
>>
>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>> Hi Christian,
>>> Please refer to the following log: when it enters the
>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is being
>>> freed in amdgpu_job_free_cb at the same time, because the hardware fence
>>> has signaled. But amdgpu_device_gpu_recover gets there faster; in this
>>> case the s_fence is already freed, while the job itself is not freed yet.
>>> Then this issue occurs.
>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>> [ 449.800818] PGD 0 P4D 0
>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE    4.18.0-15-generic #16~18.04.1-Ubuntu
>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>> [ 449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>> [ 449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 449.811937] Call Trace:
>>> [ 449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>> [ 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>> [ 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.814077]  process_one_work+0x1fd/0x3f0
>>> [ 449.814417]  worker_thread+0x34/0x410
>>> [ 449.814728]  kthread+0x121/0x140
>>> [ 449.815004]  ? process_one_work+0x3f0/0x3f0
>>> [ 449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>> [ 449.815799]  ret_from_fork+0x35/0x40
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>> Sorry, please take your time.
>>>> Have you seen my other response a bit below?
>>>>
>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>> without the job also being freed.
>>>>
>>>> So it looks like this patch is just papering over some bigger issues.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>> Ping.....
>>>>>> You need to give me at least enough time to wake up :)
>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf
>>>>>>>> Of Deng, Emily
>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>> tdr
>>>>>>>>>
>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>> When the job has already signaled, the s_fence is freed. This
>>>>>>>>>> then leads to a NULL pointer dereference in
>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>> case, by the time it enters amdgpu_device_gpu_recover, the job has
>>>>>>>> already gone through drm_sched_job_cleanup and is about to be freed.
>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At that point
>>>>>>>> the job is not freed yet, but s_fence is already NULL.
>>>>>> No, that case can't happen. See here:
>>>>>>
>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>
>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>> dma_fence_put(job->fence);
>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>> kfree(job);
>>>>>> The job itself is freed up directly after freeing the reference to the
>> s_fence.
>>>>>> So you are just papering over a much bigger problem here. This
>>>>>> patch is a clear NAK.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>> problem is somewhere else.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>  	 *
>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>  	 */
>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>
>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>  	 */
>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>  			 *
>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>  			 */
>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>
>>>>>>>>>>  			/*
>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>  	 */
>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>
>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx