[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Tue Nov 12 03:28:08 UTC 2019

Thinking more about this claim - we assume here that if cancel_delayed_work returned true it guarantees that timeout work is not running but, it merely means there was a pending timeout work which was removed from the workqueue before it's timer elapsed and so it didn't have a chance to be dequeued and executed, it doesn't cover already executing work. So there is a possibility where while timeout work started executing another timeout work already got enqueued (maybe through earlier cleanup jobs or through drm_sched_fault) and if at this point another drm_sched_cleanup_jobs runs cancel_delayed_work(&sched->work_tdr) will return true even while there is a timeout job in progress.
Unfortunately we cannot change cancel_delayed_work to cancel_delayed_work_sync to flush the timeout work as timeout work itself waits for schedule thread  to be parked again when calling park_thread.

Andrey

________________________________________
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Koenig, Christian <Christian.Koenig at amd.com>
Sent: 08 November 2019 05:35:18
To: Deng, Emily; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Hi Emily,

exactly that can't happen. See here:

>         /* Don't destroy jobs while the timeout worker is running */
>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>             !cancel_delayed_work(&sched->work_tdr))
>                 return NULL;

We never free jobs while the timeout working is running to prevent
exactly that issue.

Regards,
Christian.

Am 08.11.19 um 11:32 schrieb Deng, Emily:
> Hi Christian,
>       The drm_sched_job_timedout-> amdgpu_job_timedout call amdgpu_device_gpu_recover. I mean the main scheduler free the jobs while in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Friday, November 8, 2019 6:26 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Emily,
>>
>> well who is calling amdgpu_device_gpu_recover() in this case?
>>
>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>
>> Regards,
>> Christian.
>>
>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>> Hi Chrisitan,
>>>        No, I am with the new branch and also has the patch. Even it are freed by
>> main scheduler, how we could avoid main scheduler to free jobs while enter
>> to function amdgpu_device_gpu_recover?
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Hi Emily,
>>>>
>>>> in this case you are on an old code branch.
>>>>
>>>> Jobs are freed now by the main scheduler thread and only if no
>>>> timeout handler is running.
>>>>
>>>> See this patch here:
>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>
>>>>>       drm/scheduler: rework job destruction
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>> Hi Christian,
>>>>>         Please refer to follow log, when it enter to
>>>>> amdgpu_device_gpu_recover
>>>> function, the bad job 000000005086879e is freeing in function
>>>> amdgpu_job_free_cb  at the same time, because of the hardware fence
>> signal.
>>>> But amdgpu_device_gpu_recover goes faster, at this case, the s_fence
>>>> is already freed, but job is not freed in time. Then this issue occurs.
>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>> sdma0
>>>>> timeout, signaled seq=2481, emitted seq=2483 [  449.793202]
>>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>>>> process  pid 0 thread  pid 0, s_job:000000005086879e [  449.794163]
>>>> amdgpu
>>>> 0000:00:08.0: GPU reset begin!
>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process
>>>>> pid 0 thread  pid 0, s_job:000000005086879e [  449.794221]
>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>>> pid 0, s_job:0000000066eb74ab [  449.794222]
>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>>> pid 0, s_job:00000000d4438ad9 [  449.794255]
>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>>> pid 0, s_job:00000000b6d69c65 [  449.794257]
>>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>>> pid 0,
>>>> s_job:00000000ea85e922 [  449.794287]
>>>> Emily:amdgpu_job_free_cb,Process
>>>> information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6 [
>>>> 449.794366] BUG: unable to handle kernel NULL pointer dereference at
>>>> 00000000000000c0 [  449.800818] PGD 0 P4D 0 [  449.801040] Oops: 0000
>>>> [#1] SMP PTI
>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE
>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  449.802944]
>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [  449.803488]
>> RIP:
>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff
>>>>> 45 85 e4 0f
>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b
>> 98
>>>> c0 00         00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>>> 0000000000000000 [  449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>>>>> 0000000fffffffe0 RDI: 0000000000000246 [  449.807224] RBP:
>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>>>>> 0000000000000000 [  449.808411] R13: ffffb4c7c08f7da0 R14:
>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [  449.809004] FS:
>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>> knlGS:0000000000000000 [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0:
>>>>> 0000000080050033 [  449.810153] CR2: 00000000000000c0 CR3:
>>>>> 000000003cc0a001 CR4: 00000000003606e0 [  449.810747] DR0:
>>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
>>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400 [  449.811937] Call Trace:
>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>>>>> 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>> 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>>>>> 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>> 449.814077]  process_one_work+0x1fd/0x3f0 [  449.814417]
>>>>> worker_thread+0x34/0x410 [  449.814728]  kthread+0x121/0x140 [
>>>>> 449.815004]  ? process_one_work+0x3f0/0x3f0 [  449.815374]  ?
>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>> Sorry, please take your time.
>>>>>> Have you seen my other response a bit below?
>>>>>>
>>>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>>>> without the job also being freed.
>>>>>>
>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>> gfx at lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>> Ping.....
>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On
>> Behalf
>>>>>>>>>> Of Deng, Emily
>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>>> tdr
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>
>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then
>>>>>>>>>>>> it will has null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>>> case, when it enter into the amdgpu_device_gpu_recover, it
>>>>>>>>>> already in drm_sched_job_cleanup, and at this time, it will go
>>>>>>>>>> to free
>>>> job.
>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At that
>>>>>>>>>> time, job is not freed, but s_fence is already NULL.
>>>>>>>> No, that case can't happen. See here:
>>>>>>>>
>>>>>>>>>             drm_sched_job_cleanup(s_job);
>>>>>>>>>
>>>>>>>>>             amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>             dma_fence_put(job->fence);
>>>>>>>>>             amdgpu_sync_free(&job->sync);
>>>>>>>>>             amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>             kfree(job);
>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>> to the
>>>> s_fence.
>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>>> patch is a clear NAK.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>        drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>        2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>> amdgpu_device_gpu_recover(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>             *
>>>>>>>>>>>>             * job->base holds a reference to parent fence
>>>>>>>>>>>>             */
>>>>>>>>>>>> -  if (job && job->base.s_fence->parent &&
>>>>>>>>>>>> +  if (job && job->base.s_fence && job->base.s_fence->parent
>>>>>> &&
>>>>>>>>>>>>                dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>                    job_signaled = true;
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct
>>>>>>>>>>> drm_sched_job
>>>>>>>>>>>> *bad)
>>>>>>>>>>>>
>>>>>>>>>>>>                            spin_lock(&rq->lock);
>>>>>>>>>>>>                            list_for_each_entry_safe(entity, tmp,
>> &rq-
>>>>>>> entities,
>>>>>>>>>>> list) {
>>>>>>>>>>>> -                          if (bad->s_fence->scheduled.context
>>>>>> ==
>>>>>>>>>>>> -                              entity->fence_context) {
>>>>>>>>>>>> +                          if (bad->s_fence && (bad->s_fence-
>>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>>> +                              entity->fence_context)) {
>>>>>>>>>>>>                                            if (atomic_read(&bad-
>>>>>>> karma) >
>>>>>>>>>>>>                                                bad->sched-
>>> hang_limit)
>>>>>>>>>>>>                                                    if (entity-
>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>>> drm_sched_stop(struct
>>>>>> drm_gpu_scheduler
>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>             * This iteration is thread safe as sched thread is
>> stopped.
>>>>>>>>>>>>             */
>>>>>>>>>>>>            list_for_each_entry_safe_reverse(s_job, tmp, &sched-
>>>>>>>>>>>> ring_mirror_list, node) {
>>>>>>>>>>>> -          if (s_job->s_fence->parent &&
>>>>>>>>>>>> +          if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>                        dma_fence_remove_callback(s_job-
>>> s_fence-
>>>>>>> parent,
>>>>>>>>>>>>                                                  &s_job->cb)) {
>>>>>>>>>>>>                            atomic_dec(&sched->hw_rq_count);
>> @@ -
>>>>>> 395,7
>>>>>>>>>> +395,8 @@ void
>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>                             *
>>>>>>>>>>>>                             * Job is still alive so fence refcount at
>> least 1
>>>>>>>>>>>>                             */
>>>>>>>>>>>> -                  dma_fence_wait(&s_job->s_fence->finished,
>>>>>> false);
>>>>>>>>>>>> +                  if (s_job->s_fence)
>>>>>>>>>>>> +                          dma_fence_wait(&s_job->s_fence-
>>>>>>> finished,
>>>>>>>>>>> false);
>>>>>>>>>>>>                            /*
>>>>>>>>>>>>                             * We must keep bad job alive for later
>> use
>>>>>> during @@
>>>>>>>>>>> -438,7
>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>>>> *sched,
>>>>>>>>>>>> +bool
>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>             * GPU recovers can't run in parallel.
>>>>>>>>>>>>             */
>>>>>>>>>>>>            list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>> node)
>>>>>>>>>>> {
>>>>>>>>>>>> -          struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>> +          struct dma_fence *fence = s_job->s_fence ? s_job-
>>>>>>> s_fence-
>>>>>>>>>>>> parent :
>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>
>>>>>>>>>>>>                    atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx