[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Wed Nov 13 14:12:56 UTC 2019

This is why I asked for a trace with the timer enabled, but since there is a
finite number of places where we touch the timer, Emily can just put prints
there. Also, I wonder whether this temporary fix helps her with the issue.

Andrey

On 11/13/19 2:36 AM, Christian König wrote:
> The question is where do we rearm the timer for this problem to occur?
>
> Regards,
> Christian.
>
> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>
>> I was able to reproduce the crash by using the attached
>> simulate_crash.patch: waiting on the guilty job to signal in the reset work
>> and artificially rearming the timeout timer just before the check for
>> !cancel_delayed_work(&sched->work_tdr) in drm_sched_cleanup_jobs. The
>> crash log is attached in crash.log. I think this confirms the theory I
>> described earlier in this thread.
>>
>> basic_fix.patch handles this by testing whether another timer is already
>> armed on this scheduler or whether a timeout work is executing right
>> now (see the documentation for work_busy). Obviously this is not a full
>> solution, as it will not protect from races if, for example, there is
>> immediate work scheduling such as in drm_sched_fault. So we
>> probably need to account for this by making drm_sched_cleanup_jobs
>> (at least the part where it iterates the ring mirror list and frees
>> jobs) and GPU reset truly mutually exclusive, unlike now.
>>
>> Andrey
>>
>>
>> On 11/11/19 4:11 PM, Christian König wrote:
>>> Hi Emily,
>>>
>>> you need to print which scheduler instance is freeing the jobs and
>>> which one is triggering the reset. The TID and PID are completely
>>> meaningless here since we are called from different worker threads
>>> and the TID/PID can change on each call.
>>>
>>> Apart from that I will look into this a bit deeper when I have time.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>> Hi Christian,
>>>>     I added the following prints to drm_sched_cleanup_jobs. The log
>>>> shows that using only cancel_delayed_work does not prevent jobs from
>>>> being freed while the scheduler is in reset. But I don't know exactly
>>>> where the driver goes wrong. Do you have any suggestions
>>>> about this?
>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%d, pid:%d\n",
>>>> +        current->tgid, current->pid);
>>>>         /*
>>>>          * Don't destroy jobs while the timeout worker is running OR thread
>>>>          * is being parked and hence assumed to not touch ring_mirror_list
>>>>          */
>>>>          if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>               !cancel_delayed_work(&sched->work_tdr)))
>>>>                 return;
>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%d, pid:%d\n",
>>>> +        current->tgid, current->pid);
>>>> Best wishes
>>>> Emily Deng
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread pid 0, s_job:00000000fe75ab36,tid=15603, pid=15603
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225428] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000fe75ab36, tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225494] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000f086ec84, tid:2262, pid:2262
>>>> >-----Original Message-----
>>>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>> >Sent: Tuesday, November 12, 2019 11:28 AM
>>>> >To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>> ><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>> >
>>>> >Thinking more about this claim: we assume here that if cancel_delayed_work
>>>> >returned true, it guarantees that the timeout work is not running, but it merely
>>>> >means there was a pending timeout work which was removed from the
>>>> >workqueue before its timer elapsed, so it didn't have a chance to be
>>>> >dequeued and executed. It doesn't cover already executing work. So it is
>>>> >possible that, while a timeout work has started executing, another timeout work
>>>> >has already been enqueued (maybe through earlier cleanup jobs or through
>>>> >drm_sched_fault), and if another drm_sched_cleanup_jobs runs at this point,
>>>> >cancel_delayed_work(&sched->work_tdr) will return true even while there is a
>>>> >timeout job in progress.
>>>> >Unfortunately we cannot change cancel_delayed_work to
>>>> >cancel_delayed_work_sync to flush the timeout work, as the timeout work itself
>>>> >waits for the scheduler thread to be parked again when calling park_thread.
>>>> >
>>>> >Andrey
>>>> >
>>>> >________________________________________
>>>> >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>>> >Koenig, Christian <Christian.Koenig at amd.com>
>>>> >Sent: 08 November 2019 05:35:18
>>>> >To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>> >
>>>> >Hi Emily,
>>>> >
>>>> >exactly that can't happen. See here:
>>>> >
>>>> >>         /* Don't destroy jobs while the timeout worker is running */
>>>> >>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>> >>            !cancel_delayed_work(&sched->work_tdr))
>>>> >>                 return NULL;
>>>> >
>>>> >We never free jobs while the timeout worker is running to prevent exactly
>>>> >that issue.
>>>> >
>>>> >Regards,
>>>> >Christian.
>>>> >
>>>> >Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>> >> Hi Christian,
>>>> >>       drm_sched_job_timedout -> amdgpu_job_timedout calls
>>>> >> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs while
>>>> >> we are in amdgpu_device_gpu_recover, before drm_sched_stop is called.
>>>> >>
>>>> >> Best wishes
>>>> >> Emily Deng
>>>> >>
>>>> >>
>>>> >>
>>>> >>> -----Original Message-----
>>>> >>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> >>> Sent: Friday, November 8, 2019 6:26 PM
>>>> >>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>> >>>
>>>> >>> Hi Emily,
>>>> >>>
>>>> >>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>> >>>
>>>> >>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>> >>>
>>>> >>> Regards,
>>>> >>> Christian.
>>>> >>>
>>>> >>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>> >>>> Hi Christian,
>>>> >>>>        No, I am on the new branch and it also has the patch. Even if
>>>> >>>> jobs are freed by the main scheduler, how can we prevent the main
>>>> >>>> scheduler from freeing jobs while we are entering
>>>> >>>> amdgpu_device_gpu_recover?
>>>> >>>> Best wishes
>>>> >>>> Emily Deng
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>> -----Original Message-----
>>>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> >>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>> >>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>> >>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>> >>>>>
>>>> >>>>> Hi Emily,
>>>> >>>>>
>>>> >>>>> in this case you are on an old code branch.
>>>> >>>>>
>>>> >>>>> Jobs are freed now by the main scheduler thread and only if no
>>>> >>>>> timeout handler is running.
>>>> >>>>>
>>>> >>>>> See this patch here:
>>>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>> >>>>>> Author: Christian König <christian.koenig at amd.com>
>>>> >>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>> >>>>>>
>>>> >>>>>>       drm/scheduler: rework job destruction
>>>> >>>>> Regards,
>>>> >>>>> Christian.
>>>> >>>>>
>>>> >>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>> >>>>>> Hi Christian,
>>>> >>>>>>         Please refer to the following log. When
>>>> >>>>>> amdgpu_device_gpu_recover is entered, the bad job 000000005086879e
>>>> >>>>>> is being freed in amdgpu_job_free_cb at the same time, because the
>>>> >>>>>> hardware fence signaled. But amdgpu_device_gpu_recover is faster:
>>>> >>>>>> in this case the s_fence is already freed, but the job is not
>>>> >>>>>> freed in time. Then this issue occurs.
>>>> >>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>> >>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
>>>> >>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>> >>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread  pid 0, s_job:000000005086879e
>>>> >>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>> >>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>> >>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>> >>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>> >>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
>>>> >>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>> >>>>>> [  449.800818] PGD 0 P4D 0
>>>> >>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>> >>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>> >>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> >>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>> >>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>> >>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>> >>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>> >>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>> >>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>> >>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>> >>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>> >>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>> >>>>>> [  449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>> >>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> >>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>> >>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> >>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> >>>>>> [  449.811937] Call Trace:
>>>> >>>>>> [  449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>> >>>>>> [  449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>> >>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>> >>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>> >>>>>> [  449.814077] process_one_work+0x1fd/0x3f0
>>>> >>>>>> [  449.814417] worker_thread+0x34/0x410
>>>> >>>>>> [  449.814728]  kthread+0x121/0x140
>>>> >>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>> >>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>> >>>>>> [  449.815799] ret_from_fork+0x35/0x40
>>>> >>>>>>
>>>> >>>>>>> -----Original Message-----
>>>> >>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>> >>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>> >>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>> >>>>>>> tdr
>>>> >>>>>>>
>>>> >>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>> >>>>>>>> Sorry, please take your time.
>>>> >>>>>>> Have you seen my other response a bit below?
>>>> >>>>>>>
>>>> >>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>> >>>>>>> NULL without the job also being freed.
>>>> >>>>>>>
>>>> >>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>> >>>>>>>
>>>> >>>>>>> Regards,
>>>> >>>>>>> Christian.
>>>> >>>>>>>
>>>> >>>>>>>> Best wishes
>>>> >>>>>>>> Emily Deng
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>> -----Original Message-----
>>>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>> >>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>> >>>>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>> >>>>>>>>> tdr
>>>> >>>>>>>>>
>>>> >>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>> >>>>>>>>>> Ping.....
>>>> >>>>>>>>> You need to give me at least enough time to wake up :)
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Best wishes
>>>> >>>>>>>>>> Emily Deng
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> -----Original Message-----
>>>> >>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On
>>>> >>>>>>>>>>> Behalf Of Deng, Emily
>>>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>;
>>>> >>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>> >>>>>>>>>>> for tdr
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> -----Original Message-----
>>>> >>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>> >>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>> >>>>>>>>>>>> for tdr
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>> >>>>>>>>>>>>> This then causes a NULL pointer dereference in
>>>> >>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>> >>>>>>>>>>>> See drm_sched_job_cleanup().
>>>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>> >>>>>>>>>>> case, when amdgpu_device_gpu_recover is entered, the job is
>>>> >>>>>>>>>>> already in drm_sched_job_cleanup and is about to be freed.
>>>> >>>>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At
>>>> >>>>>>>>>>> that time, the job is not freed yet, but s_fence is already NULL.
>>>> >>>>>>>>> No, that case can't happen. See here:
>>>> >>>>>>>>>
>>>> >>>>>>>>>>            drm_sched_job_cleanup(s_job);
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>            amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>> >>>>>>>>>>            dma_fence_put(job->fence);
>>>> >>>>>>>>>>            amdgpu_sync_free(&job->sync);
>>>> >>>>>>>>>>            amdgpu_sync_free(&job->sched_sync);
>>>> >>>>>>>>>>            kfree(job);
>>>> >>>>>>>>> The job itself is freed up directly after freeing the reference
>>>> >>>>>>>>> to the s_fence.
>>>> >>>>>>>>> So you are just papering over a much bigger problem here. This
>>>> >>>>>>>>> patch is a clear NAK.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Regards,
>>>> >>>>>>>>> Christian.
>>>> >>>>>>>>>
>>>> >>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>> >>>>>>>>>>>> problem is somewhere else.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Regards,
>>>> >>>>>>>>>>>> Christian.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>> >>>>>>>>>>>>> ---
>>>> >>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>> >>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>> >>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>> >>>>>>>>>>>>>           *
>>>> >>>>>>>>>>>>>           * job->base holds a reference to parent fence
>>>> >>>>>>>>>>>>>           */
>>>> >>>>>>>>>>>>> -        if (job && job->base.s_fence->parent &&
>>>> >>>>>>>>>>>>> +        if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>> >>>>>>>>>>>>>              dma_fence_is_signaled(job->base.s_fence->parent))
>>>> >>>>>>>>>>>>>                  job_signaled = true;
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>                          spin_lock(&rq->lock);
>>>> >>>>>>>>>>>>>                          list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>> >>>>>>>>>>>>> -                                if (bad->s_fence->scheduled.context ==
>>>> >>>>>>>>>>>>> -                                    entity->fence_context) {
>>>> >>>>>>>>>>>>> +                                if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>> >>>>>>>>>>>>> +                                    entity->fence_context)) {
>>>> >>>>>>>>>>>>>                                          if (atomic_read(&bad->karma) >
>>>> >>>>>>>>>>>>>                                              bad->sched->hang_limit)
>>>> >>>>>>>>>>>>>                                                  if (entity->guilty)
>>>> >>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>> >>>>>>>>>>>>>           * This iteration is thread safe as sched thread is stopped.
>>>> >>>>>>>>>>>>>           */
>>>> >>>>>>>>>>>>>          list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> >>>>>>>>>>>>> -                if (s_job->s_fence->parent &&
>>>> >>>>>>>>>>>>> +                if (s_job->s_fence && s_job->s_fence->parent &&
>>>> >>>>>>>>>>>>>                      dma_fence_remove_callback(s_job->s_fence->parent,
>>>> >>>>>>>>>>>>>                                                &s_job->cb)) {
>>>> >>>>>>>>>>>>>                          atomic_dec(&sched->hw_rq_count);
>>>> >>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>> >>>>>>>>>>>>>                           *
>>>> >>>>>>>>>>>>>                           * Job is still alive so fence refcount at least 1
>>>> >>>>>>>>>>>>>                           */
>>>> >>>>>>>>>>>>> -                        dma_fence_wait(&s_job->s_fence->finished, false);
>>>> >>>>>>>>>>>>> +                        if (s_job->s_fence)
>>>> >>>>>>>>>>>>> +                                dma_fence_wait(&s_job->s_fence->finished, false);
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>                          /*
>>>> >>>>>>>>>>>>>                           * We must keep bad job alive for later use during
>>>> >>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>> >>>>>>>>>>>>>           * GPU recovers can't run in parallel.
>>>> >>>>>>>>>>>>>           */
>>>> >>>>>>>>>>>>>          list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> >>>>>>>>>>>>> -                struct dma_fence *fence = s_job->s_fence->parent;
>>>> >>>>>>>>>>>>> +                struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>                  atomic_inc(&sched->hw_rq_count);
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>> _______________________________________________
>>>> >>>>>>>>>>> amd-gfx mailing list
>>>> >>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx 
>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>> >
>>>
>>
>
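
The race Andrey describes in the quoted thread (cancel_delayed_work()
returning true while a timeout handler is still executing) can be sketched in
plain userspace C. This is only a toy model under stated assumptions: struct
model_dwork and every model_* function are hypothetical names invented for
illustration, not kernel APIs, and the "checked" variant is a guess at the
spirit of the work_busy-style test mentioned for basic_fix.patch, which is
not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a delayed work item; NOT the kernel's struct delayed_work. */
struct model_dwork {
	bool pending;	/* timer armed, handler has not started yet */
	bool running;	/* a handler invocation is currently executing */
};

/* Arm the timer. Nothing prevents this while an earlier handler still runs. */
static void model_arm(struct model_dwork *w)
{
	w->pending = true;
}

/* Timer fires: the item stops being pending and its handler starts. */
static void model_handler_begin(struct model_dwork *w)
{
	assert(w->pending);
	w->pending = false;
	w->running = true;
}

static void model_handler_end(struct model_dwork *w)
{
	w->running = false;
}

/*
 * Models the documented return value of cancel_delayed_work(): true only if
 * a pending item was removed. It says nothing about a running handler.
 */
static bool model_cancel(struct model_dwork *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* The check discussed in the thread: free jobs iff the cancel succeeded. */
static bool model_cleanup_may_free(struct model_dwork *w)
{
	return model_cancel(w);
}

/*
 * Hypothetical stricter check (assumption, in the spirit of work_busy()):
 * refuse to free jobs while a handler is executing.
 */
static bool model_cleanup_may_free_checked(struct model_dwork *w)
{
	if (w->running)
		return false;
	return model_cancel(w);
}
```

In this model, calling model_arm(), model_handler_begin() and then model_arm()
again leaves the item both running and pending. model_cleanup_may_free() then
returns true although the handler has not finished, which is the window in
which jobs could be freed under the reset path, while
model_cleanup_may_free_checked() returns false. As noted in the thread, such a
check still does not close races with immediately scheduled work.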