[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Tue Nov 12 19:21:21 UTC 2019
I was able to reproduce the crash by using the attached
simulate_crash.patch - waiting on guilty job to signal in reset work and
artificially rearming the timeout timer just before the check for
!cancel_delayed_work(&sched->work_tdr) in drm_sched_cleanup_jobs -
crash log attached in crash.log. This I think confirms my theory i
described earlier in this thread.
basic_fix.patch handles this by testing whether another timer already
armed ob this scheduler or is there a timeout work in execution right
now (see documentation for work_busy) - obviously this is not a full
solution as this will not protect from races if for example there is
immediate work scheduling such as in drm_sched_fault - so we probably
need to account for this by making drm_sched_cleanup_jobs (at least in
the part where it iterates ring mirror list and frees jobs) and GPU
reset really mutually exclusive and not like now.
Andrey
On 11/11/19 4:11 PM, Christian König wrote:
> Hi Emily,
>
> you need to print which scheduler instance is freeing the jobs and
> which one is triggering the reset. The TID and PID is completely
> meaningless here since we are called from different worker threads and
> the TID/PID can change on each call.
>
> Apart from that I will look into this a bit deeper when I have time.
>
> Regards,
> Christian.
>
> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>> Hi Christian,
>> I add the follow print in function drm_sched_cleanup_jobs. From
>> the log it shows that only use cancel_delayed_work could not avoid to
>> free job when the sched is in reset. But don’t know exactly where it
>> is wrong about the driver. Do you have any suggestion about this?
>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu, pid:%lu\n",
>> current->tgid, current->pid);
>> /*
>> * Don't destroy jobs while the timeout worker is running OR
>> thread
>> * is being parked and hence assumed to not touch
>> ring_mirror_list
>> */
>> if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> !cancel_delayed_work(&sched->work_tdr)))
>> return;
>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
>> current->tgid, current->pid);
>> Best wishes
>> Emily Deng
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
>> timeout, signaled seq=78585, emitted seq=78587
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>> information: process pid 0 thread pid 0,
>> s_job:00000000fe75ab36,tid=15603, pid=15603
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225428] Emily:amdgpu_job_free_cb,Process information: process
>> pid 0 thread pid 0, s_job:00000000fe75ab36, tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>> [11381.225494] Emily:amdgpu_job_free_cb,Process information: process
>> pid 0 thread pid 0, s_job:00000000f086ec84, tid:2262, pid:2262
>> >-----Original Message-----
>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> >Sent: Tuesday, November 12, 2019 11:28 AM
>> >To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>> ><Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>> >
>> >Thinking more about this claim - we assume here that if cancel_delayed_work
>> >returned true it guarantees that timeout work is not running but, it merely
>> >means there was a pending timeout work which was removed from the
>> >workqueue before it's timer elapsed and so it didn't have a chance to be
>> >dequeued and executed, it doesn't cover already executing work. So there is a
>> >possibility where while timeout work started executing another timeout work
>> >already got enqueued (maybe through earlier cleanup jobs or through
>> >drm_sched_fault) and if at this point another drm_sched_cleanup_jobs runs
>> >cancel_delayed_work(&sched->work_tdr) will return true even while there is a
>> >timeout job in progress.
>> >Unfortunately we cannot change cancel_delayed_work to
>> >cancel_delayed_work_sync to flush the timeout work as timeout work itself
>> >waits for schedule thread to be parked again when calling park_thread.
>> >
>> >Andrey
>> >
>> >________________________________________
>> >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>> >Koenig, Christian <Christian.Koenig at amd.com>
>> >Sent: 08 November 2019 05:35:18
>> >To: Deng, Emily; amd-gfx at lists.freedesktop.org
>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>> >
>> >Hi Emily,
>> >
>> >exactly that can't happen. See here:
>> >
>> >> /* Don't destroy jobs while the timeout worker is running */
>> >> if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> >> !cancel_delayed_work(&sched->work_tdr))
>> >> return NULL;
>> >
>> >We never free jobs while the timeout working is running to prevent exactly
>> >that issue.
>> >
>> >Regards,
>> >Christian.
>> >
>> >Am 08.11.19 um 11:32 schrieb Deng, Emily:
>> >> Hi Christian,
>> >> The drm_sched_job_timedout-> amdgpu_job_timedout call
>> >amdgpu_device_gpu_recover. I mean the main scheduler free the jobs while
>> >in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>> >>
>> >> Best wishes
>> >> Emily Deng
>> >>
>> >>
>> >>
>> >>> -----Original Message-----
>> >>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> >>> Sent: Friday, November 8, 2019 6:26 PM
>> >>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>> >>>
>> >>> Hi Emily,
>> >>>
>> >>> well who is calling amdgpu_device_gpu_recover() in this case?
>> >>>
>> >>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>> >>>
>> >>> Regards,
>> >>> Christian.
>> >>>
>> >>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>> >>>> Hi Chrisitan,
>> >>>> No, I am with the new branch and also has the patch. Even it
>> >>>> are freed by
>> >>> main scheduler, how we could avoid main scheduler to free jobs while
>> >>> enter to function amdgpu_device_gpu_recover?
>> >>>> Best wishes
>> >>>> Emily Deng
>> >>>>
>> >>>>
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> >>>>> Sent: Friday, November 8, 2019 6:15 PM
>> >>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> >gfx at lists.freedesktop.org
>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>> >>>>>
>> >>>>> Hi Emily,
>> >>>>>
>> >>>>> in this case you are on an old code branch.
>> >>>>>
>> >>>>> Jobs are freed now by the main scheduler thread and only if no
>> >>>>> timeout handler is running.
>> >>>>>
>> >>>>> See this patch here:
>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>> >>>>>> Author: Christian König <christian.koenig at amd.com>
>> >>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>> >>>>>>
>> >>>>>> drm/scheduler: rework job destruction
>> >>>>> Regards,
>> >>>>> Christian.
>> >>>>>
>> >>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>> >>>>>> Hi Christian,
>> >>>>>> Please refer to follow log, when it enter to
>> >>>>>> amdgpu_device_gpu_recover
>> >>>>> function, the bad job 000000005086879e is freeing in function
>> >>>>> amdgpu_job_free_cb at the same time, because of the hardware fence
>> >>> signal.
>> >>>>> But amdgpu_device_gpu_recover goes faster, at this case, the
>> >>>>> s_fence is already freed, but job is not freed in time. Then this issue
>> >occurs.
>> >>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>> >>> sdma0
>> >>>>>> timeout, signaled seq=2481, emitted seq=2483 [ 449.793202]
>> >>>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>> >>>>> process pid 0 thread pid 0, s_job:000000005086879e [ 449.794163]
>> >>>>> amdgpu
>> >>>>> 0000:00:08.0: GPU reset begin!
>> >>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information:
>> >>>>>> process pid 0 thread pid 0, s_job:000000005086879e [ 449.794221]
>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>> >>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>> >>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>> >>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process pid 0
>> >>>>>> thread pid 0,
>> >>>>> s_job:00000000ea85e922 [ 449.794287]
>> >>>>> Emily:amdgpu_job_free_cb,Process
>> >>>>> information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6 [
>> >>>>> 449.794366] BUG: unable to handle kernel NULL pointer dereference
>> >>>>> at
>> >>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0 [ 449.801040] Oops:
>> >>>>> 0000 [#1] SMP PTI
>> >>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE
>> >>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>> >>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>> >>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 449.802944]
>> >>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [
>> >>>>>> 449.803488]
>> >>> RIP:
>> >>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>> >>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff
>> >>>>>> ff
>> >>>>>> 45 85 e4 0f
>> >>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10
>> >>>>> <48> 8b
>> >>> 98
>> >>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>> >>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>> >>>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>> >>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>> >>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224] RBP:
>> >>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>> >>>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>> >>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0 R14:
>> >>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS:
>> >>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>> >>>>>> knlGS:0000000000000000 [ 449.809674] CS: 0010 DS: 0000 ES: 0000
>> >CR0:
>> >>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0 CR3:
>> >>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747] DR0:
>> >>>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
>> >>>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> >>>>> 0000000000000400 [ 449.811937] Call Trace:
>> >>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>> >>>>>> 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>> >>>>>> 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>> >>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>> >>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>> >>>>>> worker_thread+0x34/0x410 [ 449.814728] kthread+0x121/0x140 [
>> >>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [ 449.815374] ?
>> >>>>>> kthread_create_worker_on_cpu+0x70/0x70
>> >>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>> >>>>>>
>> >>>>>>> -----Original Message-----
>> >>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>> >>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> >>> gfx at lists.freedesktop.org
>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>> >>>>>>> tdr
>> >>>>>>>
>> >>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>> >>>>>>>> Sorry, please take your time.
>> >>>>>>> Have you seen my other response a bit below?
>> >>>>>>>
>> >>>>>>> I can't follow how it would be possible for job->s_fence to be
>> >>>>>>> NULL without the job also being freed.
>> >>>>>>>
>> >>>>>>> So it looks like this patch is just papering over some bigger issues.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Christian.
>> >>>>>>>
>> >>>>>>>> Best wishes
>> >>>>>>>> Emily Deng
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> -----Original Message-----
>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>> >>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>> >>>>> gfx at lists.freedesktop.org
>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>> >>>>>>>>> tdr
>> >>>>>>>>>
>> >>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>> >>>>>>>>>> Ping.....
>> >>>>>>>>> You need to give me at least enough time to wake up :)
>> >>>>>>>>>
>> >>>>>>>>>> Best wishes
>> >>>>>>>>>> Emily Deng
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> -----Original Message-----
>> >>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On
>> >>> Behalf
>> >>>>>>>>>>> Of Deng, Emily
>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>> >>>>>>>>>>> gfx at lists.freedesktop.org
>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>> >>>>>>>>>>> for tdr
>> >>>>>>>>>>>
>> >>>>>>>>>>>> -----Original Message-----
>> >>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>> >>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>> >>>>>>>>>>>> for tdr
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>> >>>>>>>>>>>>> Then it will has null pointer in amdgpu_device_gpu_recover.
>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>> >>>>>>>>>>>> See drm_sched_job_cleanup().
>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>> >>>>>>>>>>> case, when it enter into the amdgpu_device_gpu_recover, it
>> >>>>>>>>>>> already in drm_sched_job_cleanup, and at this time, it will
>> >>>>>>>>>>> go to free
>> >>>>> job.
>> >>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At
>> >>>>>>>>>>> that time, job is not freed, but s_fence is already NULL.
>> >>>>>>>>> No, that case can't happen. See here:
>> >>>>>>>>>
>> >>>>>>>>>> drm_sched_job_cleanup(s_job);
>> >>>>>>>>>>
>> >>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>> >>>>>>>>>> dma_fence_put(job->fence);
>> >>>>>>>>>> amdgpu_sync_free(&job->sync);
>> >>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>> >>>>>>>>>> kfree(job);
>> >>>>>>>>> The job itself is freed up directly after freeing the reference
>> >>>>>>>>> to the
>> >>>>> s_fence.
>> >>>>>>>>> So you are just papering over a much bigger problem here. This
>> >>>>>>>>> patch is a clear NAK.
>> >>>>>>>>>
>> >>>>>>>>> Regards,
>> >>>>>>>>> Christian.
>> >>>>>>>>>
>> >>>>>>>>>>>> When you see a job without an s_fence then that means the
>> >>>>>>>>>>>> problem is somewhere else.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Regards,
>> >>>>>>>>>>>> Christian.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> >>>>>>>>>>>>> ---
>> >>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>> >>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++---
>> >--
>> >>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>> >>> amdgpu_device_gpu_recover(struct
>> >>>>>>>>>>>> amdgpu_device *adev,
>> >>>>>>>>>>>>> *
>> >>>>>>>>>>>>> * job->base holds a reference to parent fence
>> >>>>>>>>>>>>> */
>> >>>>>>>>>>>>> - if (job && job->base.s_fence->parent &&
>> >>>>>>>>>>>>> + if (job && job->base.s_fence &&
>> >>>>>>>>>>>>> + job->base.s_fence->parent
>> >>>>>>> &&
>> >>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence->parent))
>> >>>>>>>>>>>>> job_signaled = true;
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644
>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>> >drm_sched_increase_karma(struct
>> >>>>>>>>>>>> drm_sched_job
>> >>>>>>>>>>>>> *bad)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> spin_lock(&rq->lock);
>> >>>>>>>>>>>>> list_for_each_entry_safe(entity,
>> >>>>>>>>>>>>> tmp,
>> >>> &rq-
>> >>>>>>>> entities,
>> >>>>>>>>>>>> list) {
>> >>>>>>>>>>>>> - if (bad->s_fence->scheduled.context
>> >>>>>>> ==
>> >>>>>>>>>>>>> - entity->fence_context) {
>> >>>>>>>>>>>>> + if (bad->s_fence &&
>> >>>>>>>>>>>>> + (bad->s_fence-
>> >>>>>>>>>>>>> scheduled.context ==
>> >>>>>>>>>>>>> + entity->fence_context)) {
>> >>>>>>>>>>>>> if
>> >>>>>>>>>>>>> (atomic_read(&bad-
>> >>>>>>>> karma) >
>> >>>>>>>>>>>>> bad->sched-
>> >>>> hang_limit)
>> >>>>>>>>>>>>> if
>> >>>>>>>>>>>>> (entity-
>> >>>> guilty) @@ -376,7 +376,7 @@ void
>> >>>>>>>>>>>>> drm_sched_stop(struct
>> >>>>>>> drm_gpu_scheduler
>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>> >>>>>>>>>>>>> * This iteration is thread safe as sched thread
>> >>>>>>>>>>>>> is
>> >>> stopped.
>> >>>>>>>>>>>>> */
>> >>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp,
>> >>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>> >>>>>>>>>>>>> - if (s_job->s_fence->parent &&
>> >>>>>>>>>>>>> + if (s_job->s_fence && s_job->s_fence->parent &&
>> >>>>>>>>>>>>> dma_fence_remove_callback(s_job-
>> >>>> s_fence-
>> >>>>>>>> parent,
>> >>>>>>>>>>>>> &s_job->cb)) {
>> >>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>> >>> @@ -
>> >>>>>>> 395,7
>> >>>>>>>>>>> +395,8 @@ void
>> >>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>> >>>>>>>>>>>>> *
>> >>>>>>>>>>>>> * Job is still alive so fence
>> >>>>>>>>>>>>> refcount at
>> >>> least 1
>> >>>>>>>>>>>>> */
>> >>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>> >>>>>>> false);
>> >>>>>>>>>>>>> + if (s_job->s_fence)
>> >>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>> >>>>>>>> finished,
>> >>>>>>>>>>>> false);
>> >>>>>>>>>>>>> /*
>> >>>>>>>>>>>>> * We must keep bad job alive
>> >>>>>>>>>>>>> for later
>> >>> use
>> >>>>>>> during @@
>> >>>>>>>>>>>> -438,7
>> >>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>> >>>>> *sched,
>> >>>>>>>>>>>>> +bool
>> >>>>>>>>>>>> full_recovery)
>> >>>>>>>>>>>>> * GPU recovers can't run in parallel.
>> >>>>>>>>>>>>> */
>> >>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>> >>>>>>>>>>>>> &sched->ring_mirror_list,
>> >>>>>>>>>>>>> node)
>> >>>>>>>>>>>> {
>> >>>>>>>>>>>>> - struct dma_fence *fence = s_job->s_fence->parent;
>> >>>>>>>>>>>>> + struct dma_fence *fence = s_job->s_fence ?
>> >>>>>>>>>>>>> + s_job-
>> >>>>>>>> s_fence-
>> >>>>>>>>>>>>> parent :
>> >>>>>>>>>>>>> +NULL;
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>> >>>>>>>>>>>>>
>> >>>>>>>>>>> _______________________________________________
>> >>>>>>>>>>> amd-gfx mailing list
>> >>>>>>>>>>> amd-gfx at lists.freedesktop.org
>> >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>> >
>> >_______________________________________________
>> >amd-gfx mailing list
>> >amd-gfx at lists.freedesktop.org
>> >https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191112/6b80075c/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: basic_fix.patch
Type: text/x-patch
Size: 647 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191112/6b80075c/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simulate_crash.patch
Type: text/x-patch
Size: 1473 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191112/6b80075c/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crash.log
Type: text/x-log
Size: 5705 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191112/6b80075c/attachment-0005.bin>
More information about the amd-gfx
mailing list