[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Mon Nov 18 14:07:52 UTC 2019


Thanks Emily.

Christian - ping for review.

Andrey

On 11/14/19 11:39 PM, Deng, Emily wrote:
> Hi Andrey,
>       Currently I am busy with another issue; maybe I will try next week.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> Sent: Friday, November 15, 2019 6:14 AM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Attached.
>>
>> Emily - can you give it a try?
>>
>> Andrey
>>
>> On 11/14/19 3:12 AM, Christian König wrote:
>>>> What about instead of peeking at the job to actually remove it from
>>>> ring_mirror_list right there,
>>> Also an interesting idea. We would need to protect the mirror list
>>> with a lock again, but that should be the lesser evil.
>>>
>>> Maybe prototype that and see if it works or not.
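>>>
>>> A minimal sketch of that (assuming we reuse the existing
>>> sched->job_list_lock to protect ring_mirror_list here as well) could
>>> look like this in drm_sched_job_timedout():
>>>
>>>     struct drm_sched_job *job;
>>>
>>>     spin_lock(&sched->job_list_lock);
>>>     job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>                                    struct drm_sched_job, node);
>>>     if (job)
>>>             /*
>>>              * Take the bad job off the mirror list so a concurrent
>>>              * drm_sched_cleanup_jobs() cannot free it under us; it
>>>              * would be reinserted once the sched thread is parked.
>>>              */
>>>             list_del_init(&job->node);
>>>     spin_unlock(&sched->job_list_lock);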
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 13.11.19 at 17:00, Andrey Grodzovsky wrote:
>>>>
>>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>>> Another more fundamental question: Could we get rid of the timeout
>>>>> job at all?
>>>>
>>>> There is other stuff in there besides picking the first unfinished job
>>>> which is common to all the drivers - such as freeing the guilty
>>>> signaled job and rearming the timeout work timer.
>>>>
>>>>
>>>>> I mean we used to pass this as a parameter to the scheduler callback
>>>>> because we had the timeout worker in the job, but that is no longer
>>>>> the case.
>>>>>
>>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>                                         struct drm_sched_job, node);
>>>>> Why don't we just remove that here and only get the first job after
>>>>> we have stopped the scheduler?
>>>>
>>>> Should be OK, since we have the extra check for __kthread_should_park
>>>> in drm_sched_cleanup_jobs which will protect us in this case from a
>>>> wakeup of the sched thread and execution of drm_sched_cleanup_jobs
>>>> after we have already parked it. The problem here is that we need the
>>>> drm_sched_job to access the private data of each client driver (see
>>>> amdgpu_job_timedout for example). What about, instead of just peeking
>>>> at the job, actually removing it from ring_mirror_list right there and
>>>> going ahead with it through the reset routine? If it's signaled in the
>>>> meanwhile, great - release it; otherwise put it back into
>>>> ring_mirror_list in drm_sched_resubmit_jobs.
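>>>>
>>>> Roughly, and untested - a sketch of that flow (assuming the job was
>>>> already taken off ring_mirror_list and the sched thread is parked):
>>>>
>>>>     if (job->s_fence->parent &&
>>>>         dma_fence_is_signaled(job->s_fence->parent)) {
>>>>             /* signaled in the meanwhile - just release it */
>>>>             sched->ops->free_job(job);
>>>>     } else {
>>>>             /* put it back so drm_sched_resubmit_jobs sees it again */
>>>>             spin_lock(&sched->job_list_lock);
>>>>             list_add(&job->node, &sched->ring_mirror_list);
>>>>             spin_unlock(&sched->job_list_lock);
>>>>     }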
>>>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 13.11.19 at 15:12, Andrey Grodzovsky wrote:
>>>>>> This is why I asked for a trace with the timer enabled, but since
>>>>>> there is a finite number of places where we touch the timer, Emily
>>>>>> can just put prints there. Also, I wonder whether this temporary fix
>>>>>> helps her with the issue or not.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>>> The question is where do we rearm the timer for this problem to
>>>>>>> occur?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 12.11.19 at 20:21, Andrey Grodzovsky wrote:
>>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>>> simulate_crash.patch - waiting on the guilty job to signal in the
>>>>>>>> reset work and artificially rearming the timeout timer just before
>>>>>>>> the check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. This, I
>>>>>>>> think, confirms the theory I described earlier in this thread.
>>>>>>>>
>>>>>>>> basic_fix.patch handles this by testing whether another timer is
>>>>>>>> already armed on this scheduler, or whether there is timeout work
>>>>>>>> in execution right now (see the documentation for work_busy).
>>>>>>>> Obviously this is not a full solution, as it will not protect
>>>>>>>> against races if, for example, there is immediate work scheduling
>>>>>>>> such as in drm_sched_fault - so we probably need to account for
>>>>>>>> this by making drm_sched_cleanup_jobs (at least the part where it
>>>>>>>> iterates the ring mirror list and frees jobs) and GPU reset really
>>>>>>>> mutually exclusive, not like now.
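>>>>>>>>
>>>>>>>> The shape of the basic_fix.patch check is roughly this (a sketch
>>>>>>>> only - the attached patch is authoritative):
>>>>>>>>
>>>>>>>>     /* bail out of job cleanup if a timeout timer is pending or
>>>>>>>>      * the timeout work is executing right now, see work_busy() */
>>>>>>>>     if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>         (timer_pending(&sched->work_tdr.timer) ||
>>>>>>>>          work_busy(&sched->work_tdr.work)))
>>>>>>>>             return;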
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>>> Hi Emily,
>>>>>>>>>
>>>>>>>>> you need to print which scheduler instance is freeing the jobs
>>>>>>>>> and which one is triggering the reset. The TID and PID are
>>>>>>>>> completely meaningless here, since we are called from different
>>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>>
>>>>>>>>> Apart from that I will look into this a bit deeper when I have
>>>>>>>>> time.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 12.11.19 at 07:02, Deng, Emily wrote:
>>>>>>>>>> Hi Christian,
>>>>>>>>>>      I added the following prints in function drm_sched_cleanup_jobs.
>>>>>>>>>> From the log it shows that using only cancel_delayed_work cannot
>>>>>>>>>> avoid freeing a job while the sched is in reset. But I don't know
>>>>>>>>>> exactly where the driver goes wrong. Do you have any suggestion
>>>>>>>>>> about this?
>>>>>>>>>>
>>>>>>>>>> +	printk("Emily:drm_sched_cleanup_jobs:begin,tid:%d, pid:%d\n",
>>>>>>>>>> +	       current->tgid, current->pid);
>>>>>>>>>>  	/*
>>>>>>>>>>  	 * Don't destroy jobs while the timeout worker is running OR thread
>>>>>>>>>>  	 * is being parked and hence assumed to not touch ring_mirror_list
>>>>>>>>>>  	 */
>>>>>>>>>>  	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>  	    !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>>  		return;
>>>>>>>>>> +	printk("Emily:drm_sched_cleanup_jobs,tid:%d, pid:%d\n",
>>>>>>>>>> +	       current->tgid, current->pid);
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:00000000fe75ab36,tid=15603, pid=15603
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225428] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000fe75ab36, tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225494] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000f086ec84, tid:2262, pid:2262
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>>>> Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>> Thinking more about this claim - we assume here that if
>>>>>>>>>>> cancel_delayed_work returned true it guarantees that the timeout
>>>>>>>>>>> work is not running, but it merely means there was a pending
>>>>>>>>>>> timeout work which was removed from the workqueue before its
>>>>>>>>>>> timer elapsed, and so it didn't have a chance to be dequeued and
>>>>>>>>>>> executed; it doesn't cover already-executing work. So there is a
>>>>>>>>>>> possibility that while one timeout work started executing,
>>>>>>>>>>> another timeout work already got enqueued (maybe through earlier
>>>>>>>>>>> cleanup jobs or through drm_sched_fault), and if at this point
>>>>>>>>>>> another drm_sched_cleanup_jobs runs,
>>>>>>>>>>> cancel_delayed_work(&sched->work_tdr) will return true even
>>>>>>>>>>> while there is a timeout job in progress.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>>>> cancel_delayed_work_sync to flush the timeout work, as the
>>>>>>>>>>> timeout work itself waits for the scheduler thread to be parked
>>>>>>>>>>> again when calling park_thread.
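>>>>>>>>>>>
>>>>>>>>>>> A hypothetical interleaving that shows the hole (CPU A runs the
>>>>>>>>>>> timeout worker, CPU B the scheduler thread):
>>>>>>>>>>>
>>>>>>>>>>>     /*
>>>>>>>>>>>      * CPU A: drm_sched_job_timedout() starts executing
>>>>>>>>>>>      * CPU ?: work_tdr is queued again (earlier job cleanup
>>>>>>>>>>>      *        rearming the timer, or drm_sched_fault())
>>>>>>>>>>>      * CPU B: drm_sched_cleanup_jobs() calls
>>>>>>>>>>>      *        cancel_delayed_work(&sched->work_tdr), which
>>>>>>>>>>>      *        returns true because it cancelled the *new*
>>>>>>>>>>>      *        pending work - while the old handler on CPU A
>>>>>>>>>>>      *        is still mid-flight and may touch the job list
>>>>>>>>>>>      */
>>>>>>>>>>>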
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>> behalf of
>>>>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>> Sent: 08 November 2019 05:35:18
>>>>>>>>>>> To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>
>>>>>>>>>>> exactly that can't happen. See here:
>>>>>>>>>>>
>>>>>>>>>>>>           /* Don't destroy jobs while the timeout worker is running */
>>>>>>>>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>               !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>>>>>                   return NULL;
>>>>>>>>>>> We never free jobs while the timeout worker is running, to
>>>>>>>>>>> prevent exactly that issue.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> On 08.11.19 at 11:32, Deng, Emily wrote:
>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>         The drm_sched_job_timedout -> amdgpu_job_timedout path calls
>>>>>>>>>>>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs
>>>>>>>>>>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is
>>>>>>>>>>>> called.
>>>>>>>>>>>> Best wishes
>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>
>>>>>>>>>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>>>>>>>>
>>>>>>>>>>>>> When it's not the scheduler we shouldn't have a guilty job in
>>>>>>>>>>>>> the first place.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08.11.19 at 11:22, Deng, Emily wrote:
>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>          No, I am on the new branch and it also has the patch. Even
>>>>>>>>>>>>>> if they are freed by the main scheduler, how could we prevent the
>>>>>>>>>>>>>> main scheduler from freeing jobs while we are in the function
>>>>>>>>>>>>>> amdgpu_device_gpu_recover?
>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> in this case you are on an old code branch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>>>>>>>>>> timeout handler is running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See this patch here:
>>>>>>>>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>           Please refer to the following log. When it enters the
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>>>>>>>>>>>>>>> being freed in function amdgpu_job_free_cb at the same time, because
>>>>>>>>>>>>>>>> the hardware fence signaled. But amdgpu_device_gpu_recover goes
>>>>>>>>>>>>>>>> faster; in this case the s_fence is already freed, but the job is
>>>>>>>>>>>>>>>> not freed in time. Then this issue occurs.
>>>>>>>>>>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>>>>>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>>>>>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>>>>>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>>>>>>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>>>>>>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>>>>>>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>>>>>>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>>>>>>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>>>>>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>>>>>>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>>>>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>>>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>>>>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>>>>>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>>>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>>>>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>>>>>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>>>>>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>>>>>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>>>>>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>>>>>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>>>>>>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>>>>>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>> [  449.811937] Call Trace:
>>>>>>>>>>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>>>>>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>>>>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>>>>>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>>>>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>>>>>>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>>>>>>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>>>>>>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>>>>>>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>>>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>>>>>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>>>>>>>>>>>>>>> without the job also being freed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>>>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>>>>>>>>> You need to give me at least enough time to wake up
>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Deng, Emily
>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>>>>>>>>>> Then it will have a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
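>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For reference, drm_sched_job_cleanup() at this point is essentially just:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>>>>>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>>>>>>>             /* drop the scheduler fence and clear the pointer */
>>>>>>>>>>>>>>>>>>>>>>             dma_fence_put(&job->s_fence->finished);
>>>>>>>>>>>>>>>>>>>>>>             job->s_fence = NULL;
>>>>>>>>>>>>>>>>>>>>>>     }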
>>>>>>>>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one case,
>>>>>>>>>>>>>>>>>>>>> when it enters amdgpu_device_gpu_recover, the job is already in
>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup, and at this time it will go on to free the
>>>>>>>>>>>>>>>>>>>>> job. But amdgpu_device_gpu_recover is sometimes faster. At that
>>>>>>>>>>>>>>>>>>>>> time the job is not yet freed, but s_fence is already NULL.
>>>>>>>>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>>>>>>>>> The job itself is freed up directly after freeing the reference to
>>>>>>>>>>>>>>>>>>> the s_fence.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So you are just papering over a much bigger problem here. This patch
>>>>>>>>>>>>>>>>>>> is a clear NAK.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> When you see a job without an s_fence then that means the problem
>>>>>>>>>>>>>>>>>>>>>> is somewhere else.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>  			 *
>>>>>>>>>>>>>>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>>>>>>>>>  			 */
>>>>>>>>>>>>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>  			/*
>>>>>>>>>>>>>>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);

