[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Mon Nov 18 17:01:57 UTC 2019


The documentation states it can be used safely with concurrent
list_del_init, so I assumed that was true - but I think my even bigger
mistake is that, without locking, I just do list_first_entry right after
list_empty_careful, and by this can grab a pointer to the same job as a
concurrent drm_sched_job_timedout->list_first_entry_or_null - so yes, I
see now I have to use locking as you advised there, and then I don't need
the list_empty_careful.

Andrey

On 11/18/19 11:44 AM, Christian König wrote:
> list_empty_careful() should only be used for optimizing cases, but 
> never if you need to rely on the result.
>
> The problem is that the function doesn't have any memory barriers 
> whatsoever; it just checks whether the next and prev pointers are both 
> empty instead of just the next pointer.
>
> Christian.
>
> Am 18.11.19 um 17:23 schrieb Andrey Grodzovsky:
>> Can you explain why ? As I see it - list_empty_careful is 
>> specifically designed for the case where the only other concurrent 
>> operation in progress is list_del_init 
>> (https://www.kernel.org/doc/htmldocs/kernel-api/API-list-empty-careful.html) 
>> - which is exactly what happens in this patch, no other list altering 
>> operation can take place concurrently - so to me it looks safe to use.
>>
>> Andrey
>>
>> On 11/18/19 11:16 AM, Christian König wrote:
>>> Hi Andrey,
>>>
>>> the only thing which doesn't look so good is the switch to 
>>> list_empty_careful in drm_sched_cleanup_jobs.
>>>
>>> We either take the lock here or we don't, but please not that extra 
>>> checking.
>>>
>>> Christian.
>>>
>>> Am 18.11.19 um 15:07 schrieb Andrey Grodzovsky:
>>>> Thanks Emily.
>>>>
>>>> Christian - ping for review.
>>>>
>>>> Andrey
>>>>
>>>> On 11/14/19 11:39 PM, Deng, Emily wrote:
>>>>> Hi Andrey,
>>>>>       Currently, I am busy with another issue; maybe I will try
>>>>> next week.
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>> Sent: Friday, November 15, 2019 6:14 AM
>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Attached.
>>>>>>
>>>>>> Emily - can you give it a try ?
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 11/14/19 3:12 AM, Christian König wrote:
>>>>>>>> What about instead of peeking at the job to actually remove it 
>>>>>>>> from
>>>>>>>> ring_mirror_list right there,
>>>>>>> Also an interesting idea. We would need to protect the mirror list
>>>>>>> with a lock again, but that should be the lesser evil.
>>>>>>>
>>>>>>> Maybe prototype that and see if it works or not.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>>>>>>> Another more fundamental question: Could we get rid of the 
>>>>>>>>> timeout
>>>>>>>>> job at all?
>>>>>>>>
>>>>>>>> There is other stuff there besides picking the first unfinished
>>>>>>>> job which is common for all the drivers - such as freeing the
>>>>>>>> guilty signaled job and rearming the timeout work timer.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I mean we used to give this as parameter to the scheduler 
>>>>>>>>> callback
>>>>>>>>> because we had the timeout worker in the job, but that is no 
>>>>>>>>> longer
>>>>>>>>> the case.
>>>>>>>>>
>>>>>>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>>>>>>          job = 
>>>>>>>>>> list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>>>>>                                         struct drm_sched_job, 
>>>>>>>>>> node);
>>>>>>>>> Why don't we just remove that here and only get the first job 
>>>>>>>>> after
>>>>>>>>> we have stopped the scheduler?
>>>>>>>>
>>>>>>>> Should be ok since we have the extra check for 
>>>>>>>> __kthread_should_park
>>>>>>>> in drm_sched_cleanup_jobs which will protect us in this case 
>>>>>>>> from a
>>>>>>>> wakeup of the sched thread and execution of drm_sched_cleanup_jobs
>>>>>>>> after we already parked it. The problem here is we need the
>>>>>>>> drm_sched_job to access the private data for each client driver 
>>>>>>>> (see
>>>>>>>> amdgpu_job_timedout for example). What about instead of peeking at
>>>>>>>> the job to actually remove it from ring_mirror_list right 
>>>>>>>> there, go
>>>>>>>> ahead with it through the reset routine, if it's signaled in the
>>>>>>>> meanwhile that great - release it, otherwise put it back into
>>>>>>>> ring_mirror_list in drm_sched_resubmit_jobs.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>>>>>> This is why I asked for a trace with the timer enabled, but since 
>>>>>>>>>> there is
>>>>>>>>>> a finite number of places we touch the timer Emily can just put
>>>>>>>>>> prints there. Also, I wonder if this temp fix helps her with the
>>>>>>>>>> issue or not.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>>>>>>> The question is where do we rearm the timer for this problem to
>>>>>>>>>>> occur?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>>>>>>> simulate_crash.patch - waiting on guilty job to signal in 
>>>>>>>>>>>> reset
>>>>>>>>>>>> work and artificially rearming the timeout timer just 
>>>>>>>>>>>> before the
>>>>>>>>>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. 
>>>>>>>>>>>> This I
>>>>>>>>>>>> think confirms my theory I described earlier in this thread.
>>>>>>>>>>>>
>>>>>>>>>>>> basic_fix.patch handles this by testing whether another timer is
>>>>>>>>>>>> already armed on this scheduler or a timeout work is in
>>>>>>>>>>>> execution right now (see the documentation for work_busy) - 
>>>>>>>>>>>> obviously
>>>>>>>>>>>> this is not a full solution as this will not protect from 
>>>>>>>>>>>> races
>>>>>>>>>>>> if for example there is immediate work scheduling such as in
>>>>>>>>>>>> drm_sched_fault -  so we probably need to account for this by
>>>>>>>>>>>> making drm_sched_cleanup_jobs (at least in the part where it
>>>>>>>>>>>> iterates ring mirror list and frees jobs) and GPU reset really
>>>>>>>>>>>> mutually exclusive and not like now.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>
>>>>>>>>>>>>> you need to print which scheduler instance is freeing the 
>>>>>>>>>>>>> jobs
>>>>>>>>>>>>> and which one is triggering the reset. The TID and PID is
>>>>>>>>>>>>> completely meaningless here since we are called from 
>>>>>>>>>>>>> different
>>>>>>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apart from that I will look into this a bit deeper when I 
>>>>>>>>>>>>> have
>>>>>>>>>>>>> time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>      I add the follow print in function 
>>>>>>>>>>>>>> drm_sched_cleanup_jobs.
>>>>>>>>>>>>>>  From the log it shows that using only cancel_delayed_work
>>>>>>>>>>>>>> could not avoid freeing the job while the sched is in reset.
>>>>>>>>>>>>>> But I don't know exactly where the driver goes wrong. Do you
>>>>>>>>>>>>>> have any suggestions about this?
>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
>>>>>>>>>>>>>> pid:%lu\n", current->tgid, current->pid);
>>>>>>>>>>>>>>          /*
>>>>>>>>>>>>>>           * Don't destroy jobs while the timeout worker is
>>>>>>>>>>>>>> running  OR thread
>>>>>>>>>>>>>>           * is being parked and hence assumed to not touch
>>>>>>>>>>>>>> ring_mirror_list
>>>>>>>>>>>>>>           */
>>>>>>>>>>>>>>           if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>>>>>>                  return;
>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
>>>>>>>>>>>>>> current->tgid, current->pid);
>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>>>>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587 Nov 12
>>>>>>>>>>>>>> 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>>>>>>>> Process information: process pid 0 thread pid 0,
>>>>>>>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603 Nov 12 12:58:20
>>>>>>>>>>>>>> ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000fe75ab36, 
>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253,
>>>>>>>>>>>>>> pid:2253
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000f086ec84, 
>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>>>>>>>> Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, 
>>>>>>>>>>>>>>> Emily
>>>>>>>>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>> Thinking more about this claim - we assume here that if
>>>>>>>>>>>>>>> cancel_delayed_work returned true it guarantees that the
>>>>>>>>>>>>>>> timeout work is not running, but it merely means there was a
>>>>>>>>>>>>>>> pending timeout work which was removed from the workqueue
>>>>>>>>>>>>>>> before its timer elapsed and so didn't have a chance to be
>>>>>>>>>>>>>>> dequeued and executed; it doesn't cover already executing
>>>>>>>>>>>>>>> work. So there is a possibility where, while a timeout work
>>>>>>>>>>>>>>> started executing, another timeout work already got enqueued
>>>>>>>>>>>>>>> (maybe through earlier cleanup jobs or through
>>>>>>>>>>>>>>> drm_sched_fault), and if at this point another
>>>>>>>>>>>>>>> drm_sched_cleanup_jobs runs,
>>>>>>>>>>>>>>> cancel_delayed_work(&sched->work_tdr) will return true even
>>>>>>>>>>>>>>> while there is a timeout job in progress.
>>>>>>>>>>>>>>> Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>>>>>>>> cancel_delayed_work_sync to flush the timeout work, as the
>>>>>>>>>>>>>>> timeout work itself waits for the scheduler thread to be
>>>>>>>>>>>>>>> parked again when calling park_thread.
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>> Sent: 08 November 2019 05:35:18
>>>>>>>>>>>>>>> To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> exactly that can't happen. See here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>           /* Don't destroy jobs while the timeout 
>>>>>>>>>>>>>>>> worker is
>>>>>>>>>>>>>> running */
>>>>>>>>>>>>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>>>>>>>>>                   return NULL;
>>>>>>>>>>>>>>> We never free jobs while the timeout worker is running to
>>>>>>>>>>>>>> prevent exactly
>>>>>>>>>>>>>>> that issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>         The drm_sched_job_timedout->amdgpu_job_timedout calls
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover. I mean the main scheduler frees
>>>>>>>>>>>>>>>> the jobs while in amdgpu_device_gpu_recover, and before
>>>>>>>>>>>>>>>> calling drm_sched_stop.
>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> well who is calling amdgpu_device_gpu_recover() in 
>>>>>>>>>>>>>>>>> this case?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When it's not the scheduler we shouldn't have a guilty 
>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>> in the first place.
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>>          No, I am on the new branch and also have the
>>>>>>>>>>>>>>>>>> patch. Even if the jobs are freed by the main scheduler,
>>>>>>>>>>>>>>>>>> how could we avoid the main scheduler freeing jobs while
>>>>>>>>>>>>>>>>>> entering the function amdgpu_device_gpu_recover?
>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> in this case you are on an old code branch.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Jobs are freed now by the main scheduler thread and 
>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> if no
>>>>>>>>>>>>>>>>>>> timeout handler is running.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> See this patch here:
>>>>>>>>>>>>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>>>>>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>>>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>>>>           Please refer to the following log. When it
>>>>>>>>>>>>>>>>>>>> enters the amdgpu_device_gpu_recover function, the bad
>>>>>>>>>>>>>>>>>>>> job 000000005086879e is being freed in function
>>>>>>>>>>>>>>>>>>>> amdgpu_job_free_cb at the same time, because of the
>>>>>>>>>>>>>>>>>>>> hardware fence signal. But amdgpu_device_gpu_recover goes
>>>>>>>>>>>>>>>>>>>> faster; in this case, the s_fence is already freed, but
>>>>>>>>>>>>>>>>>>>> the job is not freed in time. Then this issue occurs.
>>>>>>>>>>>>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>>>>>>>>> *ERROR* ring
>>>>>>>>>>>>>>>>> sdma0
>>>>>>>>>>>>>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [
>>>>>>>>>>>>>>>>>>>> 449.793202] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>> *ERROR*
>>>>>>>>>>>>>>>>>>>> Process
>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>> process pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>>>>>>>> 449.794163]
>>>>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>>> process pid 0 thread  pid 0, s_job:000000005086879e [
>>>>>>>>>>>>>> 449.794221]
>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>> thread pid 0,
>>>>>>>>>>>>>>>>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>>>>> information: process pid 0 thread pid 0,
>>>>>>>>>>>>>> s_job:00000000ed3a5ac6 [
>>>>>>>>>>>>>>>>>>> 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>>>>>>>>> dereference
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0
>>>>>> [  449.801040]
>>>>>>>>>>>>>> Oops:
>>>>>>>>>>>>>>>>>>> 0000 [#1] SMP PTI
>>>>>>>>>>>>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 
>>>>>>>>>>>>>>>>>>>> Tainted:
>>>>>>>>>>>>>> G OE
>>>>>>>>>>>>>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>>>>>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX
>>>>>> +
>>>>>>>>>>>>>> PIIX,
>>>>>>>>>>>>>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [
>>>>>>>>>>>>>>>>>>>> 449.802944]
>>>>>>>>>>>>>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>> [
>>>>>>>>>>>>>>>>>>>> 449.803488]
>>>>>>>>>>>>>>>>> RIP:
>>>>>>>>>>>>>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>>>>>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 
>>>>>>>>>>>>>>>>>>>> 0f 85
>>>>>>>>>>>>>> 56 ff ff
>>>>>>>>>>>>>>>>>>>> ff
>>>>>>>>>>>>>>>>>>>> 45 85 e4 0f
>>>>>>>>>>>>>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 
>>>>>>>>>>>>>>>>>>> 00 48
>>>>>>>>>>>>>> 8b 40 10
>>>>>>>>>>>>>>>>>>> <48> 8b
>>>>>>>>>>>>>>>>> 98
>>>>>>>>>>>>>>>>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43
>>>>>>>>>>>>>> 48 a8 01
>>>>>>>>>>>>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS:
>>>>>>>>>>>>>> 00010286 [
>>>>>>>>>>>>>>>>>>>> 449.806032] RAX: 0000000000000000 RBX:
>>>>>> 0000000000000000
>>>>>>>>>>>>>> RCX:
>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0
>>>>>> RSI:
>>>>>>>>>>>>>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224]
>>>>>> RBP:
>>>>>>>>>>>>>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09:
>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>>> 449.807818] R10: 0000000000000000 R11:
>>>>>> 0000000000000148
>>>>>>>>>>>>>> R12:
>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0
>>>>>> R14:
>>>>>>>>>>>>>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 
>>>>>>>>>>>>>>>>>>>> 449.809004] FS:
>>>>>>>>>>>>>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>>>>>>>>>>>>>> knlGS:0000000000000000 [ 449.809674] CS:  0010 DS: 
>>>>>>>>>>>>>>>>>>>> 0000
>>>>>>>>>>>>>> ES: 0000
>>>>>>>>>>>>>>> CR0:
>>>>>>>>>>>>>>>>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0
>>>>>> CR3:
>>>>>>>>>>>>>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747]
>>>>>> DR0:
>>>>>>>>>>>>>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>> 449.811344] DR3: 0000000000000000 DR6:
>>>>>> 00000000fffe0ff0 DR7:
>>>>>>>>>>>>>>>>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>>>>>>>>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140
>>>>>> [amdgpu]
>>>>>>>>>>>>>>>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>>>>>>> [amd_sched] [ 449.813139]  ?
>>>>>>>>>>>>>>>>>>>> amdgpu_cgs_destroy_device+0x10/0x10
>>>>>>>>>>>>>> [amdgpu] [
>>>>>>>>>>>>>>>>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>> [amd_sched] [
>>>>>>>>>>>>>>>>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>>>>>>>>>>>>>>>>>>>> worker_thread+0x34/0x410 [ 449.814728]
>>>>>>>>>>>>>> kthread+0x121/0x140 [
>>>>>>>>>>>>>>>>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [
>>>>>>>>>>>>>> 449.815374]  ?
>>>>>>>>>>>>>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>>>>>>>>>>>>> [  449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>>>>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I can't follow how it would be possible for
>>>>>>>>>>>>>> job->s_fence to be
>>>>>>>>>>>>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So it looks like this patch is just papering over 
>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> bigger issues.
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null 
>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>>>>>>>>>>>>> You need to give me at least enough time to wake up
>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>> From: amd-gfx
>>>>>>>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>>>>>>>>>>> Behalf
>>>>>>>>>>>>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>> From: Christian König
>>>>>>>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is
>>>>>>>>>>>>>>>>>>>>>>>>>>> freed. Then it will have a null pointer in
>>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the 
>>>>>>>>>>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>> is destroyed.
>>>>>>>>>>>>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup.
>>>>>>>>>>>>>>>>>>>>>>>>> But in one case, when it enters
>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover, it is already in
>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup, and at this time, it will go
>>>>>>>>>>>>>>>>>>>>>>>>> on to free the job. But amdgpu_device_gpu_recover is
>>>>>>>>>>>>>>>>>>>>>>>>> sometimes faster. At that time, the job is not
>>>>>>>>>>>>>>>>>>>>>>>>> freed, but s_fence is already NULL.
>>>>>>>>>>>>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>>>>>>>>>>>>> The job itself is freed up directly after 
>>>>>>>>>>>>>>>>>>>>>>> freeing the
>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>> s_fence.
>>>>>>>>>>>>>>>>>>>>>>> So you are just papering over a much bigger problem
>>>>>>>>>>>>>> here. This
>>>>>>>>>>>>>>>>>>>>>>> patch is a clear NAK.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> When you see a job without an s_fence then that
>>>>>>>>>>>>>> means the
>>>>>>>>>>>>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>>>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>>>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		 *
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		 */
>>>>>>>>>>>>>>>>>>>>>>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>>>>>>>>>>>> +		if (s_job->s_fence)
>>>>>>>>>>>>>>>>>>>>>>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		/*
>>>>>>>>>>>>>>>>>>>>>>>>>>>  		 * We must keep bad job alive for later use during
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct
>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>> *sched,
>>>>>>>>>>>>>>>>>>>>>>>>>>> +bool
>>>>>>>>>>>>>>>>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>>>>>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>>>>> - struct dma_fence *fence =
>>>>>>>>>>>>>> s_job->s_fence->parent;
>>>>>>>>>>>>>>>>>>>>>>>>>>> + struct dma_fence *fence =
>>>>>>>>>>>>>> s_job->s_fence ?
>>>>>>>>>>>>>>>>>>>>>>>>>>> + s_job-
>>>>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>>>>>> parent :
>>>>>>>>>>>>>>>>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>>>>>>
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

