[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Mon Nov 18 20:01:27 UTC 2019

Well then we should probably update the documentation.

Take a look at the implementation, there is no compiler or SMP barrier 
at all.

Christian.

Am 18.11.19 um 18:01 schrieb Andrey Grodzovsky:
> The documentation states it can be used safely with concurrent 
> list_del_init so I assume it's true - but I think my even bigger 
> mistake is that without locking i just do list_first_entry right after 
> list_empty_careful and by this can grab pointer to the same job as 
> concurrent drm_sched_job_timedout->list_first_entry_or_null - so yes I 
> see now i have to use locking as you advised there and then i don't 
> need the list_empty_careful.
>
> Andrey
>
> On 11/18/19 11:44 AM, Christian König wrote:
>> list_empty_careful() should only be used for optimizing cases, but 
>> never if you need to rely on the result.
>>
>> The problem is that the function doesn't has any memory barriers 
>> whatsoever, it just checks if the next and prev pointer are both 
>> empty instead of just the next pointer.
>>
>> Christian.
>>
>> Am 18.11.19 um 17:23 schrieb Andrey Grodzovsky:
>>> Can you explain why ? As I see it - list_empty_careful is 
>>> specifically designed for the case where the only other concurrent 
>>> operation in progress is list_del_init 
>>> (https://www.kernel.org/doc/htmldocs/kernel-api/API-list-empty-careful.html) 
>>> - which is exactly what happens in this patch, no other list 
>>> altering operation can take place concurrently - so it looks safe to 
>>> use for me.
>>>
>>> Andrey
>>>
>>> On 11/18/19 11:16 AM, Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> the only thing which doesn't looks so good is the switch to 
>>>> list_empty_careful in drm_sched_cleanup_jobs.
>>>>
>>>> We either take the lock here or we don't, but please not that extra 
>>>> checking.
>>>>
>>>> Christian.
>>>>
>>>> Am 18.11.19 um 15:07 schrieb Andrey Grodzovsky:
>>>>> Thanks Emily.
>>>>>
>>>>> Christan - ping for review.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 11/14/19 11:39 PM, Deng, Emily wrote:
>>>>>> Hi Andrey,
>>>>>>       Currently, I am busying with another issue, maybe will try 
>>>>>> next week.
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>> Sent: Friday, November 15, 2019 6:14 AM
>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>
>>>>>>> Attached.
>>>>>>>
>>>>>>> Emily - can you give it a try ?
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 11/14/19 3:12 AM, Christian König wrote:
>>>>>>>>> What about instead of peeking at the job to actually remove it 
>>>>>>>>> from
>>>>>>>>> ring_mirror_list right there,
>>>>>>>> Also an interesting idea. We would need to protect the mirror list
>>>>>>>> with a lock again, but that should be the lesser evil.
>>>>>>>>
>>>>>>>> Maybe prototype that and see if it works or not.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>>>>>>>
>>>>>>>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>>>>>>>> Another more fundamental question: Could we get rid of the 
>>>>>>>>>> timeout
>>>>>>>>>> job at all?
>>>>>>>>>
>>>>>>>>> There are other stuff there besides picking the first 
>>>>>>>>> unfinished job
>>>>>>>>> which is common for all the drivers - such as freeing guilty 
>>>>>>>>> signaled
>>>>>>>>> job and rearming the timeout work timer.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I mean we used to give this as parameter to the scheduler 
>>>>>>>>>> callback
>>>>>>>>>> because we had the timeout worker in the job, but that is no 
>>>>>>>>>> longer
>>>>>>>>>> the case.
>>>>>>>>>>
>>>>>>>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>>>>>>>          job = 
>>>>>>>>>>> list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>>>>>>                                         struct 
>>>>>>>>>>> drm_sched_job, node);
>>>>>>>>>> Why don't we just remove that here and only get the first job 
>>>>>>>>>> after
>>>>>>>>>> we have stopped the scheduler?
>>>>>>>>>
>>>>>>>>> Should be ok since we have the extra check for 
>>>>>>>>> __kthread_should_park
>>>>>>>>> in drm_sched_cleanup_jobs which will protect us in this case 
>>>>>>>>> from a
>>>>>>>>> wakeup of sched thread and execution of in drm_sched_cleanup_jobs
>>>>>>>>> after we already parked it. The problem here is we need the
>>>>>>>>> drm_sched_job to access the private data for each client 
>>>>>>>>> driver (see
>>>>>>>>> amdgpu_job_timedout for example). What about instead of 
>>>>>>>>> peeking at
>>>>>>>>> the job to actually remove it from ring_mirror_list right 
>>>>>>>>> there, go
>>>>>>>>> ahead with it through the reset routine, if it's signaled in the
>>>>>>>>> meanwhile that great - release it, otherwise put it back into
>>>>>>>>> ring_mirror_list in drm_sched_resubmit_jobs.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>>>>>>> This why I asked for a trace with timer enabled, but since 
>>>>>>>>>>> there is
>>>>>>>>>>> a finite number of places we touch the timer Emily can just put
>>>>>>>>>>> prints there. Also, I wonder if this temp fix helps her with 
>>>>>>>>>>> the
>>>>>>>>>>> issue or not.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>>>>>>>> The question is where do we rearm the timer for this 
>>>>>>>>>>>> problem to
>>>>>>>>>>>> occur?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>>>>>>>> simulate_crash.patch - waiting on guilty job to signal in 
>>>>>>>>>>>>> reset
>>>>>>>>>>>>> work and artificially rearming the timeout timer just 
>>>>>>>>>>>>> before the
>>>>>>>>>>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. 
>>>>>>>>>>>>> This I
>>>>>>>>>>>>> think confirms my theory i described earlier in this thread.
>>>>>>>>>>>>>
>>>>>>>>>>>>> basic_fix.patch handles this by testing whether another timer
>>>>>>>>>>>>> already armed ob this scheduler or is there a timeout work in
>>>>>>>>>>>>> execution right now (see documentation for work_busy) - 
>>>>>>>>>>>>> obviously
>>>>>>>>>>>>> this is not a full solution as this will not protect from 
>>>>>>>>>>>>> races
>>>>>>>>>>>>> if for example there is immediate work scheduling such as in
>>>>>>>>>>>>> drm_sched_fault -  so we probably need to account for this by
>>>>>>>>>>>>> making drm_sched_cleanup_jobs (at least in the part where it
>>>>>>>>>>>>> iterates ring mirror list and frees jobs) and GPU reset 
>>>>>>>>>>>>> really
>>>>>>>>>>>>> mutually exclusive and not like now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> you need to print which scheduler instance is freeing the 
>>>>>>>>>>>>>> jobs
>>>>>>>>>>>>>> and which one is triggering the reset. The TID and PID is
>>>>>>>>>>>>>> completely meaningless here since we are called from 
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apart from that I will look into this a bit deeper when I 
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>      I add the follow print in function 
>>>>>>>>>>>>>>> drm_sched_cleanup_jobs.
>>>>>>>>>>>>>>>  From the log it shows that only use cancel_delayed_work 
>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>> not avoid to free job when the sched is in reset. But don’t
>>>>>>>>>>>>>>> know exactly where it is wrong about the driver. Do you 
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> any suggestion about this?
>>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
>>>>>>>>>>>>>>> pid:%lu\n", current->tgid, current->pid);
>>>>>>>>>>>>>>>          /*
>>>>>>>>>>>>>>>           * Don't destroy jobs while the timeout worker is
>>>>>>>>>>>>>>> running  OR thread
>>>>>>>>>>>>>>>           * is being parked and hence assumed to not touch
>>>>>>>>>>>>>>> ring_mirror_list
>>>>>>>>>>>>>>>           */
>>>>>>>>>>>>>>>           if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>>>>>>>                  return;
>>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
>>>>>>>>>>>>>>> current->tgid, current->pid);
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>> ring
>>>>>>>>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587 Nov 12
>>>>>>>>>>>>>>> 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>>>>>>>>> Process information: process pid 0 thread pid 0,
>>>>>>>>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603 Nov 12 12:58:20
>>>>>>>>>>>>>>> ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process 
>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000fe75ab36, 
>>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253,
>>>>>>>>>>>>>>> pid:2253
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, 
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 
>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process 
>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>> process  pid 0 thread  pid 0, s_job:00000000f086ec84, 
>>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>>>>>>>>> Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, 
>>>>>>>>>>>>>>>> Emily
>>>>>>>>>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>> Thinking more about this claim - we assume here that if
>>>>>>>>>>>>>>> cancel_delayed_work
>>>>>>>>>>>>>>>> returned true it guarantees that timeout work is not 
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>> but, it merely
>>>>>>>>>>>>>>>> means there was a pending timeout work which was removed
>>>>>>> from
>>>>>>>>>>>>>>>> the workqueue before it's timer elapsed and so it 
>>>>>>>>>>>>>>>> didn't have
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> chance to be
>>>>>>>>>>>>>>>> dequeued and executed, it doesn't cover already executing
>>>>>>>>>>>>>>> work. So there is a
>>>>>>>>>>>>>>>> possibility where while timeout work started executing 
>>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>> timeout work
>>>>>>>>>>>>>>>> already got enqueued (maybe through earlier cleanup 
>>>>>>>>>>>>>>>> jobs or
>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>> drm_sched_fault) and if at this point another
>>>>>>>>>>>>>>> drm_sched_cleanup_jobs runs
>>>>>>>>>>>>>>>> cancel_delayed_work(&sched->work_tdr) will return true 
>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>> while there is a
>>>>>>>>>>>>>>>> timeout job in progress.
>>>>>>>>>>>>>>>> Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>>>>>>>>> cancel_delayed_work_sync to flush the timeout work as 
>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>> work itself
>>>>>>>>>>>>>>>> waits for schedule thread  to be parked again when calling
>>>>>>>>>>>>>>> park_thread.
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>> Sent: 08 November 2019 05:35:18
>>>>>>>>>>>>>>>> To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> exactly that can't happen. See here:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>           /* Don't destroy jobs while the timeout 
>>>>>>>>>>>>>>>>> worker is
>>>>>>>>>>>>>>> running */
>>>>>>>>>>>>>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>>>>>>>>>>                   return NULL;
>>>>>>>>>>>>>>>> We never free jobs while the timeout working is running to
>>>>>>>>>>>>>>> prevent exactly
>>>>>>>>>>>>>>>> that issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>         The drm_sched_job_timedout-> 
>>>>>>>>>>>>>>>>> amdgpu_job_timedout call
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover. I mean the main scheduler 
>>>>>>>>>>>>>>>> free the
>>>>>>>>>>>>>>> jobs while
>>>>>>>>>>>>>>>> in amdgpu_device_gpu_recover, and before calling
>>>>>>> drm_sched_stop.
>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer 
>>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> well who is calling amdgpu_device_gpu_recover() in 
>>>>>>>>>>>>>>>>>> this case?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When it's not the scheduler we shouldn't have a 
>>>>>>>>>>>>>>>>>> guilty job
>>>>>>>>>>>>>>> in the first place.
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>> Hi Chrisitan,
>>>>>>>>>>>>>>>>>>>          No, I am with the new branch and also has the
>>>>>>>>>>>>>>> patch. Even it
>>>>>>>>>>>>>>>>>>> are freed by
>>>>>>>>>>>>>>>>>> main scheduler, how we could avoid main scheduler to 
>>>>>>>>>>>>>>>>>> free
>>>>>>>>>>>>>>> jobs while
>>>>>>>>>>>>>>>>>> enter to function amdgpu_device_gpu_recover?
>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> in this case you are on an old code branch.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Jobs are freed now by the main scheduler thread and 
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> if no
>>>>>>>>>>>>>>>>>>>> timeout handler is running.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> See this patch here:
>>>>>>>>>>>>>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>>>>>>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>>>>>           Please refer to follow log, when it 
>>>>>>>>>>>>>>>>>>>>> enter to
>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover
>>>>>>>>>>>>>>>>>>>> function, the bad job 000000005086879e is freeing in
>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>> amdgpu_job_free_cb at the same time, because of the
>>>>>>>>>>>>>>> hardware fence
>>>>>>>>>>>>>>>>>> signal.
>>>>>>>>>>>>>>>>>>>> But amdgpu_device_gpu_recover goes faster, at this 
>>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>> the s_fence is already freed, but job is not freed 
>>>>>>>>>>>>>>>>>>>> in time.
>>>>>>>>>>>>>>> Then this issue
>>>>>>>>>>>>>>>> occurs.
>>>>>>>>>>>>>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>>>>>>>>>> *ERROR* ring
>>>>>>>>>>>>>>>>>> sdma0
>>>>>>>>>>>>>>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [
>>>>>>>>>>>>>>>>>>>>> 449.793202] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>> *ERROR*
>>>>>>>>>>>>>>>>>>>>> Process
>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>>> process pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>>>>>>>>> 449.794163]
>>>>>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>>>> process pid 0 thread  pid 0, s_job:000000005086879e [
>>>>>>>>>>>>>>> 449.794221]
>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information: process
>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>> thread pid 0,
>>>>>>>>>>>>>>>>>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>>>>>> information: process pid 0 thread pid 0,
>>>>>>>>>>>>>>> s_job:00000000ed3a5ac6 [
>>>>>>>>>>>>>>>>>>>> 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>>>>>>>>>> dereference
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0
>>>>>>> [  449.801040]
>>>>>>>>>>>>>>> Oops:
>>>>>>>>>>>>>>>>>>>> 0000 [#1] SMP PTI
>>>>>>>>>>>>>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 
>>>>>>>>>>>>>>>>>>>>> Tainted:
>>>>>>>>>>>>>>> G OE
>>>>>>>>>>>>>>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>>>>>>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX
>>>>>>> +
>>>>>>>>>>>>>>> PIIX,
>>>>>>>>>>>>>>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [
>>>>>>>>>>>>>>>>>>>>> 449.802944]
>>>>>>>>>>>>>>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>> [
>>>>>>>>>>>>>>>>>>>>> 449.803488]
>>>>>>>>>>>>>>>>>> RIP:
>>>>>>>>>>>>>>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>>>>>>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 
>>>>>>>>>>>>>>>>>>>>> 0f 85
>>>>>>>>>>>>>>> 56 ff ff
>>>>>>>>>>>>>>>>>>>>> ff
>>>>>>>>>>>>>>>>>>>>> 45 85 e4 0f
>>>>>>>>>>>>>>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 
>>>>>>>>>>>>>>>>>>>> 00 48
>>>>>>>>>>>>>>> 8b 40 10
>>>>>>>>>>>>>>>>>>>> <48> 8b
>>>>>>>>>>>>>>>>>> 98
>>>>>>>>>>>>>>>>>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43
>>>>>>>>>>>>>>> 48 a8 01
>>>>>>>>>>>>>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS:
>>>>>>>>>>>>>>> 00010286 [
>>>>>>>>>>>>>>>>>>>>> 449.806032] RAX: 0000000000000000 RBX:
>>>>>>> 0000000000000000
>>>>>>>>>>>>>>> RCX:
>>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0
>>>>>>> RSI:
>>>>>>>>>>>>>>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224]
>>>>>>> RBP:
>>>>>>>>>>>>>>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09:
>>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>>>> 449.807818] R10: 0000000000000000 R11:
>>>>>>> 0000000000000148
>>>>>>>>>>>>>>> R12:
>>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0
>>>>>>> R14:
>>>>>>>>>>>>>>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 
>>>>>>>>>>>>>>>>>>>>> 449.809004] FS:
>>>>>>>>>>>>>>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>>>>>>>>>>>>>>> knlGS:0000000000000000 [ 449.809674] CS:  0010 DS: 
>>>>>>>>>>>>>>>>>>>>> 0000
>>>>>>>>>>>>>>> ES: 0000
>>>>>>>>>>>>>>>> CR0:
>>>>>>>>>>>>>>>>>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0
>>>>>>> CR3:
>>>>>>>>>>>>>>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747]
>>>>>>> DR0:
>>>>>>>>>>>>>>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>>> 449.811344] DR3: 0000000000000000 DR6:
>>>>>>> 00000000fffe0ff0 DR7:
>>>>>>>>>>>>>>>>>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>>>>>>>>>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140
>>>>>>> [amdgpu]
>>>>>>>>>>>>>>>>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>>>>>>>> [amd_sched] [ 449.813139] ?
>>>>>>>>>>>>>>>>>>>>> amdgpu_cgs_destroy_device+0x10/0x10
>>>>>>>>>>>>>>> [amdgpu] [
>>>>>>>>>>>>>>>>>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>> [amd_sched] [
>>>>>>>>>>>>>>>>>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 
>>>>>>>>>>>>>>>>>>>>> 449.814417]
>>>>>>>>>>>>>>>>>>>>> worker_thread+0x34/0x410 [ 449.814728]
>>>>>>>>>>>>>>> kthread+0x121/0x140 [
>>>>>>>>>>>>>>>>>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [
>>>>>>>>>>>>>>> 449.815374]  ?
>>>>>>>>>>>>>>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>>>>>>>>>>>>>> [  449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null 
>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>>>>>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I can't follow how it would be possible for
>>>>>>>>>>>>>>> job->s_fence to be
>>>>>>>>>>>>>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So it looks like this patch is just papering over 
>>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>> bigger issues.
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null 
>>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>>>>>>>>>>>>>> You need to give me at least enough time to 
>>>>>>>>>>>>>>>>>>>>>>>> wake up
>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>> From: amd-gfx
>>>>>>>>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>>>>>>>>>>>> Behalf
>>>>>>>>>>>>>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>>>>>>>>>>>>>> To: Koenig, Christian 
>>>>>>>>>>>>>>>>>>>>>>>>>> <Christian.Koenig at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>>> From: Christian König
>>>>>>>>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> When the job is already signaled, the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> s_fence is
>>>>>>>>>>>>>>> freed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then it will has null pointer in
>>>>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when 
>>>>>>>>>>>>>>>>>>>>>>>>>>> the job
>>>>>>>>>>>>>>> is destroyed.
>>>>>>>>>>>>>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>>>>>>>>>>>>>> I know it is set to NULL in 
>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup.
>>>>>>>>>>>>>>> But in one
>>>>>>>>>>>>>>>>>>>>>>>>>> case, when it enter into the
>>>>>>>>>>>>>>> amdgpu_device_gpu_recover, it
>>>>>>>>>>>>>>>>>>>>>>>>>> already in drm_sched_job_cleanup, and at this 
>>>>>>>>>>>>>>>>>>>>>>>>>> time,
>>>>>>>>>>>>>>> it will
>>>>>>>>>>>>>>>>>>>>>>>>>> go to free
>>>>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is
>>>>>>>>>>>>>>> faster. At
>>>>>>>>>>>>>>>>>>>>>>>>>> that time, job is not freed, but s_fence is 
>>>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>> NULL.
>>>>>>>>>>>>>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, 
>>>>>>>>>>>>>>>>>>>>>>>>> s_job->s_priority);
>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>>>>>>>>>>>>>> The job itself is freed up directly after 
>>>>>>>>>>>>>>>>>>>>>>>> freeing the
>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>> s_fence.
>>>>>>>>>>>>>>>>>>>>>>>> So you are just papering over a much bigger 
>>>>>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>> here. This
>>>>>>>>>>>>>>>>>>>>>>>> patch is a clear NAK.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> When you see a job without an s_fence then that
>>>>>>>>>>>>>>> means the
>>>>>>>>>>>>>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> |  2
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +- drivers/gpu/drm/scheduler/sched_main.c |
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 11
>>>>>>>>>>>>>>> ++++++---
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6
>>>>>>>>>>>>>>> deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +++
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover(struct
>>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              * job->base holds a reference to
>>>>>>>>>>>>>>> parent fence
>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -  if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +  if (job && job->base.s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + job->base.s_fence->parent
>>>>>>>>>>>>>>>>>>>>>> &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence-
>>>>>>>> parent))
>>>>>>>>>>>>>>>>>>>>>>>>>>>> job_signaled = true;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>>>>>>>>>>>>>>>> drm_sched_increase_karma(struct
>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(entity, tmp,
>>>>>>>>>>>>>>>>>> &rq-
>>>>>>>>>>>>>>>>>>>>>>> entities,
>>>>>>>>>>>>>>>>>>>>>>>>>>> list) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - if
>>>>>>>>>>>>>>> (bad->s_fence->scheduled.context
>>>>>>>>>>>>>>>>>>>>>> ==
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - entity->fence_context) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +                          if (bad->s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + (bad->s_fence-
>>>>>>>>>>>>>>>>>>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + entity->fence_context)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (atomic_read(&bad-
>>>>>>>>>>>>>>>>>>>>>>> karma) >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> bad->sched-
>>>>>>>>>>>>>>>>>>> hang_limit)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (entity-
>>>>>>>>>>>>>>>>>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct
>>>>>>>>>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * This iteration is thread safe as
>>>>>>>>>>>>>>> sched thread
>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> stopped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -          if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +          if (s_job->s_fence &&
>>>>>>>>>>>>>>> s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_remove_callback(s_job-
>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>> parent,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> &s_job->cb)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>> @@ -
>>>>>>>>>>>>>>>>>>>>>> 395,7
>>>>>>>>>>>>>>>>>>>>>>>>>> +395,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * Job is still alive so fence refcount at
>>>>>>>>>>>>>>>>>> least 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>>>>>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + if (s_job->s_fence)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>>>>>>>>>>>>>>>>>>>>>>> finished,
>>>>>>>>>>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * We must keep bad job alive for later
>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>> during @@
>>>>>>>>>>>>>>>>>>>>>>>>>>> -438,7
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct
>>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>> *sched,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +bool
>>>>>>>>>>>>>>>>>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - struct dma_fence *fence =
>>>>>>>>>>>>>>> s_job->s_fence->parent;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + struct dma_fence *fence =
>>>>>>>>>>>>>>> s_job->s_fence ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> + s_job-
>>>>>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>>>>>>> parent :
>>>>>>>>>>>>>>>>>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx mailing list 
>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx