[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Mon Nov 18 21:05:04 UTC 2019
Ye, I noticed - I just relied on the documentation in this case.
Andrey
On 11/18/19 3:01 PM, Christian König wrote:
> Well then we should probably update the documentation.
>
> Take a look at the implementation, there is no compiler or SMP barrier
> at all.
>
> Christian.
>
> Am 18.11.19 um 18:01 schrieb Andrey Grodzovsky:
>> The documentation states it can be used safely with concurrent
>> list_del_init so I assume it's true - but I think my even bigger
>> mistake is that without locking i just do list_first_entry right
>> after list_empty_careful and by this can grab pointer to the same job
>> as concurrent drm_sched_job_timedout->list_first_entry_or_null - so
>> yes I see now i have to use locking as you advised there and then i
>> don't need the list_empty_careful.
>>
>> Andrey
>>
>> On 11/18/19 11:44 AM, Christian König wrote:
>>> list_empty_careful() should only be used for optimizing cases, but
>>> never if you need to rely on the result.
>>>
>>> The problem is that the function doesn't has any memory barriers
>>> whatsoever, it just checks if the next and prev pointer are both
>>> empty instead of just the next pointer.
>>>
>>> Christian.
>>>
>>> Am 18.11.19 um 17:23 schrieb Andrey Grodzovsky:
>>>> Can you explain why ? As I see it - list_empty_careful is
>>>> specifically designed for the case where the only other concurrent
>>>> operation in progress is list_del_init
>>>> (https://www.kernel.org/doc/htmldocs/kernel-api/API-list-empty-careful.html)
>>>> - which is exactly what happens in this patch, no other list
>>>> altering operation can take place concurrently - so it looks safe
>>>> to use for me.
>>>>
>>>> Andrey
>>>>
>>>> On 11/18/19 11:16 AM, Christian König wrote:
>>>>> Hi Andrey,
>>>>>
>>>>> the only thing which doesn't looks so good is the switch to
>>>>> list_empty_careful in drm_sched_cleanup_jobs.
>>>>>
>>>>> We either take the lock here or we don't, but please not that
>>>>> extra checking.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 18.11.19 um 15:07 schrieb Andrey Grodzovsky:
>>>>>> Thanks Emily.
>>>>>>
>>>>>> Christan - ping for review.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 11/14/19 11:39 PM, Deng, Emily wrote:
>>>>>>> Hi Andrey,
>>>>>>> Currently, I am busying with another issue, maybe will try
>>>>>>> next week.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>> Sent: Friday, November 15, 2019 6:14 AM
>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Attached.
>>>>>>>>
>>>>>>>> Emily - can you give it a try ?
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 11/14/19 3:12 AM, Christian König wrote:
>>>>>>>>>> What about instead of peeking at the job to actually remove
>>>>>>>>>> it from
>>>>>>>>>> ring_mirror_list right there,
>>>>>>>>> Also an interesting idea. We would need to protect the mirror
>>>>>>>>> list
>>>>>>>>> with a lock again, but that should be the lesser evil.
>>>>>>>>>
>>>>>>>>> Maybe prototype that and see if it works or not.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>>>>>>>>
>>>>>>>>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>>>>>>>>> Another more fundamental question: Could we get rid of the
>>>>>>>>>>> timeout
>>>>>>>>>>> job at all?
>>>>>>>>>>
>>>>>>>>>> There are other stuff there besides picking the first
>>>>>>>>>> unfinished job
>>>>>>>>>> which is common for all the drivers - such as freeing guilty
>>>>>>>>>> signaled
>>>>>>>>>> job and rearming the timeout work timer.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I mean we used to give this as parameter to the scheduler
>>>>>>>>>>> callback
>>>>>>>>>>> because we had the timeout worker in the job, but that is no
>>>>>>>>>>> longer
>>>>>>>>>>> the case.
>>>>>>>>>>>
>>>>>>>>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>>>>>>>>> job =
>>>>>>>>>>>> list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>>>>>>> struct drm_sched_job, node);
>>>>>>>>>>> Why don't we just remove that here and only get the first
>>>>>>>>>>> job after
>>>>>>>>>>> we have stopped the scheduler?
>>>>>>>>>>
>>>>>>>>>> Should be ok since we have the extra check for
>>>>>>>>>> __kthread_should_park
>>>>>>>>>> in drm_sched_cleanup_jobs which will protect us in this case
>>>>>>>>>> from a
>>>>>>>>>> wakeup of sched thread and execution of in
>>>>>>>>>> drm_sched_cleanup_jobs
>>>>>>>>>> after we already parked it. The problem here is we need the
>>>>>>>>>> drm_sched_job to access the private data for each client
>>>>>>>>>> driver (see
>>>>>>>>>> amdgpu_job_timedout for example). What about instead of
>>>>>>>>>> peeking at
>>>>>>>>>> the job to actually remove it from ring_mirror_list right
>>>>>>>>>> there, go
>>>>>>>>>> ahead with it through the reset routine, if it's signaled in the
>>>>>>>>>> meanwhile that great - release it, otherwise put it back into
>>>>>>>>>> ring_mirror_list in drm_sched_resubmit_jobs.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> This why I asked for a trace with timer enabled, but since
>>>>>>>>>>>> there is
>>>>>>>>>>>> a finite number of places we touch the timer Emily can just
>>>>>>>>>>>> put
>>>>>>>>>>>> prints there. Also, I wonder if this temp fix helps her
>>>>>>>>>>>> with the
>>>>>>>>>>>> issue or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>>>>>>>>> The question is where do we rearm the timer for this
>>>>>>>>>>>>> problem to
>>>>>>>>>>>>> occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>>>>>>>>> simulate_crash.patch - waiting on guilty job to signal in
>>>>>>>>>>>>>> reset
>>>>>>>>>>>>>> work and artificially rearming the timeout timer just
>>>>>>>>>>>>>> before the
>>>>>>>>>>>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>>>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log.
>>>>>>>>>>>>>> This I
>>>>>>>>>>>>>> think confirms my theory i described earlier in this thread.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> basic_fix.patch handles this by testing whether another
>>>>>>>>>>>>>> timer
>>>>>>>>>>>>>> already armed ob this scheduler or is there a timeout
>>>>>>>>>>>>>> work in
>>>>>>>>>>>>>> execution right now (see documentation for work_busy) -
>>>>>>>>>>>>>> obviously
>>>>>>>>>>>>>> this is not a full solution as this will not protect from
>>>>>>>>>>>>>> races
>>>>>>>>>>>>>> if for example there is immediate work scheduling such as in
>>>>>>>>>>>>>> drm_sched_fault - so we probably need to account for
>>>>>>>>>>>>>> this by
>>>>>>>>>>>>>> making drm_sched_cleanup_jobs (at least in the part where it
>>>>>>>>>>>>>> iterates ring mirror list and frees jobs) and GPU reset
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>> mutually exclusive and not like now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/11/19 4:11 PM, Christian König wrote:
>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> you need to print which scheduler instance is freeing
>>>>>>>>>>>>>>> the jobs
>>>>>>>>>>>>>>> and which one is triggering the reset. The TID and PID is
>>>>>>>>>>>>>>> completely meaningless here since we are called from
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> worker threads and the TID/PID can change on each call.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Apart from that I will look into this a bit deeper when
>>>>>>>>>>>>>>> I have
>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily:
>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>> I add the follow print in function
>>>>>>>>>>>>>>>> drm_sched_cleanup_jobs.
>>>>>>>>>>>>>>>> From the log it shows that only use
>>>>>>>>>>>>>>>> cancel_delayed_work could
>>>>>>>>>>>>>>>> not avoid to free job when the sched is in reset. But
>>>>>>>>>>>>>>>> don’t
>>>>>>>>>>>>>>>> know exactly where it is wrong about the driver. Do you
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> any suggestion about this?
>>>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
>>>>>>>>>>>>>>>> pid:%lu\n", current->tgid, current->pid);
>>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>> * Don't destroy jobs while the timeout worker is
>>>>>>>>>>>>>>>> running OR thread
>>>>>>>>>>>>>>>> * is being parked and hence assumed to not touch
>>>>>>>>>>>>>>>> ring_mirror_list
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr)))
>>>>>>>>>>>>>>>> return;
>>>>>>>>>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu,
>>>>>>>>>>>>>>>> pid:%lu\n",
>>>>>>>>>>>>>>>> current->tgid, current->pid);
>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11380.695091]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11380.695104]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11380.695107]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>> ring
>>>>>>>>>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587
>>>>>>>>>>>>>>>> Nov 12
>>>>>>>>>>>>>>>> 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
>>>>>>>>>>>>>>>> Process information: process pid 0 thread pid 0,
>>>>>>>>>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603 Nov 12
>>>>>>>>>>>>>>>> 12:58:20
>>>>>>>>>>>>>>>> ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel:
>>>>>>>>>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225417]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225425]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>> process pid 0 thread pid 0, s_job:00000000fe75ab36,
>>>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225429]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225473]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2253,
>>>>>>>>>>>>>>>> pid:2253
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225486]
>>>>>>>>>>>>>>>> Emily:drm_sched_cleanup_jobs:begin,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02
>>>>>>>>>>>>>>>> kernel:
>>>>>>>>>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>> process pid 0 thread pid 0, s_job:00000000f086ec84,
>>>>>>>>>>>>>>>> tid:2262,
>>>>>>>>>>>>>>>> pid:2262
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>>>>>>>>>>>> Sent: Tuesday, November 12, 2019 11:28 AM
>>>>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>;
>>>>>>>>>>>>>>>>> Deng, Emily
>>>>>>>>>>>>>>>>> <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>> Thinking more about this claim - we assume here that if
>>>>>>>>>>>>>>>> cancel_delayed_work
>>>>>>>>>>>>>>>>> returned true it guarantees that timeout work is not
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> but, it merely
>>>>>>>>>>>>>>>>> means there was a pending timeout work which was removed
>>>>>>>> from
>>>>>>>>>>>>>>>>> the workqueue before it's timer elapsed and so it
>>>>>>>>>>>>>>>>> didn't have
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> chance to be
>>>>>>>>>>>>>>>>> dequeued and executed, it doesn't cover already executing
>>>>>>>>>>>>>>>> work. So there is a
>>>>>>>>>>>>>>>>> possibility where while timeout work started executing
>>>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>>> timeout work
>>>>>>>>>>>>>>>>> already got enqueued (maybe through earlier cleanup
>>>>>>>>>>>>>>>>> jobs or
>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>> drm_sched_fault) and if at this point another
>>>>>>>>>>>>>>>> drm_sched_cleanup_jobs runs
>>>>>>>>>>>>>>>>> cancel_delayed_work(&sched->work_tdr) will return true
>>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>>> while there is a
>>>>>>>>>>>>>>>>> timeout job in progress.
>>>>>>>>>>>>>>>>> Unfortunately we cannot change cancel_delayed_work to
>>>>>>>>>>>>>>>>> cancel_delayed_work_sync to flush the timeout work as
>>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>> work itself
>>>>>>>>>>>>>>>>> waits for schedule thread to be parked again when
>>>>>>>>>>>>>>>>> calling
>>>>>>>>>>>>>>>> park_thread.
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>> Sent: 08 November 2019 05:35:18
>>>>>>>>>>>>>>>>> To: Deng, Emily; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> exactly that can't happen. See here:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /* Don't destroy jobs while the timeout worker is
>>>>>>>>>>>>>>>> running */
>>>>>>>>>>>>>>>>>> if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>>>>>>>>>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>>>>>>>>>>>>>>> return NULL;
>>>>>>>>>>>>>>>>> We never free jobs while the timeout working is
>>>>>>>>>>>>>>>>> running to
>>>>>>>>>>>>>>>> prevent exactly
>>>>>>>>>>>>>>>>> that issue.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>> The drm_sched_job_timedout->
>>>>>>>>>>>>>>>>>> amdgpu_job_timedout call
>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover. I mean the main scheduler
>>>>>>>>>>>>>>>>> free the
>>>>>>>>>>>>>>>> jobs while
>>>>>>>>>>>>>>>>> in amdgpu_device_gpu_recover, and before calling
>>>>>>>> drm_sched_stop.
>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> well who is calling amdgpu_device_gpu_recover() in
>>>>>>>>>>>>>>>>>>> this case?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When it's not the scheduler we shouldn't have a
>>>>>>>>>>>>>>>>>>> guilty job
>>>>>>>>>>>>>>>> in the first place.
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>> Hi Chrisitan,
>>>>>>>>>>>>>>>>>>>> No, I am with the new branch and also has the
>>>>>>>>>>>>>>>> patch. Even it
>>>>>>>>>>>>>>>>>>>> are freed by
>>>>>>>>>>>>>>>>>>> main scheduler, how we could avoid main scheduler to
>>>>>>>>>>>>>>>>>>> free
>>>>>>>>>>>>>>>> jobs while
>>>>>>>>>>>>>>>>>>> enter to function amdgpu_device_gpu_recover?
>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>>>>>>> Hi Emily,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> in this case you are on an old code branch.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Jobs are freed now by the main scheduler thread
>>>>>>>>>>>>>>>>>>>>> and only
>>>>>>>>>>>>>>>> if no
>>>>>>>>>>>>>>>>>>>>> timeout handler is running.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> See this patch here:
>>>>>>>>>>>>>>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>>>>>>>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> drm/scheduler: rework job destruction
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>>>>>>>>> Please refer to follow log, when it
>>>>>>>>>>>>>>>>>>>>>> enter to
>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover
>>>>>>>>>>>>>>>>>>>>> function, the bad job 000000005086879e is freeing in
>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>> amdgpu_job_free_cb at the same time, because of the
>>>>>>>>>>>>>>>> hardware fence
>>>>>>>>>>>>>>>>>>> signal.
>>>>>>>>>>>>>>>>>>>>> But amdgpu_device_gpu_recover goes faster, at this
>>>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>>> the s_fence is already freed, but job is not freed
>>>>>>>>>>>>>>>>>>>>> in time.
>>>>>>>>>>>>>>>> Then this issue
>>>>>>>>>>>>>>>>> occurs.
>>>>>>>>>>>>>>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>>>>>>>>>>> *ERROR* ring
>>>>>>>>>>>>>>>>>>> sdma0
>>>>>>>>>>>>>>>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [
>>>>>>>>>>>>>>>>>>>>>> 449.793202] [drm:amdgpu_job_timedout [amdgpu]]
>>>>>>>> *ERROR*
>>>>>>>>>>>>>>>>>>>>>> Process
>>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>>>> process pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>>>>>>>>>> 449.794163]
>>>>>>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>>>>>>>>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>> information:
>>>>>>>>>>>>>>>>>>>>>> process pid 0 thread pid 0,
>>>>>>>>>>>>>>>>>>>>>> s_job:000000005086879e [
>>>>>>>>>>>>>>>> 449.794221]
>>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222]
>>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255]
>>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257]
>>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>> pid 0
>>>>>>>>>>>>>>>>>>>>>> thread pid 0,
>>>>>>>>>>>>>>>>>>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>>>>>>>>>>>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>>>>>>>>>>>>>>>> information: process pid 0 thread pid 0,
>>>>>>>>>>>>>>>> s_job:00000000ed3a5ac6 [
>>>>>>>>>>>>>>>>>>>>> 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>>>>>>>>>>> dereference
>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0
>>>>>>>> [ 449.801040]
>>>>>>>>>>>>>>>> Oops:
>>>>>>>>>>>>>>>>>>>>> 0000 [#1] SMP PTI
>>>>>>>>>>>>>>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1
>>>>>>>>>>>>>>>>>>>>>> Tainted:
>>>>>>>>>>>>>>>> G OE
>>>>>>>>>>>>>>>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>>>>>>>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC
>>>>>>>>>>>>>>>>>>>>>> (i440FX
>>>>>>>> +
>>>>>>>>>>>>>>>> PIIX,
>>>>>>>>>>>>>>>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [
>>>>>>>>>>>>>>>>>>>>>> 449.802944]
>>>>>>>>>>>>>>>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>> [
>>>>>>>>>>>>>>>>>>>>>> 449.803488]
>>>>>>>>>>>>>>>>>>> RIP:
>>>>>>>>>>>>>>>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>>>>>>>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8
>>>>>>>>>>>>>>>>>>>>>> 0f 85
>>>>>>>>>>>>>>>> 56 ff ff
>>>>>>>>>>>>>>>>>>>>>> ff
>>>>>>>>>>>>>>>>>>>>>> 45 85 e4 0f
>>>>>>>>>>>>>>>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00
>>>>>>>>>>>>>>>>>>>>> 00 48
>>>>>>>>>>>>>>>> 8b 40 10
>>>>>>>>>>>>>>>>>>>>> <48> 8b
>>>>>>>>>>>>>>>>>>> 98
>>>>>>>>>>>>>>>>>>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43
>>>>>>>>>>>>>>>> 48 a8 01
>>>>>>>>>>>>>>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS:
>>>>>>>>>>>>>>>> 00010286 [
>>>>>>>>>>>>>>>>>>>>>> 449.806032] RAX: 0000000000000000 RBX:
>>>>>>>> 0000000000000000
>>>>>>>>>>>>>>>> RCX:
>>>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0
>>>>>>>> RSI:
>>>>>>>>>>>>>>>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224]
>>>>>>>> RBP:
>>>>>>>>>>>>>>>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09:
>>>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>>>>> 449.807818] R10: 0000000000000000 R11:
>>>>>>>> 0000000000000148
>>>>>>>>>>>>>>>> R12:
>>>>>>>>>>>>>>>>>>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0
>>>>>>>> R14:
>>>>>>>>>>>>>>>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [
>>>>>>>>>>>>>>>>>>>>>> 449.809004] FS:
>>>>>>>>>>>>>>>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>>>>>>>>>>>>>>>> knlGS:0000000000000000 [ 449.809674] CS: 0010
>>>>>>>>>>>>>>>>>>>>>> DS: 0000
>>>>>>>>>>>>>>>> ES: 0000
>>>>>>>>>>>>>>>>> CR0:
>>>>>>>>>>>>>>>>>>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0
>>>>>>>> CR3:
>>>>>>>>>>>>>>>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747]
>>>>>>>> DR0:
>>>>>>>>>>>>>>>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>>>>>>>>>>> 0000000000000000 [
>>>>>>>>>>>>>>>>>>>>> 449.811344] DR3: 0000000000000000 DR6:
>>>>>>>> 00000000fffe0ff0 DR7:
>>>>>>>>>>>>>>>>>>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>>>>>>>>>>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140
>>>>>>>> [amdgpu]
>>>>>>>>>>>>>>>>>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>>>>>>>>> [amd_sched] [ 449.813139] ?
>>>>>>>>>>>>>>>>>>>>>> amdgpu_cgs_destroy_device+0x10/0x10
>>>>>>>>>>>>>>>> [amdgpu] [
>>>>>>>>>>>>>>>>>>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90
>>>>>>>>>>>>>>>> [amd_sched] [
>>>>>>>>>>>>>>>>>>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [
>>>>>>>>>>>>>>>>>>>>>> 449.814417]
>>>>>>>>>>>>>>>>>>>>>> worker_thread+0x34/0x410 [ 449.814728]
>>>>>>>>>>>>>>>> kthread+0x121/0x140 [
>>>>>>>>>>>>>>>>>>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [
>>>>>>>>>>>>>>>> 449.815374] ?
>>>>>>>>>>>>>>>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>>>>>>>>>>>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>>>>>>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I can't follow how it would be possible for
>>>>>>>>>>>>>>>> job->s_fence to be
>>>>>>>>>>>>>>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> So it looks like this patch is just papering
>>>>>>>>>>>>>>>>>>>>>>> over some
>>>>>>>>>>>>>>>> bigger issues.
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>> From: Koenig, Christian
>>>>>>>>>>>>>>>>>>>>>>>>> <Christian.Koenig at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-
>>>>>>>>>>>>>>>>>>>>> gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>> issue for
>>>>>>>>>>>>>>>>>>>>>>>>> tdr
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>>>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>>>>>>>>>>>>>>> You need to give me at least enough time to
>>>>>>>>>>>>>>>>>>>>>>>>> wake up
>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>>> From: amd-gfx
>>>>>>>>>>>>>>>> <amd-gfx-bounces at lists.freedesktop.org> On
>>>>>>>>>>>>>>>>>>> Behalf
>>>>>>>>>>>>>>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>>>>>>>>>>>>>>> To: Koenig, Christian
>>>>>>>>>>>>>>>>>>>>>>>>>>> <Christian.Koenig at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>>>> From: Christian König
>>>>>>>>>>>>>>>> <ckoenig.leichtzumerken at gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng at amd.com>;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null
>>>>>>>>>>>>>>>> pointer issue
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> When the job is already signaled, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> s_fence is
>>>>>>>>>>>>>>>> freed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then it will has null pointer in
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the job
>>>>>>>>>>>>>>>> is destroyed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>>>>>>>>>>>>>>> I know it is set to NULL in
>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup.
>>>>>>>>>>>>>>>> But in one
>>>>>>>>>>>>>>>>>>>>>>>>>>> case, when it enter into the
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover, it
>>>>>>>>>>>>>>>>>>>>>>>>>>> already in drm_sched_job_cleanup, and at
>>>>>>>>>>>>>>>>>>>>>>>>>>> this time,
>>>>>>>>>>>>>>>> it will
>>>>>>>>>>>>>>>>>>>>>>>>>>> go to free
>>>>>>>>>>>>>>>>>>>>> job.
>>>>>>>>>>>>>>>>>>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is
>>>>>>>>>>>>>>>> faster. At
>>>>>>>>>>>>>>>>>>>>>>>>>>> that time, job is not freed, but s_fence is
>>>>>>>>>>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>> NULL.
>>>>>>>>>>>>>>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_ring_priority_put(ring,
>>>>>>>>>>>>>>>>>>>>>>>>>> s_job->s_priority);
>>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>>>>>>>>>>>>>>> The job itself is freed up directly after
>>>>>>>>>>>>>>>>>>>>>>>>> freeing the
>>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>> s_fence.
>>>>>>>>>>>>>>>>>>>>>>>>> So you are just papering over a much bigger
>>>>>>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>> here. This
>>>>>>>>>>>>>>>>>>>>>>>>> patch is a clear NAK.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> When you see a job without an s_fence then
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> means the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <Emily.Deng at amd.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> | 2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +- drivers/gpu/drm/scheduler/sched_main.c |
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 11
>>>>>>>>>>>>>>>> ++++++---
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6
>>>>>>>>>>>>>>>> deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +++
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover(struct
>>>>>>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * job->base holds a reference to
>>>>>>>>>>>>>>>> parent fence
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + if (job && job->base.s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + job->base.s_fence->parent
>>>>>>>>>>>>>>>>>>>>>>> &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence-
>>>>>>>>> parent))
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> job_signaled = true;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void
>>>>>>>>>>>>>>>>> drm_sched_increase_karma(struct
>>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_job
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(entity, tmp,
>>>>>>>>>>>>>>>>>>> &rq-
>>>>>>>>>>>>>>>>>>>>>>>> entities,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - if
>>>>>>>>>>>>>>>> (bad->s_fence->scheduled.context
>>>>>>>>>>>>>>>>>>>>>>> ==
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - entity->fence_context) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (bad->s_fence &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + (bad->s_fence-
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + entity->fence_context)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (atomic_read(&bad-
>>>>>>>>>>>>>>>>>>>>>>>> karma) >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> bad->sched-
>>>>>>>>>>>>>>>>>>>> hang_limit)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (entity-
>>>>>>>>>>>>>>>>>>>> guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct
>>>>>>>>>>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * This iteration is thread safe as
>>>>>>>>>>>>>>>> sched thread
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> stopped.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> &sched- ring_mirror_list, node) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + if (s_job->s_fence &&
>>>>>>>>>>>>>>>> s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dma_fence_remove_callback(s_job-
>>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>>> parent,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> &s_job->cb)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>> @@ -
>>>>>>>>>>>>>>>>>>>>>>> 395,7
>>>>>>>>>>>>>>>>>>>>>>>>>>> +395,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * Job is still alive so fence refcount at
>>>>>>>>>>>>>>>>>>> least 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished,
>>>>>>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + if (s_job->s_fence)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence-
>>>>>>>>>>>>>>>>>>>>>>>> finished,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> false);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * We must keep bad job alive for later
>>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>>>> during @@
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -438,7
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct
>>>>>>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>>>>>>>>>>>> *sched,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +bool
>>>>>>>>>>>>>>>>>>>>>>>>>>>> full_recovery)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - struct dma_fence *fence =
>>>>>>>>>>>>>>>> s_job->s_fence->parent;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + struct dma_fence *fence =
>>>>>>>>>>>>>>>> s_job->s_fence ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + s_job-
>>>>>>>>>>>>>>>>>>>>>>>> s_fence-
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parent :
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +NULL;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
More information about the amd-gfx
mailing list