[PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Christian König ckoenig.leichtzumerken at gmail.com
Wed Nov 10 10:09:50 UTC 2021


On 10.11.21 10:50, Daniel Vetter wrote:
> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
>> On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>>> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
>>>> I stumbled across this thread when I ran into the same issue, while
>>>> working out how to move drm/msm to use scheduler's retire +
>>>> timeout/recovery (and get rid of our own mirror list of in-flight
>>>> jobs).  We already have hw error detection enabled, and it can signal
>>>> quite fast, so assuming the first job on the list is the guilty job
>>>> just won't work.
>>>>
>>>> But I was considering a slightly different approach to fixing this,
>>>> instead just handling it all in drm_sched_main() and getting rid of
>>>> the complicated kthread parking gymnastics.  Ie. something along the
>>>> lines of:
>>> So handling timeouts in the main sched thread won't work as soon as you
>>> have multiple engines and a reset that impacts multiple engines:
>>>
>>> - Nothing is simplified since you still need to stop the other scheduler
>>>    threads.
>>>
>>> - You get deadlocks if 2 schedulers time out at the same time, and both
>>>    want to stop the other one.
>>>
>>> Hence the workqueue. Now the rule for the wq is that you can only have one
>>> per reset domain, so
>>> - for a single engine you just take the one drm/sched provides
>>> - if a reset affects all the engines in the chip, then you allocate one in
>>>    the drm_device and pass that to all of them
>>> - if you have a complex of gpus that are all interconnected (e.g. an xgmi
>>>    hive for amd), then it's one wq for the entire hive
>>>
>>> _All_ reset-related things must be run on that workqueue or things break,
>>> which means that if you get a hw fault, handling it also needs to run there.
>>> I guess we should either patch drm/sched to check that you call that
>>> function from the right workqueue, or just handle it internally.
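
For reference, the per-reset-domain wq rule above translates into driver code
roughly like this (just a sketch, assuming the 5.15-era drm_sched_init()
signature with the timeout_wq parameter; the my_* names, limits and ops are
placeholders, not real API):

---------------------
#include <drm/drm_device.h>
#include <drm/gpu_scheduler.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Placeholder values and ops - driver specific, shown only for illustration. */
#define NUM_ENGINES             4
#define MY_HW_SUBMISSION_LIMIT 16
#define MY_HANG_LIMIT           2
#define MY_TIMEOUT_MS         500

extern const struct drm_sched_backend_ops my_sched_ops; /* run_job/timedout_job/free_job */

struct my_device {
	struct drm_device base;
	struct workqueue_struct *reset_wq;
	struct drm_gpu_scheduler sched[NUM_ENGINES];
};

static int my_init_schedulers(struct my_device *mdev)
{
	int i, ret;

	/* Ordered wq: at most one timeout/reset handler runs at a time for
	 * the whole reset domain. */
	mdev->reset_wq = alloc_ordered_workqueue("my-reset-domain", 0);
	if (!mdev->reset_wq)
		return -ENOMEM;

	for (i = 0; i < NUM_ENGINES; i++) {
		/* Every engine passes the same timeout_wq, so their TDR
		 * handlers are serialized against each other. */
		ret = drm_sched_init(&mdev->sched[i], &my_sched_ops,
				     MY_HW_SUBMISSION_LIMIT, MY_HANG_LIMIT,
				     msecs_to_jiffies(MY_TIMEOUT_MS),
				     mdev->reset_wq, NULL, "my-ring");
		if (ret)
			return ret;
	}

	return 0;
}
---------------------

A single-engine driver would instead pass NULL for timeout_wq and let
drm/sched fall back to the system workqueue, and an xgmi-hive style setup
would share one such wq across all devices in the hive.
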
>> Hmm, ok.. I guess it would be useful to better document the reasoning
>> for the current design; that would have steered me more towards the
>> approach taken in this patch.
> Maybe this was because you worked on an old kernel? Boris did update the
> kerneldoc as part of making gpu reset work for panfrost, which has this
> multi-engine reset problem. If that's not yet clear then we need to
> improve the docs further.
>
> AMD's problem is even worse, because their reset domain is the entire xgmi
> hive, so multiple pci devices.

I've been pushing for quite a while for something like an
amdgpu_reset_domain structure or similar for this, but we unfortunately
don't have that yet.

Maybe it would be a good idea to have something like a drm_sched_domain
or similar, carrying all the information necessary for the inter-scheduler
handling.

E.g. a workqueue for reset handling, etc.
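
A very rough sketch of what that could look like, purely illustrative (none
of this exists yet and the names are made up):

---------------------
/* One instance per reset domain: a single engine, a whole device, or an
 * entire xgmi hive, depending on how far a reset reaches. */
struct drm_sched_domain {
	/* Ordered workqueue all timeout/reset work of the domain runs on. */
	struct workqueue_struct	*reset_wq;
	/* Schedulers belonging to this reset domain. */
	struct list_head	schedulers;
	/* Protects the scheduler list. */
	struct mutex		lock;
};
---------------------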

Regards,
Christian.

>
> Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
> we should have comments in the code explaining why it all works.
> -Daniel
>
>> BR,
>> -R
>>
>>> -Daniel
>>>
>>>> ---------------------
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 67382621b429..4d6ce775c316 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
>>>>          return false;
>>>>   }
>>>>
>>>> +static bool handle_timeout(struct drm_gpu_scheduler *sched)
>>>> +{
>>>> +       struct drm_sched_job *bad;
>>>> +
>>>> +       if (!sched->has_timeout)
>>>> +               return false;
>>>> +
>>>> +       sched->has_timeout = false;
>>>> +
>>>> +       spin_lock(&sched->job_list_lock);
>>>> +       bad = list_first_entry_or_null(&sched->pending_list,
>>>> +                                      struct drm_sched_job, list);
>>>> +
>>>> +       if (!bad) {
>>>> +               spin_unlock(&sched->job_list_lock);
>>>> +               return false;
>>>> +       }
>>>> +
>>>> +       spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +       if (sched->timeout_wq == system_wq) {
>>>> +               /*
>>>> +                * If driver has no specific requirements about serializing
>>>> +                * reset wrt. other engines, just call timedout_job() directly
>>>> +                */
>>>> +               sched->ops->timedout_job(bad);
>>>> +       } else {
>>>> +               /*
>>>> +                * Otherwise queue it on timeout_wq and wait for it to complete
>>>> +                */
>>>> +               ... more typing needed here ...
>>>> +       }
>>>> +
>>>> +       if (sched->free_guilty) {
>>>> +               sched->ops->free_job(bad);
>>>> +               sched->free_guilty = false;
>>>> +       }
>>>> +
>>>> +       return true;
>>>> +}
>>>> +
>>>>   /**
>>>>    * drm_sched_main - main scheduler thread
>>>>    *
>>>> @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
>>>>
>>>>                  wait_event_interruptible(sched->wake_up_worker,
>>>>                                           (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
>>>> +                                        handle_timeout(sched) ||
>>>>                                           (!drm_sched_blocked(sched) &&
>>>>                                            (entity = drm_sched_select_entity(sched))) ||
>>>>                                           kthread_should_stop());
>>>> ---------------------
>>>>
>>>> drm_sched_fault() and the sw timeout handler would just set
>>>> sched->has_timeout and kick sched->wake_up_worker.
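
As a sketch, that producer side could look like the following. Note that this
is the variant proposed here, not what drm_sched_fault() currently does
upstream (today it queues the tdr work immediately), and has_timeout is the
new field suggested above, not an existing drm_gpu_scheduler member. A real
implementation would also need to think about ordering/atomicity of the flag.

---------------------
/* Hardware fault interrupt: flag the scheduler and wake the main thread,
 * which then runs handle_timeout() from its wait_event condition. */
void drm_sched_fault(struct drm_gpu_scheduler *sched)
{
	sched->has_timeout = true;
	wake_up_interruptible(&sched->wake_up_worker);
}

/* Software timeout (work_tdr) handler: same thing, just triggered by the
 * timer instead of a hw fault. */
static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);

	sched->has_timeout = true;
	wake_up_interruptible(&sched->wake_up_worker);
}
---------------------
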
>>>>
>>>> And since we handle the timeout case after
>>>> drm_sched_get_cleanup_job(), we know that all of the successfully
>>>> completed jobs have already been popped off the list, and won't be
>>>> unfairly maligned.
>>>>
>>>> BR,
>>>> -R
>>>>
>>>> On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu at amd.com> wrote:
>>>>> [AMD Official Use Only]
>>>>>
>>>>> Okay, I will re-prepare this patch
>>>>>
>>>>> Thanks
>>>>>
>>>>> ------------------------------------------
>>>>> Monk Liu | Cloud-GPU Core team
>>>>> ------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Daniel Vetter <daniel at ffwll.ch>
>>>>> Sent: Tuesday, August 31, 2021 9:02 PM
>>>>> To: Liu, Monk <Monk.Liu at amd.com>
>>>>> Cc: amd-gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen at amd.com>
>>>>> Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>>>>
>>>>> On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
>>>>>> Can we please have some actual commit message here, with detailed
>>>>>> explanation of the race/bug/whatever, how you fix it and why this is
>>>>>> the best option?
>>>>>>
>>>>>> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>>>>>>> tested-by: jingwen chen <jingwen.chen at amd.com>
>>>>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>>>>> Signed-off-by: jingwen chen <jingwen.chen at amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>>>>>   1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index ecf8140..894fdb24 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>      sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>
>>>>>>>      /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>>>> +   if (!__kthread_should_park(sched->thread))
>>>>>> This is a __ function, i.e. considered internal, and it's lockless
>>>>>> atomic, i.e. unordered. And you're not explaining why this works.
>>>>>>
>>>>>> Iow it's probably buggy, and just unconditionally parking the
>>>>>> kthread is probably the right thing to do. If it's not the right thing
>>>>>> to do, there's a bug here for sure.
>>>>> Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park calls now; that's probably one too many.
>>>>> -Daniel
>>>>>
>>>>>>> +           kthread_park(sched->thread);
>>>>>>> +
>>>>>>>      spin_lock(&sched->job_list_lock);
>>>>>>>      job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>                                     struct drm_sched_job, list);
>>>>>>>
>>>>>>>      if (job) {
>>>>>>> -           /*
>>>>>>> -            * Remove the bad job so it cannot be freed by concurrent
>>>>>>> -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>> -            * is parked at which point it's safe.
>>>>>>> -            */
>>>>>>> -           list_del_init(&job->list);
>>>>>>>              spin_unlock(&sched->job_list_lock);
>>>>>>>
>>>>>>> +           /* vendor's timeout_job should call drm_sched_start() */
>>>>>>>              status = job->sched->ops->timedout_job(job);
>>>>>>>
>>>>>>>              /*
>>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>      kthread_park(sched->thread);
>>>>>>>
>>>>>>>      /*
>>>>>>> -    * Reinsert back the bad job here - now it's safe as
>>>>>>> -    * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>> -    * bad job at this point - we parked (waited for) any in progress
>>>>>>> -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>> -    * now until the scheduler thread is unparked.
>>>>>>> -    */
>>>>>>> -   if (bad && bad->sched == sched)
>>>>>>> -           /*
>>>>>>> -            * Add at the head of the queue to reflect it was the earliest
>>>>>>> -            * job extracted.
>>>>>>> -            */
>>>>>>> -           list_add(&bad->list, &sched->pending_list);
>>>>>>> -
>>>>>>> -   /*
>>>>>>>       * Iterate the job list from later to  earlier one and either deactive
>>>>>>>       * their HW callbacks or remove them from pending list if they already
>>>>>>>       * signaled.
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>> --
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch


