drm scheduler redesign causes deadlocks [extended repost]

Luben Tuikov ltuikov89 at gmail.com
Sat Nov 25 20:03:10 UTC 2023


On 2023-11-24 04:38, Bert Karwatzki wrote:
> On Wednesday, 22.11.2023 at 18:02 -0500, Luben Tuikov wrote:
>> On 2023-11-21 04:00, Bert Karwatzki wrote:
>>> Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
>>> laptop) suffers from random deadlocks which can occur after 30-180 min of
>>> usage. These deadlocks can be actively provoked by creating high system
>>> load (usually by compiling a kernel with make -j NRCPUS) and then opening
>>> instances of libreoffice --writer until the system GUI locks up (the mouse
>>> cursor can still be moved but the screen is frozen). In this state,
>>> ssh'ing into the machine is still possible, and at least sometimes log
>>> messages about hung tasks appear in /var/log/kern.log.
>>>
>>> More info can be found here:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>>>
>>> Using the method described above to trigger the bug, I bisected the
>>> problem in the linux-next and drm-misc trees, which identified commit
>>> f3123c2590005 as the culprit. As this simple patch fixes the problem
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..25b97db1b623 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>                       struct drm_sched_entity *entity)
>>>  {
>>> -       if (drm_sched_entity_is_ready(entity))
>>> -               if (drm_sched_can_queue(sched, entity))
>>> -                       drm_sched_run_job_queue(sched);
>>> +       if (drm_sched_can_queue(sched, entity))
>>> +               drm_sched_run_job_queue(sched);
>>>  }
>>>  
>>>  /**
>>>
>>> the issue might lie in the entity->dependency branch of
>>> drm_sched_entity_is_ready() (some kind of circular dependency ...).
>>>
>>> To see whether the change to drm_sched_wakeup() is the actual cause of the
>>> problem, or whether the problem was caused by the redesign of the drm
>>> scheduler in linux-next-20231115+, I created the following patch for
>>> linux-6.6.0:
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index a42763e1429d..dc2abd299aeb 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>
>>>  	drm_sched_entity_clear_dep(f, cb);
>>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>
>>>  /**
>>> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>  			drm_sched_rq_update_fifo(entity, submit_ts);
>>>
>>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  	}
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 5a3a622fc672..bbe06403b33d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>   *
>>>   * Wake up the scheduler if we can queue jobs.
>>>   */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>>>  {
>>> -	if (drm_sched_can_queue(sched))
>>> -		wake_up_interruptible(&sched->wake_up_worker);
>>> +	if (drm_sched_entity_is_ready(entity))
>>> +		if (drm_sched_can_queue(sched))
>>> +			wake_up_interruptible(&sched->wake_up_worker);
>>>  }
>>>
>>>  /**
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index ac65f0626cfc..6cfe3d193e69 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>                                     unsigned int num_sched_list);
>>>
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>>>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>>>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>>
>>> This brings the extra check to the old scheduler and has so far not caused
>>> any trouble (using the same stress test described above), so chances are
>>> that the error is somewhere else in the redesigned scheduler.
>>>
>>>
>>> Bert Karwatzki
>>
>> Hi Bert,
>>
>> Thanks for looking into this.
>>
>> As an afterthought, removing the "entity_is_ready()" qualifier in the wake-up
>> makes the scheduling more opportunistic, and I agree that that is the more
>> correct approach. Commit f3123c2590005 basically brought the code as close as
>> possible to the way it functioned before the move to run-queues.
>>
>> Would you like to create a patch for this?
> 
> Should I create the patch (is a simple revert of f3123c.. plus an explanatory
> commit message enough?) for the drm-misc tree?

Yes, that'd be awesome!

Please CC all the people on this email, and, if you could, send it with
--in-reply-to= set to the last message in this thread, keeping all the CCs.
An easy way to do this is to just copy the "git send-email" line from
lore.kernel.org. Thanks!

> Also, I did a little more experimentation: I applied f3123c25900 on top
> of commit a6149f0393699 and commit 35963cf2cd25e to see where the problems
> start. The result is that f3.. + a6.. leads to lock-ups while f3.. + 35.. seems
> to be fine, so the problems start with the move from kthread to work queue.

Outstanding! Thank you Bert for doing all this testing.

So while the conclusion might be that a6 is the problem, it really is not.
Scheduling (CPU/GPU) is a fickle thing, and care and deep thought need to be
exercised when implementing it--it really is not a trivial thing.

See the thread responses starting with this email:
https://lore.kernel.org/all/d2bf144f-e388-4cb1-bc18-12efad4f677b@linux.intel.com/

Yes, to avoid lock-ups we should revert f3.

However, I think that we should go further and have drm_sched_wakeup() call
drm_sched_run_job_queue() unconditionally. drm_sched_run_job_work() does its
own checks on whether to continue or dequeue, and traditionally it has been
done like this to avoid races.

Could you also test with,

void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
		      struct drm_sched_entity *entity)
{
	drm_sched_run_job_queue(sched);
}

IOW, traditionally, a scheduler wake-up is uni-directional, i.e.,
	free up a resource,
	wake up the thread waiting to use it.
The freed resource could be, for instance, an item added to a queue, atomically, etc.
What is important, however, is that there is no decision making in this process.
The modified version of drm_sched_wakeup() above achieves this one-directionality.
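
Purely as an illustration of that shape--a minimal userspace sketch using
pthreads, not DRM scheduler code; submit() and worker() here are just made-up
stand-ins for drm_sched_wakeup() and drm_sched_run_job_work()--the wake-up
side provides the resource and signals unconditionally, and all decision
making lives on the consumer side, so a spurious wake-up costs only one extra
check:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int pending;	/* stand-in for "jobs queued to the scheduler" */
static int stopping;

/* Stand-in for drm_sched_wakeup(): provide the resource, wake unconditionally. */
static void submit(void)
{
	pthread_mutex_lock(&lock);
	pending++;			/* free up / provide the resource */
	pthread_mutex_unlock(&lock);
	pthread_cond_signal(&wake);	/* no decision making here */
}

/* Stand-in for drm_sched_run_job_work(): all the checks live here. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	for (;;) {
		while (!pending && !stopping)
			pthread_cond_wait(&wake, &lock); /* extra wake-ups are harmless */
		if (!pending && stopping)
			break;
		pending--;
		pthread_mutex_unlock(&lock);
		printf("ran one job\n");	/* "run the job" outside the lock */
		pthread_mutex_lock(&lock);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	int i;

	pthread_create(&t, NULL, worker, NULL);
	for (i = 0; i < 3; i++)
		submit();

	pthread_mutex_lock(&lock);
	stopping = 1;
	pthread_mutex_unlock(&lock);
	pthread_cond_signal(&wake);
	pthread_join(t, NULL);
	return 0;
}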

In our wake-up, we check whether to schedule work using the same checks the
work item performs once it has been scheduled and is running, and doing them
in the wake-up path can lead to problems. I did mention in the linked thread
above that the worst we get from waking up unconditionally is an extra wake-up
and an extra check in a string of schedules, so the overhead would be
1/#(runs of drm_sched_run_job_work)--for example, one redundant check across a
hundred back-to-back runs--i.e. the longer the string of schedules, the
smaller the overhead.

Thanks again!
-- 
Regards,
Luben