[PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
Christian König
christian.koenig at amd.com
Mon Jan 13 09:55:08 UTC 2025
Am 13.01.25 um 09:43 schrieb Philipp Stanner:
> [SNIP]
>>> The handling of NULL values is half-baked.
>>>
>>> In my opinion, you should define if drm_sched_pick_best() may put a
>>> NULL into
>>> rq. If your answer is yes, it might put a NULL there; then, there
>>> should be a
>>> BUG_ON(!entity->rq) after the invocation of
>>> drm_sched_entity_select_rq().
>>> If your answer is no, the BUG_ON() should be in
>>> drm_sched_pick_best().
>> Yeah good point.
>>
>> We might not want a BUG_ON(), that is only justified when we prevent
>> further damage (e.g. random data corruption or similar).
>>
>> I suggest using a WARN(!shed, "Submission without activated
>> sheduler!").
>> This way the system has at least a chance of survival should the
>> scheduler become ready later on.
>>
>> On the other hand the BUG_ON() or the NULL pointer deref should only
>> kill the application thread which is submitting something before the
>> driver is resumed. So that might help to pinpoint where the actually
>> issue is.
> As I see it the BUG_ON() would just be a more pretty NULL pointer
> deref. If we agree that this is effectively a misuse of the scheduler
> API we probably want to add it to make it more pretty, though?
The only alternative I can see is that the scheduler API gracefully
handles submits to non-ready schedulers. E.g. that
drm_sched_entity_push_job() detects this condition and instead of
pushing the job sets and error code and signals the fences.
But that might not be a good idea.
It just moves the crash from one place to another and in general I fully
agree the driver is misusing the scheduler API to do something which
won't work and potentially crash the whole system.
> @Philipp:
> BTW, I only just discovered this thread by coincidence. Please use
> get_maintainer. The scheduler currently has 4 maintainers, and none of
> them is on CC.
Oh good, point I was already wondering why nobody else commented and
didn't realized that nobody was on CC.
Thanks,
Christian.
>
> Danke,
> P.
>
>> Regards,
>> Christian.
>>
>>> That helps guys with zero domain knowledge, like me, to figure out
>>> how
>>> this is all
>>> supposed to work.
>>>
>>> best regards,
>>> Philipp
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20250113/6aa80f79/attachment.htm>
More information about the dri-devel
mailing list