[PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

Wed May 28 09:55:27 UTC 2025

On Mon Jan 13, 2025 at 1:55 AM PST, Christian König wrote:
> Am 13.01.25 um 09:43 schrieb Philipp Stanner:
>> [SNIP]
>>>> The handling of NULL values is half-baked.
>>>>
>>>> In my opinion, you should define if drm_sched_pick_best() may put a
>>>> NULL into rq. If your answer is yes, it might put a NULL there; then,
>>>> there should be a BUG_ON(!entity->rq) after the invocation of
>>>> drm_sched_entity_select_rq().
>>>> If your answer is no, the BUG_ON() should be in
>>>> drm_sched_pick_best().
>>> Yeah good point.
>>>
>>> We might not want a BUG_ON(), that is only justified when we prevent
>>> further damage (e.g. random data corruption or similar).
>>>
>>> I suggest using a WARN(!sched, "Submission without activated
>>> scheduler!").
>>> This way the system has at least a chance of survival should the
>>> scheduler become ready later on.
>>>
>>> On the other hand the BUG_ON() or the NULL pointer deref should only
>>> kill the application thread which is submitting something before the
>>> driver is resumed. So that might help to pinpoint where the actual
>>> issue is.
>> As I see it, the BUG_ON() would just be a prettier NULL pointer
>> deref. If we agree that this is effectively a misuse of the scheduler
>> API, we probably want to add it to make the failure prettier, though?
>
> The only alternative I can see is that the scheduler API gracefully
> handles submits to non-ready schedulers. E.g. that
> drm_sched_entity_push_job() detects this condition and, instead of
> pushing the job, sets an error code and signals the fences.
>
> But that might not be a good idea.
>
> It just moves the crash from one place to another, and in general I fully
> agree the driver is misusing the scheduler API to do something which
> won't work and could potentially crash the whole system.
>
>> @Philipp:
>> BTW, I only just discovered this thread by coincidence. Please use
>> get_maintainer. The scheduler currently has 4 maintainers, and none of
>> them is on CC.
>
> Oh, good point. I was already wondering why nobody else commented and
> didn't realize that nobody was on CC.
>
> Thanks,
> Christian.

I'm only seeing this mail exchange months after the fact because I was
linked to it by someone on IRC, and I am making a wild guess here.

Could this suspend/resume issue have a cause similar to the panics and
SMU hangs I was experiencing with my own issue? That issue is known to
have the same workaround for both 6000 and 7000 series users, and a
specific kernel commit seems to affect it as well.

Could you test whether you can still reproduce the error after disabling
GFXOFF states with the following kernel command line override, and
report back?

amdgpu.ppfeaturemask=0xfff73fff

Or is this something that was already solved long ago? Since this
particular thread died back in January, I guess nothing has happened
since?
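
For reference, here is a minimal sketch of where the mask value above
comes from. It assumes the default amdgpu.ppfeaturemask is 0xfff7bfff and
that GFXOFF is bit 15 (PP_GFXOFF_MASK = 0x8000), as in the kernel's
drivers/gpu/drm/amd/include/amd_shared.h. Both values should be checked
against the kernel tree actually in use. The helper pp_mask_without() is
a hypothetical name used only for illustration, not a kernel function.

```c
#include <stdint.h>

/*
 * Assumed values, mirroring enum PP_FEATURE_MASK in
 * drivers/gpu/drm/amd/include/amd_shared.h. Verify against your tree.
 */
#define PP_GFXOFF_MASK        0x8000u      /* bit 15: GFXOFF power state */

/* Assumed default of the amdgpu.ppfeaturemask module parameter. */
#define PPFEATUREMASK_DEFAULT 0xfff7bfffu

/*
 * Hypothetical helper (not part of the kernel): return 'mask' with the
 * given feature bits cleared, i.e. with those features disabled.
 */
static uint32_t pp_mask_without(uint32_t mask, uint32_t features)
{
	return mask & ~features;
}
```

Under those assumptions, pp_mask_without(PPFEATUREMASK_DEFAULT,
PP_GFXOFF_MASK) gives 0xfff73fff, i.e. the default mask with only the
GFXOFF bit cleared.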

>
>>
>> Danke,
>> P.
>>
>>> Regards,
>>> Christian.
>>>
>>>> That helps guys with zero domain knowledge, like me, to figure out
>>>> how this is all supposed to work.
>>>>
>>>> best regards,
>>>>    Philipp