[PATCH 0/2] Recover from failure to probe GPU

Fri Dec 23 15:51:05 UTC 2022

On 12/22/22 13:41, Javier Martinez Canillas wrote:
> [adding Thomas Zimmermann to CC list]
> 
> Hello Mario,
> 
> Interesting case.
> 
> On 12/22/22 19:30, Mario Limonciello wrote:
>> One of the first things that KMS drivers do during initialization is
>> destroy the system firmware framebuffer by means of
>> `drm_aperture_remove_conflicting_pci_framebuffers`
>>
> 
> The reason why that's done at the very beginning is that there are no
> guarantees that the firmware-provided framebuffer would keep working
> after the real display controller driver re-initializes the IP block.
> 
>> This means that if for any reason the GPU failed to probe the user
>> will be stuck with at best a screen frozen at the last thing that
>> was shown before the KMS driver continued its probe.
>>
>> The problem is most pronounced when new GPU support is introduced
>> because users will need to have a recent linux-firmware snapshot
>> on their system when they boot a kernel with matching support.
>>
> 
> Right. That's a problem indeed but as mentioned there's a gap between
> when the firmware-provided framebuffer is removed and when the real
> driver sets up its framebuffer.
>   
>> However the problem is further exaggerated in the case of amdgpu because
>> it has migrated to "IP discovery" where amdgpu will attempt to load
>> on "ALL" AMD GPUs even if the driver is missing support for IP blocks
>> contained in that GPU.
>>
>> IP discovery requires some probing and isn't run until after the
>> framebuffer has been destroyed.
>>
>> This means a situation can occur where a user purchases a new GPU not
>> yet supported by a distribution and when booting the installer it will
>> "freeze" even if the distribution doesn't have the matching kernel support
>> for those IP blocks.
>>
>> The perfect example of this is Ubuntu 21.10 and the new dGPUs just
>> launched by AMD.  The installation media ships with kernel 5.19 (which
>> has IP discovery) but the amdgpu support for those IP blocks landed in
>> kernel 6.0. The matching linux-firmware was released after 21.10's launch.
>> The screen will freeze without nomodeset. Even if a user manages to install
>> and then upgrades to kernel 6.0 after install they'll still have the
>> problem of missing firmware, and the same experience.

s/21.10/22.10/

>>
>> This is quite jarring for users, particularly if they don't know
>> that they have to use "nomodeset" to install.
>>
> 
> I'm not familiar with AMD GPUs, but could it be possible for this discovery
> and firmware loading step to be done at the beginning, before the firmware
> FB is removed? That way the FB removal will not happen unless that succeeds.

Possible?  I think so, but maybe Alex can comment on this after the 
holidays as he's more familiar.

It would mean splitting and introducing an entirely new phase to driver 
initialization.  The information about the discovery table comes from VRAM.

amdgpu_driver_load_kms -> amdgpu_device_init -> amdgpu_device_ip_early_init

Basically that specific code would have to be called earlier, and then
there would need to be a separate set of code for all the IP blocks to
*just* collect what firmware they need.
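
A rough sketch of what such a "collect only" phase might look like, purely
for illustration. Every name here (ip_block_sketch, fw_lookup_fn,
preflight_firmware, the placeholder firmware file names) is made up and is
not the real amdgpu code; the lookup callback stands in for whatever
mechanism would check that a firmware file is available.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-IP-block description; not the real amdgpu structures. */
struct ip_block_sketch {
	const char *name;
	const char *firmware;	/* NULL if the block needs no firmware */
};

/* Hypothetical stand-in for "is this firmware file available?" */
typedef int (*fw_lookup_fn)(const char *fw_name);

/* Placeholder lookup: pretends only "present.bin" is installed. */
static int lookup_installed_sketch(const char *fw_name)
{
	return strcmp(fw_name, "present.bin") == 0;
}

/*
 * Walk all IP blocks and only check that their firmware can be found.
 * Returns 0 if everything is available, -ENOENT on the first miss.
 * The idea is that the firmware framebuffer would be removed only
 * after this returns 0.
 */
static int preflight_firmware(const struct ip_block_sketch *blocks,
			      size_t n, fw_lookup_fn lookup)
{
	for (size_t i = 0; i < n; i++) {
		if (blocks[i].firmware && !lookup(blocks[i].firmware))
			return -ENOENT;
	}
	return 0;
}
```

The point of the sketch is only the ordering: nothing in the pre-flight
touches the hardware, so a failure leaves the firmware framebuffer intact.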

>   
>> To help the situation, allow drivers to re-run the init process for the
>> firmware framebuffer during a failed probe. As this problem is most
>> pronounced with amdgpu, this is the only driver changed.
>>
>> But if this makes sense more generally for other KMS drivers, the call
>> can be added to the cleanup routine for those too.
>>
> 
> The problem I see is that depending on how far the driver's probe function
> went, it may not be possible to re-run the init process, since the firmware
> provided framebuffer may already have been destroyed or the IP block may
> just be in a half initialized state.
> 
> I'm not against this series if it solves the issue in practice for amdgpu,
> but I don't think it is a general solution and would like to know Thomas'
> opinion on this first as well.

Running with this idea, I'm pretty sure that request_firmware returns 
-ENOENT in this case. So another proposal for when to trigger this flow 
would be to only do it on -ENOENT.  We could then also change 
amdgpu_discovery.c to return -ENOENT when an IP block isn't supported 
instead of the current -EINVAL.

Or we could instead co-opt -ENOTSUPP and remap all the cases that we 
explicitly want the system framebuffer to re-initialize to that.
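
Either variant boils down to a small filter on the probe's return code.
A minimal sketch, assuming a hypothetical helper name
(probe_error_allows_fb_restore) that does not exist in the tree; ENOTSUPP is
defined locally because it is a kernel-internal code (524 in
include/linux/errno.h) and is not part of userspace <errno.h>.

```c
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal error code; defined here so the sketch builds standalone. */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/*
 * Hypothetical helper: decide whether a failed probe should try to
 * re-initialize the system framebuffer, based only on the error code.
 */
static bool probe_error_allows_fb_restore(int err)
{
	switch (err) {
	case -ENOENT:	/* e.g. firmware file not found */
	case -ENOTSUPP:	/* e.g. unsupported IP block, if remapped to this */
		return true;
	default:	/* anything else: hardware state is unknown */
		return false;
	}
}
```

With a filter like this, errors such as -EINVAL from later in probe would
not trigger the re-init, which limits it to the cases where the hardware
most likely has not been touched yet.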