[PATCH 0/2] Recover from failure to probe GPU

Christian König christian.koenig at amd.com
Sun Dec 25 15:30:49 UTC 2022


On 24.12.22 at 10:34, Thomas Zimmermann wrote:
> Hi
>
> On 22.12.22 at 19:30, Mario Limonciello wrote:
>> One of the first things that KMS drivers do during initialization is
>> destroy the system firmware framebuffer by means of
>> `drm_aperture_remove_conflicting_pci_framebuffers`.
>>
>> This means that if for any reason the GPU fails to probe, the user
>> will be stuck with, at best, a screen frozen at the last thing that
>> was shown before the KMS driver continued its probe.
>>
>> The problem is most pronounced when new GPU support is introduced
>> because users will need to have a recent linux-firmware snapshot
>> on their system when they boot a kernel with matching support.
>>
>> However, the problem is further exacerbated in the case of amdgpu because
>> it has migrated to "IP discovery", where amdgpu will attempt to load
>> on "ALL" AMD GPUs even if the driver is missing support for IP blocks
>> contained in that GPU.
>>
>> IP discovery requires some probing and isn't run until after the
>> framebuffer has been destroyed.
>>
>> This means a situation can occur where a user purchases a new GPU not
>> yet supported by a distribution and when booting the installer it will
>> "freeze" even if the distribution doesn't have the matching kernel
>> support for those IP blocks.
>>
>> The perfect example of this is Ubuntu 22.10 and the new dGPUs just
>> launched by AMD. The installation media ships with kernel 5.19 (which
>> has IP discovery) but the amdgpu support for those IP blocks landed in
>> kernel 6.0. The matching linux-firmware was released after 22.10's
>> launch. The screen will freeze without nomodeset. Even if a user
>> manages to install and then upgrades to kernel 6.0 after install,
>> they'll still have the problem of missing firmware, and the same
>> experience.
>>
>> This is quite jarring for users, particularly if they don't know
>> that they have to use "nomodeset" to install.
>>
>> To help the situation, allow drivers to re-run the init process for the
>> firmware framebuffer during a failed probe. As this problem is most
>> pronounced with amdgpu, this is the only driver changed.
>>
>> But if this makes sense more generally for other KMS drivers, the call
>> can be added to the cleanup routine for those too.
>
> Just a quick drive-by comment: as Javier noted, at some point while
> probing, your driver has changed the device's state and the system FB
> will be gone. You cannot reestablish the sysfb after that.

I was about to note exactly that as well. This effort here is 
unfortunately pretty pointless.

>
> You are, however, free to read device state at any time, as long as it
> has no side effects.
>
> So why not just move the call to 
> drm_aperture_remove_conflicting_pci_framebuffers() to a later point 
> when you know that your driver supports the hardware? That's the 
> solution we always proposed to this kind of problem. It's safe and 
> won't require any changes to the aperture helpers.

If I'm not completely mistaken, that's a little bit tricky. Currently
it's not possible to read the discovery table before disabling the VGA
and/or current framebuffer.

We might be able to do this, but it's probably not easy.
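
To make the ordering question concrete, here is a rough user-space toy model of the two probe orderings being discussed. It is only a sketch and not amdgpu code: struct toy_gpu and every toy_* function are made-up names, toy_remove_sysfb() merely stands in for drm_aperture_remove_conflicting_pci_framebuffers(), and it assumes the support check can be done without side effects, which is exactly the part that is not easy today.

```c
#include <stdbool.h>

/* Toy model of a GPU during probe. All names are hypothetical. */
struct toy_gpu {
	bool sysfb_alive;         /* firmware framebuffer still usable */
	bool ip_blocks_supported; /* driver knows every IP block on the device */
	bool firmware_present;    /* required firmware files are installed */
};

/* Stand-in for drm_aperture_remove_conflicting_pci_framebuffers(). */
static void toy_remove_sysfb(struct toy_gpu *gpu)
{
	gpu->sysfb_alive = false;
}

/* Read-only check; assumed (for this sketch) to have no side effects. */
static bool toy_hw_supported(const struct toy_gpu *gpu)
{
	return gpu->ip_blocks_supported && gpu->firmware_present;
}

/* Ordering as it is today: remove the framebuffer, then find out. */
static int toy_probe_early_removal(struct toy_gpu *gpu)
{
	toy_remove_sysfb(gpu);
	if (!toy_hw_supported(gpu))
		return -2; /* -ENOENT as in the log; the sysfb is already gone */
	return 0;
}

/* Suggested ordering: only remove the framebuffer once support is known. */
static int toy_probe_late_removal(struct toy_gpu *gpu)
{
	if (!toy_hw_supported(gpu))
		return -2; /* fail early; the sysfb is left untouched */
	toy_remove_sysfb(gpu);
	return 0;
}
```

With missing firmware, the early-removal variant returns -2 with sysfb_alive already false, while the late-removal variant returns -2 and leaves the firmware framebuffer in place.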

Regards,
Christian.


>
> Best regards
> Thomas
>
>>
>> Here is a sample of what happens with missing GPU firmware and this
>> series:
>>
>> [    5.950056] amdgpu 0000:63:00.0: vgaarb: deactivate vga console
>> [    5.950114] amdgpu 0000:63:00.0: enabling device (0006 -> 0007)
>> [    5.950883] [drm] initializing kernel modesetting (YELLOW_CARP 0x1002:0x1681 0x17AA:0x22F1 0xD2).
>> [    5.952954] [drm] register mmio base: 0xB0A00000
>> [    5.952958] [drm] register mmio size: 524288
>> [    5.954633] [drm] add ip block number 0 <nv_common>
>> [    5.954636] [drm] add ip block number 1 <gmc_v10_0>
>> [    5.954637] [drm] add ip block number 2 <navi10_ih>
>> [    5.954638] [drm] add ip block number 3 <psp>
>> [    5.954639] [drm] add ip block number 4 <smu>
>> [    5.954641] [drm] add ip block number 5 <dm>
>> [    5.954642] [drm] add ip block number 6 <gfx_v10_0>
>> [    5.954643] [drm] add ip block number 7 <sdma_v5_2>
>> [    5.954644] [drm] add ip block number 8 <vcn_v3_0>
>> [    5.954645] [drm] add ip block number 9 <jpeg_v3_0>
>> [    5.954663] amdgpu 0000:63:00.0: amdgpu: Fetched VBIOS from VFCT
>> [    5.954666] amdgpu: ATOM BIOS: 113-REMBRANDT-X37
>> [    5.954677] [drm] VCN(0) decode is enabled in VM mode
>> [    5.954678] [drm] VCN(0) encode is enabled in VM mode
>> [    5.954680] [drm] JPEG decode is enabled in VM mode
>> [    5.954681] amdgpu 0000:63:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
>> [    5.954683] amdgpu 0000:63:00.0: amdgpu: PCIE atomic ops is not supported
>> [    5.954724] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>> [    5.954732] amdgpu 0000:63:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
>> [    5.954735] amdgpu 0000:63:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
>> [    5.954738] amdgpu 0000:63:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
>> [    5.954747] [drm] Detected VRAM RAM=512M, BAR=512M
>> [    5.954750] [drm] RAM width 256bits LPDDR5
>> [    5.954834] [drm] amdgpu: 512M of VRAM memory ready
>> [    5.954838] [drm] amdgpu: 15680M of GTT memory ready.
>> [    5.954873] [drm] GART: num cpu pages 262144, num gpu pages 262144
>> [    5.955333] [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
>> [    5.955502] amdgpu 0000:63:00.0: Direct firmware load for amdgpu/yellow_carp_toc.bin failed with error -2
>> [    5.955505] amdgpu 0000:63:00.0: amdgpu: fail to request/validate toc microcode
>> [    5.955510] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
>> [    5.955725] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
>> [    5.955952] amdgpu 0000:63:00.0: amdgpu: amdgpu_device_ip_init failed
>> [    5.955954] amdgpu 0000:63:00.0: amdgpu: Fatal error during GPU init
>> [    5.955957] amdgpu 0000:63:00.0: amdgpu: amdgpu: finishing device.
>> [    5.971162] efifb: probing for efifb
>> [    5.971281] efifb: showing boot graphics
>> [    5.974803] efifb: framebuffer at 0x910000000, using 20252k, total 20250k
>> [    5.974805] efifb: mode is 2880x1800x32, linelength=11520, pages=1
>> [    5.974807] efifb: scrolling: redraw
>> [    5.974807] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
>> [    5.974974] Console: switching to colour frame buffer device 180x56
>> [    5.978181] fb0: EFI VGA frame buffer device
>> [    5.978199] amdgpu: probe of 0000:63:00.0 failed with error -2
>> [    5.978285] [drm] amdgpu: ttm finalized
>>
>> Now if the user loads the firmware into the system they can re-load the
>> driver or re-attach using sysfs and it gracefully recovers.
>>
>> [  665.080480] [drm] Initialized amdgpu 3.49.0 20150101 for 0000:63:00.0 on minor 0
>> [  665.090075] fbcon: amdgpudrmfb (fb0) is primary device
>> [  665.090248] [drm] DSC precompute is not needed.
>>
>> Mario Limonciello (2):
>>    firmware: sysfb: Allow re-creating system framebuffer after init
>>    drm/amd: Re-create firmware framebuffer on failure to probe
>>
>>   drivers/firmware/efi/sysfb_efi.c        |  6 +++---
>>   drivers/firmware/sysfb.c                | 15 ++++++++++++++-
>>   drivers/firmware/sysfb_simplefb.c       |  4 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
>>   include/linux/sysfb.h                   |  5 +++++
>>   5 files changed, 26 insertions(+), 6 deletions(-)
>>
>>
>> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
>


