amdgpu problem after kexec

Alexander E. Patrakov patrakov at gmail.com
Mon Feb 8 03:32:18 UTC 2021


On Thu, Feb 4, 2021 at 09:31, Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 7:56 PM Eric W. Biederman <ebiederm at xmission.com> wrote:
> >
> > Alex Deucher <alexdeucher at gmail.com> writes:
> >
> > > On Wed, Feb 3, 2021 at 3:36 AM Dave Young <dyoung at redhat.com> wrote:
> > >>
> > >> Hi Baoquan,
> > >>
> > >> Thanks for ccing.
> > >> On 01/28/21 at 01:29pm, Baoquan He wrote:
> > >> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
> > >> > > Hello,
> > >> > >
> > >> > > I was trying out kexec on my new laptop, which is an HP EliteBook 735
> > >> > > G6. The problem is, amdgpu does not have hardware acceleration after
> > >> > > kexec. Also, strangely, the lines about Bluetooth are missing from
> > >> > > dmesg after kexec, but I have not tried to use Bluetooth on this
> > >> > > laptop yet. I don't know how to debug this; the relevant amdgpu lines
> > >> > > in dmesg are:
> > >> > >
> > >> > > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
> > >> > > test failed on gfx (-110).
> > >> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
> > >> > >
> > >> > > The good and bad dmesg files are attached. Is it a kexec problem (and
> > >> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
> > >> > > need to provide some extra kernel arguments for debugging?
> >
> > The best debugging step I can think of: can you arrange to have the amdgpu
> > modules removed before the final kexec -e?
> >
> > That would tell us if the code to shut down the GPU exists in the rmmod
> > path (aka the .remove method) and is simply missing in the kexec path (aka
> > the .shutdown method).
> >
> >
> > >> > I am not familiar with the graphics component. Adding Dave to CC to see if
> > >> > he has some comments. It would be great if an amdgpu expert could have a look.
> > >>
> > >> It needs amdgpu driver people to help.  Since kexec bypasses
> > >> BIOS/UEFI initialization, we require drivers to implement the .shutdown
> > >> method and test it so that the 2nd kernel works correctly.
> > >
> > > kexec is tricky to make work properly on our GPUs.  The problem is
> > > that there are some engines on the GPU that cannot be re-initialized
> > > once they have been initialized without an intervening device reset.
> > > APUs are even trickier because they share a lot of hardware state with
> > > the CPU.  Doing lots of extra resets adds latency.  The driver has
> > > code to try and detect if certain engines are running at driver load
> > > time and do a reset before initialization to make this work, but it
> > > apparently is not working properly on your system.
> >
> > There are two cases that I think sometimes get mixed up.
> >
> > There is kexec-on-panic in which case all of the work needs to happen in
> > the driver initialization.
> >
> > There is also a simple kexec in which case some of the work can happen
> > in the kernel that is being shut down, and sometimes that is easier.
> >
> > Does it make sense to reset your device unconditionally on driver removal?
>
> I think we tried that at some point in the past but users complained
> that it added latency or artifacts on the display at shutdown or
> reboot time.
>
> > Would it make sense to reset your device unconditionally on driver add?
>
> Pretty much the same issue there.  It adds latency and you get
> artifacts on the display when the reset happens.
>
> >
> > How can someone debug the smart logic of reset on driver load?
>
> See this block of code in amdgpu_device.c:
>         /* check if we need to reset the asic
>          *  E.g., driver was not cleanly unloaded previously, etc.
>          */
>         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
>                 r = amdgpu_asic_reset(adev);
>                 if (r) {
>                         dev_err(adev->dev, "asic reset on init failed\n");
>                         goto failed;
>                 }
>         }
>
> You'll want to see if amdgpu_asic_need_reset_on_init() was able to
> determine that the asic needs a reset.  If it does,
> amdgpu_asic_reset() gets called to reset it.
> The tricky thing is that some reset methods require a fair amount of
> driver state and so, they are only possible when the driver is up and
> running.  Those methods are not necessarily available at driver load
> time because we need to reset the GPU before we can initialize it and
> determine that state, so we end up in a kind of catch-22.
> Unfortunately, generic PCI resets don't necessarily work on many of
> our GPUs so that's not an option either.
>
> Alex

Sorry for the delayed reply; I was distracted.

Anyway, I managed to unload the amdgpu module successfully, using this
script (as /usr/lib/systemd/system-shutdown/debug.sh):

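#!/bin/sh
# Remount the root filesystem read-write so the log can be saved.
mount -o remount,rw /
# Unbind the vtcon1 console so that amdgpu is no longer in use.
echo 0 > /sys/class/vtconsole/vtcon1/bind
# Unload amdgpu; on success, leave a marker in the kernel log (<4> = warning).
rmmod amdgpu && echo '<4>==== Succeeded removing amdgpu module ====' > /dev/kmsg
# Save the kernel log, then return the root filesystem to read-only.
dmesg > "/var/log/shutdown-log-$(date +%Y%m%d-%H%M%S)"
mount -o remount,ro /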
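
The success marker above depends only on the exit status of rmmod. A minimal sketch of an independent check, assuming /proc/modules is readable at that point in shutdown, might look like the following. The helper name module_loaded is hypothetical and is not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original script):
# succeeds if the named module appears in /proc/modules. An alternative
# file can be passed as the second argument, e.g. a saved copy.
module_loaded() {
    # Each line of /proc/modules starts with the module name and a space.
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

In the shutdown script it could be called right after the rmmod line, for example as `module_loaded amdgpu && echo '<4>==== amdgpu is still loaded ====' > /dev/kmsg`.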

At the end of a non-kexec boot, it logs this:

[  116.512621] Console: switching to colour dummy device 80x25
[  116.518591] amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
[  116.644899] [drm:dal_irq_service_dummy_ack [amdgpu]] *ERROR* dal_irq_service_dummy_ack: called for non-implemented irq source
[  116.645168] [drm:dal_irq_service_dummy_set [amdgpu]] *ERROR* dal_irq_service_dummy_set: called for non-implemented irq source
[  116.658515] [drm] free PSP TMR buffer
[  116.706265] [TTM] Zone  kernel: Used memory at exit: 0 KiB
[  116.706276] [TTM] Zone   dma32: Used memory at exit: 0 KiB
[  116.706280] [drm] amdgpu: ttm finalized
[  116.740460] ==== Succeeded removing amdgpu module ====

However, the next kexec-based boot still lacks hardware acceleration.
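
For anyone repeating the experiment, a minimal sketch of how the result could be checked from the kernel log after the kexec boot is shown below. The matched strings are taken from the amdgpu error lines quoted in the original report. The function name check_ib_test is hypothetical, and a clean result here only means those two messages are absent, not that acceleration works.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the thread): reads a
# kernel log on stdin and reports whether it contains either of the
# IB test failure messages quoted in the original report.
check_ib_test() {
    if grep -q -e 'ib ring test failed' -e 'IB test failed'; then
        echo "IB ring test failed"
        return 1
    fi
    echo "no IB ring test failure found"
    return 0
}
```

Typical use would be `dmesg | check_ib_test`, or `check_ib_test < saved-dmesg.txt` for a saved log.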



--
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK

