hyper_bf soft lockup on Azure Gen2 VM when taking kdump or executing kexec

Thomas Tai <thomas.tai at oracle.com>
Thu Feb 6 23:20:24 UTC 2025


On 2025-02-06 4:00 p.m., Michael Kelley wrote:
> From: Michael Kelley<mhklinux at outlook.com>
>> From: Thomas Tai<thomas.tai at oracle.com>  Sent: Thursday, January 30, 2025 12:44 PM
>>>> -----Original Message-----
>>>> From: Michael Kelley<mhklinux at outlook.com>  Sent: Thursday, January 30, 2025 3:20 PM
>>>>
>>>> From: Thomas Tai<thomas.tai at oracle.com>  Sent: Thursday, January 30, 2025 10:50 AM
>>>>> Sorry for the typo in the subject title. It should have been 'hyperv_fb soft lockup on
>>>>> Azure Gen2 VM when taking kdump or executing kexec'
>>>>>
>>>>> Thomas
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> We see an issue with the mainline kernel on the Azure Gen 2 VM when
>>>>>> trying to induce a kernel panic with sysrq commands. The VM would hang
>>>>>> with soft lockup. A similar issue happens when executing kexec on the VM.
>>>>>> This issue is seen only with Gen2 VMs (with UEFI boot). Gen1 VMs with BIOS
>>>>>> boot are fine.
>>>>>>
>>>>>> git bisect identifies the issue is caused by the commit 20ee2ae8c5899
>>>>>> ("fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem()").
>>>>>> However, reverting the commit would cause the frame buffer not to work
>>>>>> on the Gen2 VM.
>>>>>>
>>>>>> Do you have any hints on what caused this issue?
>>>>>>
>>>>>> To reproduce the issue with kdump:
>>>>>> - Install mainline kernel on an Azure Gen 2 VM and trigger a kdump
>>>>>> - echo 1 > /proc/sys/kernel/sysrq
>>>>>> - echo c > /proc/sysrq-trigger
>>>>>>
>>>>>> To reproduce the issue with executing kexec:
>>>>>> - Install mainline kernel on Azure Gen 2 VM and use kexec
>>>>>> - sudo kexec -l /boot/vmlinuz --initrd=/boot/initramfs.img --command-line="$( cat /proc/cmdline )"
>>>>>> - sudo kexec -e
>>>>>>
>>>>>> Thank you,
>>>>>> Thomas
>>>> I will take a look, but it might be early next week before I can do so.
>>>>
>>> Thank you, Michael for your help!
>>>
>>>> It looks like your soft lockup log below is from the kdump kernel (or the newly
>>>> kexec'ed kernel). Can you confirm? Also, this looks like a subset of the full log.
>>> Yes, the soft lockup log below is from the kdump kernel.
>>>
>>>> Do you have the full serial console log that you could email to me?  Seeing
>>>> everything might be helpful. Of course, I'll try to repro the problem myself
>>>> as well.
>>> I have attached the complete bootup and kdump kernel log.
>>>
>>> File: bootup_and_kdump.log
>>> Line 1 ... 984 (bootup log)
>>> Line 990       (kdump kernel booting up)
>>> Line 1351      (soft lockup)
>>>
>>> Thank you,
>>> Thomas
>>>
>> I have reproduced the problem in an Azure VM running Oracle Linux
>> 9.4 with the 6.13.0 kernel. Interestingly, the problem does not occur
>> in a VM running on a locally installed Hyper-V with Ubuntu 20.04 and
>> the 6.13.0 kernel. There are several differences in the two
>> environments:  the version of Hyper-V, the VM configuration, the Linux
>> distro, and the .config file used to build the 6.13.0 kernel. I'll try to
>> figure out what makes the difference, and then the root cause.
>>
> This has been a real bear to investigate. :-(  The key observation
> is that with older kernel versions, the efifb driver does *not* try
> to load when running in the kdump kernel, and everything works.
> In newer kernels, the efifb driver *does* try to load, and it appears
> to hang.  (Actually, it is causing the VM to run very slowly. More on
> that in a minute.)
>
> I've bisected the kernel again, compensating for the fact that commit
> 20ee2ae8c5899 is needed to make the Hyper-V frame buffer work. With
> that compensation, the actual problematic commit is 2bebc3cd4870
> (Revert "firmware/sysfb: Clear screen_info state after consuming it").
> Doing the revert causes screen_info.orig_video_isVGA to retain its value
> of 0x70 (VIDEO_TYPE_EFI), which the kdump kernel picks up, causing it
> to load the efifb driver.
>
> Then the question is why the efifb driver doesn't work in the kdump
> kernel. Actually, it *does* work in many cases. I built the 6.13.0 kernel
> on the Oracle Linux 9.4 system, and transferred the kernel image binary
> and module binaries to an Ubuntu 20.04 VM in Azure. In that VM, the
> efifb driver is loaded as part of the kdump kernel, and it doesn't cause
> any problems. But there's an interesting difference. In the Oracle Linux
> 9.4 VM, the efifb driver finds the framebuffer at 0x40000000, while on
> the Ubuntu 20.04 VM, it finds the framebuffer at 0x40900000. This
> difference is due to differences in how the screen_info variable gets
> set up in the two VMs.
>
> When the normal kernel starts in a freshly booted VM, Hyper-V provides
> the EFI framebuffer at 0x40000000, and it works. But after the Hyper-V
> FB driver or Hyper-V DRM driver has initialized, Linux has picked a
> different MMIO address range and told Hyper-V to use the new
> address range (which often starts at 0x40900000). A kexec does *not*
> reset Hyper-V's transition to the new range, so when the efifb driver
> tries to use the framebuffer at 0x40000000, the accesses trap to
> Hyper-V and probably fail or time out (I'm not sure of the details). After
> the guest does some number of these bad references, Hyper-V considers
> itself to be under attack from an ill-behaving guest, and throttles the
> guest so that it doesn't run for a few seconds. The throttling repeats,
> and results in extremely slow running in the kdump kernel.
>
> Somehow in the Ubuntu 20.04 VM, the location of the frame buffer
> as stored in screen_info.lfb_base gets updated to be 0x40900000. I
> haven't fully debugged how that happens. But with that update, the
> efifb driver is using the updated framebuffer address and it works. On
> the Oracle Linux 9.4 system, that update doesn't appear to happen,
> and the problem occurs.

Thanks, Michael, for the detailed information. I will ask if anyone in
Oracle can help debug the difference between Ubuntu and Oracle that may
have caused the issue.

Thomas
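
One way to compare the two VMs is to check which physical range the
framebuffer driver has claimed in each kdump kernel. The following is a
minimal sketch, not a tested diagnostic: it assumes the claimed range shows
up in /proc/iomem under a name containing "efifb" (the label to search for is
a parameter, so it can be changed), and the SAMPLE text is a hypothetical
illustration, not output captured from a real VM. The helper names are made
up for this example.

```python
import re

# A /proc/iomem line has the form "  40000000-40ffffff : some name",
# with both addresses in hexadecimal.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*\S)\s*$")


def find_ranges(iomem_text, label):
    """Return (start, end, name) for each range whose name contains label."""
    hits = []
    for line in iomem_text.splitlines():
        m = IOMEM_LINE.match(line)
        if m and label.lower() in m.group(3).lower():
            hits.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return hits


def covers(ranges, addr):
    """True if addr falls inside any of the given ranges."""
    return any(start <= addr <= end for start, end, _ in ranges)


def read_iomem(path="/proc/iomem"):
    """Read the iomem table; run as root, otherwise addresses read as zero."""
    with open(path) as f:
        return f.read()


# Hypothetical sample for illustration only, NOT from a real VM.
SAMPLE = """\
00100000-3fffffff : System RAM
40900000-410fffff : efifb
"""
```

On a live system, `find_ranges(read_iomem(), "efifb")` could be run in both
kdump kernels, and `covers(..., 0x40000000)` versus
`covers(..., 0x40900000)` would show which of the two addresses mentioned
above is actually in use.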


>
> This is an interim update on the problem. I'm still investigating how
> screen_info.lfb_base is set in the kdump kernel, and why it is different
> in the Ubuntu 20.04 VM vs. in the Oracle Linux 9.4 VM. Once that is
> well understood, we can contemplate how to fix the problem. Undoing
> the revert that is commit 2bebc3cd4870 doesn't seem like the solution
> since the original code there was reported to cause many other issues.
> The solution focus will likely be on how to ensure the kdump kernel gets
> the correct framebuffer address so the efifb driver works, since the
> framebuffer address changing is a quirk of Hyper-V behavior.
>
> If anyone else has insight into what's going on here, please chime in.
> What I've learned so far is still somewhat tentative.
>
> Michael
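
The three outcomes described in the message above can be summarized as a
small decision table. This is only a model of the reasoning in the thread,
not kernel code: the function and its return strings are hypothetical, and it
assumes that a cleared screen_info reads as 0 in orig_video_isVGA. The value
0x70 (VIDEO_TYPE_EFI) and the two addresses are the ones quoted above.

```python
VIDEO_TYPE_EFI = 0x70  # value of orig_video_isVGA quoted in the thread


def kdump_fb_state(orig_video_is_vga, lfb_base, active_fb_base):
    """Model what the kdump kernel sees for the EFI framebuffer.

    orig_video_is_vga: screen_info.orig_video_isVGA handed to the kdump kernel
    lfb_base:          screen_info.lfb_base handed to the kdump kernel
    active_fb_base:    address Hyper-V is currently using for the framebuffer
    """
    if orig_video_is_vga != VIDEO_TYPE_EFI:
        # screen_info was cleared (assumed 0), so efifb does not try to load.
        return "efifb not loaded"
    if lfb_base == active_fb_base:
        # screen_info was updated to the relocated framebuffer.
        return "efifb loaded, address current"
    # efifb touches the old range; accesses trap to Hyper-V and get throttled.
    return "efifb loaded, address stale"


# Older kernels: screen_info cleared after the first kernel consumed it.
older = kdump_fb_state(0, 0, 0x40900000)
# Oracle Linux 9.4 VM: lfb_base still points at the original range.
oracle = kdump_fb_state(VIDEO_TYPE_EFI, 0x40000000, 0x40900000)
# Ubuntu 20.04 VM: lfb_base was updated to the relocated range.
ubuntu = kdump_fb_state(VIDEO_TYPE_EFI, 0x40900000, 0x40900000)
```

Only the second case, where efifb loads with a stale address, corresponds to
the slow-running kdump kernel reported in this thread.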