[RFC] How to test panic handlers, without crashing the kernel

Jocelyn Falempe jfalempe at redhat.com
Tue Mar 5 16:31:46 UTC 2024



On 04/03/2024 22:12, John Ogness wrote:
> [Added printk maintainer and kdb folks]
> 
> Hi Jocelyn,
> 
> On 2024-03-01, Jocelyn Falempe <jfalempe at redhat.com> wrote:
>> While writing a panic handler for drm devices [1], I needed a way to
>> test it without crashing the machine.
>> So from debugfs, I called
>> atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
>> side effect of calling all other panic notifiers registered.
>>
>> So Sima suggested to move that to the generic panic code, and test all
>> panic notifiers with a dedicated debugfs interface.
>>
>> I can move that code to kernel/, but before doing that, I would like to
>> know if you think that's the right way to test the panic code.
> 
> One major event that happens before the panic notifiers is
> panic_other_cpus_shutdown(). This can cause special situations because
> CPUs can be stopped while holding resources (such as raw spin
> locks). And these are the situations that make it so tricky to have safe
> and reliable notifiers. If triggered from debugfs, these situations will
> never occur.
> 
> My concern is that the tests via debugfs will always succeed, but in the
> real world panic notifiers are failing/hanging/exploding. IMHO useful
> panic testing requires real panic'ing.

Yes, but for the drm panic, it's still useful to check that the output
is working (i.e. make sure the color format and the framebuffer address
are good). Also, I've reworked the debugfs patch so that I don't have to
call all panic notifiers. It's now per device, so you can trigger the
drm_panic handler on a specific GPU.
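
To illustrate the difference between walking the whole notifier chain and
triggering a single device, here is a rough userspace model in C. It is
only a sketch and not the actual patch: every name in it (model_chain,
model_notifier, model_chain_call_all, model_trigger_one, count_calls) is
hypothetical and none of them is a kernel API.

```c
#include <stddef.h>

/* Userspace model only: these types and functions are illustrative
 * stand-ins, not kernel APIs. */
typedef int (*panic_cb)(void *data);

struct model_notifier {
	panic_cb cb;                  /* handler to run */
	void *data;                   /* per-device context */
	struct model_notifier *next;  /* next entry in the chain */
};

struct model_chain {
	struct model_notifier *head;
};

/* Add a notifier at the head of the chain. */
static void model_chain_register(struct model_chain *c,
				 struct model_notifier *n)
{
	n->next = c->head;
	c->head = n;
}

/* Chain-wide call: every registered handler runs, wanted or not.
 * Returns the number of handlers that were invoked. */
static int model_chain_call_all(struct model_chain *c)
{
	int count = 0;

	for (struct model_notifier *n = c->head; n; n = n->next) {
		n->cb(n->data);
		count++;
	}
	return count;
}

/* Per-device trigger: only the selected handler runs. */
static int model_trigger_one(struct model_notifier *n)
{
	return n->cb(n->data);
}

/* Example handler: counts how many times it was invoked. */
static int count_calls(void *data)
{
	int *calls = data;

	(*calls)++;
	return 0;
}
```

With three notifiers registered, model_chain_call_all() runs all three,
while model_trigger_one() on one of them leaves the other two untouched,
which is the side effect the per-device debugfs entry avoids.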

> 
> For my printk panic tests I trigger unknown NMIs while booting with
> "unknown_nmi_panic". Particularly with Qemu this is quite easy and
> amazingly effective at catching problems. In fact, a recent printk
> series [0] fixed seven issues that were found through this method of
> panic testing.

Thanks for the tip. I used to test with "echo c > /proc/sysrq-trigger"
in the guest, but that's more permissive. I'm now testing with "virsh
inject-nmi", and drm_panic is still working.
> 
>> The second question is how to simulate a panic context in a
>> non-destructive way, so we can test the panic notifiers in CI, without
>> crashing the machine.
> 
> I'm wondering if a "fake panic" can be implemented that quiesces all the
> other CPUs via NMI (similar to kdb) and then calls the panic
> notifiers. And finally releases everything back to normal. That might
> produce a fairly realistic panic situation and should be fairly
> non-destructive (depending on what the notifiers do and how long they
> take).
> 
>> The worst case for a panic notifier, is when the panic occurs in NMI
>> context, but I don't know how to simulate that. The goal would be to
>> find early if a panic notifier tries to sleep, or do other things that
>> are not allowed in a panic context.
> 
> Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
> the fake panic instead?
> 
> John Ogness
> 
> [0] https://lore.kernel.org/lkml/20240207134103.1357162-1-john.ogness@linutronix.de
> 

Best regards,

-- 

Jocelyn
