[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output

Wed Mar 13 03:53:14 UTC 2019

Hi,

On Tue, Mar 12, 2019 at 11:58:10AM +0100, Daniel Vetter wrote:
> On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> >
> >
> > Den 11.03.2019 20.23, skrev Daniel Vetter:
> > > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
[...]
> > >> +}
> > >> +
> > >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> > >> +	.dump = drm_panic_kmsg_dump,
> > >> +	.max_reason = KMSG_DUMP_PANIC,
> > >> +};
> > >> +
> > >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> > >> +					  const char __user *user_buf,
> > >> +					  size_t count, loff_t *ppos)
> > >> +{
> > >> +	unsigned long long val;
> > >> +	char buf[24];
> > >> +	size_t size;
> > >> +	ssize_t ret;
> > >> +
> > >> +	size = min(sizeof(buf) - 1, count);
> > >> +	if (copy_from_user(buf, user_buf, size))
> > >> +		return -EFAULT;
> > >> +
> > >> +	buf[size] = '\0';
> > >> +	ret = kstrtoull(buf, 0, &val);
> > >> +	if (ret)
> > >> +		return ret;
> > >> +
> > >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> > >> +	wmb();
> > >> +
> > >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> > >> +
> > >> +	if (val == 0) {
> > >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> > >> +		kmsg_dump(KMSG_DUMP_OOPS);
> > >> +	} else if (val == 1) {
> > >> +		char *null_pointer = NULL;
> > >> +
> > >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> > >> +		*null_pointer = 1;
> > >> +	} else {
> > >> +		return -EINVAL;
> > >> +	}
>> > >
> > > This isn't quite what I had in mind, since it still kills the kernel (like
> > > sysrq-trigger).
> >
> > If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> > And it doesn't taint the kernel either.
>
> Ah I didn't realize that. Sounds like a good option to keep.
>
> > > Instead what I had in mind is to recreate the worst
> > > possible panic context as much as feasible (disabling interrupts should be
> > > a good start, maybe we can even do an nmi callback), and then call our
> > > panic implementation. That way we can test the panic handler in a
> > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > tainting).
> >
> > The interrupt case I can do, nmi I have no idea.
>
> I just read the printk nmi code again and it looks like there's now even
> more special handling for issues happening in nmi context, so we should
> never see an oops from nmi context. So local_irq_disable() wrapping should
> be good enough for testing.
> -Daniel
>

Hmmm, John is on a mission to change that soon:

  - https://lore.kernel.org/lkml/20190212143003.48446-1-john.ogness@linutronix.de

  - https://lwn.net/Articles/780556

A v2 is on the way, and there's a lot of interest from different
groups, including RT, to get that in.

So I guess it would be nice to cover NMI contexts from the start,
not necessarily from a non-destructive CI testing perspective, but
at least while doing normal tests and code reviews.

(Is it possible to do non-destructive NMI tests? I don't know..)

thanks!

--
darwi
http://darwish.chasingpointers.com