[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output

Thu Mar 14 04:45:32 UTC 2019

Hi,

On Wed, Mar 13, 2019 at 09:35:05AM +0100, Daniel Vetter wrote:
> On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote:
> > On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> > > Den 11.03.2019 20.23, skrev Daniel Vetter:
[……]
> > > >
> > > > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > > > think that's good enough. Comment to that effect would be good e.g.
> > > >
> > > > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > > > 	 * assume still works */
> > > >
> > > > If we aim for perfect this should be a trylock still, maybe using our own
> > > > device list.
> > > >
> >
> > I definitely agree here.
> >
> > The lock may already be locked either by a stopped CPU, or by the
> > very same CPU we execute panic() on (e.g. NMI panic() on the
> > printing CPU).
> >
> > This is why it's very common for example in serial consoles, which
> > are usually careful about re-entrance and panic contexts, to do:
> >
> >   xxxxxx_console_write(...) {
> > 	if (oops_in_progress)
> > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > 	else
> > 		spin_lock_irqsave(&port->lock, flags);
> >   }
> >
> > I'm quite positive we should do the same for panic drm drivers.
>
> Yeah Ideally all the locking in the drm path would be trylock only.
>
> I wonder whether lockdep could help us validate this, with some "don't
> allow anything except trylocks in this context". It's easy to audit the
> core code with review, but drivers are much tougher. And often end up with
> really deep callchains to get at the backing buffers.
[……]
> > > > Instead what I had in mind is to recreate the worst
> > > > possible panic context as much as feasible (disabling interrupts should be
> > > > a good start, maybe we can even do an nmi callback), and then call our
> > > > panic implementation. That way we can test the panic handler in a
> > > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > > tainting).
> > >
> > > The interrupt case I can do, nmi I have no idea.
> > >
> >
> > I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> > would be a nice non-destructive test-case emulation.
>
> See above, if we can somehow emulate "all locks are held, only allow
> trylock" with lockdep that would be great too.
>

Hmmmm, to prototype things faster, I'll do it as a hook:

   spin_lock(spinlock_t *lock)
   {
        might_spin_forever_under_panic_path();
        ...
   }

Just like how mutexes and DEBUG_ATOMIC_SLEEP do it today:

   void mutex_lock(struct mutex *lock)
   {
        might_sleep();
        ...
   }

Hopefully the locking maintainers can accept something like this.
If not, let's play with lockdep (which is a bit complex).

===> Thanks to Sebastian for suggesting this!

> Plus nmi context, once/if that somehow becomes relevant.
>
> The thing that killed the old drm panic handling code definitely was that
> we flat out couldnt' test it except with real oopses. And that's just
> whack-a-mole and bug reporter frustration if you first have a few patch
> iterations around "oh sry, it's still some oops in the panic handler, not
> the first one that we're seeing" :-/
>

Oh, that's a critical point:

   full data > some data > no data > invalid/useless data

First impressions for the feature will definitely count. Hopefully
the hooks above and some heavy DRM CI testing __beforehand__ can
improve things before publishing to a wider audience..

Cheers,

--
darwi
http://darwish.chasingpointers.com