[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output

Thu Mar 14 02:51:13 UTC 2019

[[ Adding Sebastian, who is quite experienced in intricate
   locking situations due to daily PREEMPT_RT work.. ]]

On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
> On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote:
> > On 2019-03-12, Ahmed S. Darwish <darwish.07 at gmail.com> wrote:
> > > On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
[……]
> > > >
> > > > If we aim for perfect this should be a trylock still, maybe using
> > > > our own device list.
> > > >
> > >
> > > I definitely agree here.
> > >
> > > The lock may already be locked either by a stopped CPU, or by the
> > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > printing CPU).
> > >
> > > This is why it's very common for example in serial consoles, which
> > > are usually careful about re-entrance and panic contexts, to do:
> > >
> > >   xxxxxx_console_write(...) {
> > > 	if (oops_in_progress)
> > > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > > 	else
> > > 		spin_lock_irqsave(&port->lock, flags);
> > >   }
> > >
> > > I'm quite positive we should do the same for panic drm drivers.
> >
> > This construction will continue, even if the trylock fails. It only
> > makes sense to do this if the driver has a chance of being
> > successful. Ignoring locks is a serious issue. I personally am quite
> > unhappy that the serial drivers do this, which was part of my motivation
> > for the new printk design I'm working on.
> >
> > If the driver is not capable of doing something useful on a failed
> > trylock, then I recommend just skipping that device. Maybe trying it
> > again later after trying all the devices?
>
> Ah yes missed that. If the trylock fails anywhere, we must bail out.
>
> Not sure retrying is useful, my experience from at least drm is that
> either you're lucky, and drm wasn't doing anything right when the machine
> blew up, and then the trylocks will all go through. Or you're unlucky, and
> most likely that means drm itself blew up, and no amount of retrying is
> going to help. I wouldn't bother.
>

Ack, retrying most probably won't help.

I see that, at least in x86:

  kernel/panic.c::panic()
    => kernel/panic.c::smp_send_stop()
      => arch/x86/incluse/smp.h::smp_send_stop(0)
        => arch/x86/kernel/smp.c::native_stop_other_cpus(wait)
          /*
           * Don't wait longer than a second if the caller
           * didn't ask us to wait.
           */
           timeout = USEC_PER_SEC;
           while (num_online_cpus() > 1 && (wait || timeout--))
                   udelay(1);

So the panic()-ing core wait by default for _a full second_ until
all other cores finish their non-preemptible regions. If that did
not work out, it even tries more aggressively with NMIs.

So if one of the other non panic()-ing cores was holding a DRM
related lock through spin_lock_irqsave() for more than one second,
well, it's a bug in the DRM layer code anyway..      [A]

Problem is, what will happen if the non panic()-ing core was
holding a DRM lock through barebone spin_lock()? Interrupts are
enabled there, so the IPI will be handled, and thus that core will
effectively die while the lock is forever held :-(   [B]

But, well, yeah, I don't think there's a solution for the [B] part
except by spin_trylock() the fewest amount of locks possible, and
bailing out if anyone of them is held.

For reference, I asked over IRC why not just resetting GPU state,
but it's not easy as it sounds to people outside of the gfx world:

  [abridged]
  <darwi>  So it seems Windows is forcing driver writers to
           implement a gpu reset method [*]
  <darwi>  since Windows vista
  <danvet> yeah
  <danvet> not going to do that from panic
  <danvet> there's a non-zero chance that gpu reset takes down
           the system with at least lots of hw/driver combos
  <danvet> I think all our drivers also have gpu reset support
  <danvet> if something is still rendering while we panic, mea culpa
  <danvet> only way around that is switching planes
  <danvet> which we tried with the old approach, and that's just
           too hard
  <danvet> you end up running a few 100kloc of code worst case
  <danvet> no way to audit/validate that for panic context
  <danvet> but the rendering shouldn't really matter
  <anholt> yeah, actually modesetting in a panic seems hopeless.
  <danvet> at least for modern compositors
  <danvet> anholt, yeah I killed that idea
  <danvet> so they only render to the backbuffer
  <danvet> and we won't show that with a panic
  <anholt> yep.  just draw on a plane.  for everything but
           uncomposited x11, it'll already be idle.
  <danvet> so only really an issue on boot splash and frontbuffer x11
  <danvet> and there a few cursors/characters drawn won't really
           matter in most cases
  <danvet> anholt, yeah that's the idea, we pick the currently
           showing plane and have a simple draw_xy functions which
	   knows how to draw bright/dark pixels for any drm_fourcc
  <danvet> so you can see the oops even if it's yuv or whatever
  <anholt> nice

Thanks!

[*] DXGKDDI_RESETFROMTIMEOUT callback function:
    https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/d3dkmddi/nc-d3dkmddi-dxgkddi_resetfromtimeout

--
darwi
http://darwish.chasingpointers.com