[PATCH v2 0/3] drm: Add panic handling

Sun Mar 17 23:06:13 UTC 2019

=> Now that the dust has settled, here's a summary of this huge
   50-email thread (thanks Daniel, Noralf, John, everyone!).

=> Parts of this document are a direct rewording of Daniel's replies,
   so I took the liberty of adding a Co-developed-by tag here..

=> This is only a summary, and _not_ an official patch submission.
   It's now Show-me-the-code time ;-)

Subject: [PATCH] Documentation: gpu: Add initial DRM panic design

Co-developed-by: Daniel Vetter <daniel at ffwll.ch>
Signed-off-by: Ahmed S. Darwish <darwish.07 at gmail.com>
---
 Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 Documentation/gpu/drm-panic-design.rst

diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
new file mode 100644
index 000000000000..ba306193f347
--- /dev/null
+++ b/Documentation/gpu/drm-panic-design.rst
@@ -0,0 +1,124 @@
+
+========================
+A DRM-based panic viewer
+========================
+
+The Linux Kernel currently contains the necessary plumbing for viewing
+a kernel panic log using DRM-based kmsg dumpers, even if the system is
+currently running a graphical session (e.g. wayland).
+
+.. _drm_panic_design:
+
+Implementation design
+=====================
+
+Code pathes running in a panic context face several constraints:
+
+1. Code must be fully synchronous: interrupts are disabled
+2. Dynamic memory allocations are not allowed
+3. Cannot acquire any mutexes: atomic context, due to 1.
+4. Cannot acquire any spin locks: potential spinning-forever risk
+
+For the *DRM* panic code, the extra conditions below apply:
+
+5. Code must only trylock relevant DRM subsystem locks
+6. If any trylock operation fails, the code *must not* go through
+7. All rendering is done on the GPU's "current display buffer" used
+   at the time of panic(): no HW programming is done at all.
+8. The code must be non-intrusive, and *must not* affect any other
+   panic handling mechanism (netconsole, ramoops, kexec, etc.)
+
+Rationale follows.
+
+
+Spin locks
+----------
+
+Acquiring a spin lock in a panic() context is potentially lethal:
+the lock might've been already acquired, _permanently_, by another
+core that is now fully shut down through an IPI from the panic()-ing
+core.
+
+Moreover, at least on x86, the first target architecture for this
+work, the panic()-ing core wait by default for *a full second* until
+all other cores finish their non-preemptible regions and terminate.
+If that did not work out, it even tries more aggressively with NMIs.
+
+So if the other non panic()-ing cores was holding a DRM-related lock
+through spin_lock_irqsave() for more than one second, then it's a
+bug in the DRM layer code. Thus, the DRM panic viewer cannot do
+anything and should just exit.  [A]
+
+What if the non panic()-ing core was holding a DRM lock through
+barebone spin_lock()? Interrupts are enabled there, so the IPI will be
+handled, and thus that core will effectively die while the lock is
+*forever held*.  [B]
+
+
+Trylocking
+----------
+
+The DRM panic code always *tries* to acquire the *minimum relevant
+set* of DRM related locks, through the basic :c:func:`spin_trylock()`
+mechanism.
+
+From case [A] and case [B] above, if the trylock operation fails,
+there's no point in retrying it multiple times: the relevant locks
+are in a broken and unrecoverable state anyway.
+
+Moreover, The panic code cannot also just ignore the DRM locks and
+force its way through: a broken non-preemptible DRM region implies
+either invalid SW state (e.g. broken linked list pointers), or a GPU
+in an unknown HW state.
+
+A GPU in an unknown HW state is quite dangerous: it has access to the
+full system memory, and if poked incorrectly, there's a really good
+chance it can kill the entire machine.
+
+
+GPU hardware access
+-------------------
+
+In the current state, a full GPU reset, modesetting, or even disabling
+GPU planes, is not doable under a panic() context: it implies going
+through a potentially huge set of DRM call-chains that cannot be
+sanely verified against the :ref:`drm_panic_design` requirements
+(e.g. memory allocations, spinlocks, etc.).
+
+The current approach is simple: run the minimal amount of code
+necessary to draw pixels onto the current scanout buffers. Instead
+of disabling GPU planes, the biggest visible rectangle is just picked.
+
+*Usually* there should be a main window that is big enough to show the
+oops.
+
+
+CI testing
+----------
+
+One of the things that killed earlier linux DRM panic handling efforts,
+beside getting into deep DRM call-chains that cannot be verified, was
+that it couldn't be tested except with real oopses.
+
+The first set of bug reports was whack-a-molde kind of bugs where the
+oops displayed was caused by the DRM panic handler itself instead of
+the real oops causing the panic.
+
+Thus, the :ref:`drm_panic_design` requirements was created. Moreover:
+
+  - Special hooks are added at the spin_lock() level to complain
+    loudly if a spin lock operation was tried under the DRM panic
+    context. This could be easily noticed/reported by CI testing.
+
+  - *Non-destructive* testing of the DRM panic code, emulating a
+    real panic path context as much as possible (e.g. by disabling
+    irqs and enabling the spin lock hooks earlier mentioned), is
+    created. This is necessary for heaviling testing the DRM panic
+    code through `CI machines <https://lwn.net/Articles/735468/>`_.
+
+
+Supported drivers
+=================
+
+* Intel i915-compatible cards


--
darwi
http://darwish.chasingpointers.com