[Intel-gfx] [PATCH 0/8] Detect and deal with Interrupt 'Storms' from noisy Hotplug Lines.

Fri Jan 11 21:34:08 CET 2013

On Thu, Jan 10, 2013 at 10:02:38AM -0500, Egbert Eich wrote:
> Despite the many attempts to fix the issue with noisy hotplug interrupt lines
> we are still seeing systems that suffer from this:
> Recently we encountered a rather large scale installation of Q35 systems
> which was hit by this issue rather severely: It seemed as if not all machines
> of the same model were hit equally badly; in the worst case hotplug
> interrupt noise caused several 1000 interrupts / s. Those machines would not
> even boot, instead the interrupt handler and the scheduled workers would keep
> the CPU so busy that eventually the watchdog would kick in and issue an NMI.
> Other machines only received several 10s to 100s of interrupts per sec - those
> machines would run properly - just with an excessive system load.
> More thorough investigations seemed to indicate that this condition
> only happens at certain video modes.
> 
> On another system - a laptop - a hotplug interrupt 'storm' occurred when 
> it was charging and the batteries were at certain charge levels. While 
> the system was still running fine its load was high enough that the user
> noticed from the fan noise that a problem existed.
> The latter system had a Sandybridge chipset, thus a totally different 
> generation from the former.
> 
> All those cases seemed to have been caused by cross talk on badly routed 
> hotplug signal lines (or voltage instabilities).
> This led to the conclusion that instead of trying to work around these
> 'storms' for each individual system, there should be a generic way to detect 
> such a condition and take appropriate action:
> 
> This patch series implements a hotplug 'storm' detection, disables the
> respective interrupt for the hotplug pin when this condition is detected
> and reverts to periodic output polling on the affected connector.
> After a grace period of 2 minutes it will reenable hotplug on the affected
> line. This will take care of cases in which this condition is only temporary.
> Should the 'storm' condition persist, this cycle will start over again.
> 
> To implement this some rearrangements in the code were required:
> - The interrupt status bit which signals a hotplug needed to be recorded
>   for each connector.
> - The interrupt enable functions needed to be separate, also they need 
>   to be able to enable interrupts for each hotplug line independently.

Nice work, and we've known for quite a while that we need this. But
unfortunately we've not yet gotten around to implementing something. Some
high-level comments on how I think this should best be handled:

- imo dev_priv->hotplug_supported_mask should die - it leaks platform
  specific irq magic from i915_irq.c into every connector/encoder. And we
  have had the bugs and confusions to prove that it's not a good idea. I
  think it'd be better if we add a new HOTPLUG_PIN_FOO enum that encoders
  register interest in, and the platform code in i915_irq.c then maps
  from/to that. On a quick check we have hotplug pins for CRT, TV,
  SDVO_B&C and PORT_A-D (for DP&HDMI).

  Also note that on PCH_SPLIT platforms port A is not in the same
  register, further platforms will make an even cuter mess of this ...
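
  A minimal sketch of such a pin enum and a per-platform mapping table,
  purely for illustration. All names below (hpd_pin, hpd_platform_map,
  example_map and the two helpers) are hypothetical and not taken from the
  i915 code, and the status bit positions are invented, not real register
  layouts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical platform-independent hotplug pins that encoders register for. */
enum hpd_pin {
	HPD_PIN_NONE = 0,
	HPD_PIN_CRT,
	HPD_PIN_TV,
	HPD_PIN_SDVO_B,
	HPD_PIN_SDVO_C,
	HPD_PIN_PORT_A,
	HPD_PIN_PORT_B,
	HPD_PIN_PORT_C,
	HPD_PIN_PORT_D,
	HPD_PIN_COUNT
};

/* Per-platform table: which irq status bit belongs to which pin (0 = none). */
struct hpd_platform_map {
	uint32_t status_bit[HPD_PIN_COUNT];
};

/* Example table; the bit positions are made up for this sketch. */
static const struct hpd_platform_map example_map = {
	.status_bit = {
		[HPD_PIN_CRT]    = 1u << 0,
		[HPD_PIN_SDVO_B] = 1u << 1,
		[HPD_PIN_SDVO_C] = 1u << 2,
		[HPD_PIN_PORT_B] = 1u << 3,
		[HPD_PIN_PORT_C] = 1u << 4,
		[HPD_PIN_PORT_D] = 1u << 5,
	},
};

/* irq side: translate a raw status word into a bitmask of enum hpd_pin. */
static uint32_t hpd_pins_from_status(const struct hpd_platform_map *map,
				     uint32_t status)
{
	uint32_t pins = 0;
	int pin;

	for (pin = HPD_PIN_NONE + 1; pin < HPD_PIN_COUNT; pin++)
		if (status & map->status_bit[pin])
			pins |= 1u << pin;
	return pins;
}

/* enable side: look up the platform bit for one pin (0 if unsupported). */
static uint32_t hpd_status_bit_for_pin(const struct hpd_platform_map *map,
				       enum hpd_pin pin)
{
	assert(pin > HPD_PIN_NONE && pin < HPD_PIN_COUNT);
	return map->status_bit[pin];
}
```

  With a table like this, encoders would only ever see enum hpd_pin values,
  and a platform where port A lives in a different register could simply
  carry a second table.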

- I think the hpd pin should be tracked in the encoder, not in the
  connector. The only encoders where there's not a 1:1 relationship (sdvo
  and ddi on hsw) want it there. Also, we already have the ->hot_plug
  callback in the encoder, which will be useful for later extensions.

- Since some encoders share the same hpd pin (HDMI&DP on pre-hsw) I think
  we should keep the noise statistic data in the device's dev_priv
  somewhere in an array, with one set for each hpd pin from the enum above.
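
  A rough sketch of per-pin noise statistics kept in one array, combined
  with the detect / disable / re-enable cycle from the cover letter. The
  struct and function names are hypothetical, and the window length and
  threshold are assumed values; only the 2 minute grace period comes from
  the series description. Time is passed in explicitly so the logic can be
  exercised outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HPD_PIN_COUNT     9               /* assumed size of the pin enum */
#define STORM_WINDOW_MS   1000u           /* assumed detection window */
#define STORM_THRESHOLD   5u              /* assumed max irqs per window */
#define STORM_REENABLE_MS (2u * 60 * 1000) /* 2 minute grace period */

/* One set of statistics per hpd pin, shared by all encoders on that pin. */
struct hpd_stats {
	uint64_t window_start_ms;
	unsigned int count;
	bool disabled;            /* irq masked, pin is being polled instead */
	uint64_t reenable_at_ms;
};

struct hpd_state {
	struct hpd_stats pin[HPD_PIN_COUNT];
};

/*
 * Account one hotplug irq on a pin. Returns true if the event should be
 * handled normally, false if the pin is (or just became) storm-disabled.
 */
static bool hpd_irq_event(struct hpd_state *st, unsigned int pin,
			  uint64_t now_ms)
{
	struct hpd_stats *s;

	assert(pin < HPD_PIN_COUNT);
	s = &st->pin[pin];

	if (s->disabled)
		return false;

	/* Start a new window on the first irq or once the old one expired. */
	if (s->count == 0 || now_ms - s->window_start_ms > STORM_WINDOW_MS) {
		s->window_start_ms = now_ms;
		s->count = 1;
		return true;
	}

	if (++s->count > STORM_THRESHOLD) {
		s->disabled = true;
		s->reenable_at_ms = now_ms + STORM_REENABLE_MS;
		return false;
	}
	return true;
}

/*
 * Called from the periodic poll work. Returns true while the pin still
 * needs polling; re-arms the irq once the grace period is over.
 */
static bool hpd_poll_tick(struct hpd_state *st, unsigned int pin,
			  uint64_t now_ms)
{
	struct hpd_stats *s;

	assert(pin < HPD_PIN_COUNT);
	s = &st->pin[pin];

	if (!s->disabled)
		return false;
	if (now_ms >= s->reenable_at_ms) {
		s->disabled = false;
		s->count = 0;
		return false;
	}
	return true;
}
```

  Because the counters are indexed by pin rather than by connector, HDMI
  and DP encoders sharing a pin feed the same statistics.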

- In 3.8 the drm hpd/polling helpers are much improved and don't randomly
  poll everything any more. So if a hpd connector isn't marked as
  OUTPUT_POLL, it won't ever get polled. Which means if you disable the hpd
  irq for it, we need to have our own poll work to do that for us. The
  long-term goal I have is to pimp the encoder->hot_plug callback also for
  this case, to avoid re-running the connector detect code on unrelated
  outputs (which can sometimes cause havoc).

  Eventually I want a hpd interrupt to only run the ->hot_plug callbacks
  on encoders which are interested in that signal, hence this slight
  overkill ... Ofc, that requires that we move a lot of the ->detect logic
  into ->hot_plug, but that's the only way to do sane EDID cache and
  similar things on outputs where hpd should work (DP/HDMI).

- The math buff in me would like hpd storms to gracefully degrade into
  polling at 10s or so. We could achieve that with irq source masking and
  scheduling the work item to do the hotplug handling with an (increasing)
  delay if there's too many interrupts from a given hpd pin. But that
  requires that we can mask hotplug interrupts properly, which seems to be
  impossible with the PORT_HOTPLUG regs on gmch/SoC platforms :( So I
  think your logic is nice enough ;-)

Yours, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch