[Intel-gfx] i915: severe lag after resume (was Re: i915: hotplug events gone wild)

Wed Feb 10 15:15:19 CET 2010

I did some digging with gdb and perf.  Here's what's happening:

1. Something causes i915 to detect and report a flood of bogus hotplug events.
2. Fedora has a patched version of the "intel" driver that detects
uevents and calls RRGetInfo.
3. RRGetInfo eventually issues a bunch of GETCONNECTOR calls.
4. i915 reprobes the connector on each GETCONNECTOR call, and the LVDS
probe is slow.  X is blocked while this is happening, and it's slow
enough and frequent enough that it makes X unusable.

I think that there are a few "bugs" here:

First, the root cause: why are we detecting a flood of hotplug events?
They seem to come from HDMI or SDVO, neither of which I have.  I don't
really understand the connector initialization code -- SDVO and HDMI
seem to share some hardware (sometimes, maybe), and they share a bunch
of hotplug bits, yet I detect:

card0-DisplayPort-1
card0-DisplayPort-2
card0-DisplayPort-3
card0-HDMI Type A-1
card0-HDMI Type A-2
card0-LVDS-1
card0-VGA-1

I actually have one DP plug, one VGA plug, and one LVDS device.  No
HDMI on my laptop.

There's some funny business in at least intel_dp_detect, which fiddles
with the hotplug enable bits.  And there's no locking in any of the
detect code.

Second, the whole structure of hotplug in the KMS code feels like a
throwback to UMS.  We have an interrupt that does essentially nothing
except notifying userspace that an event happened (and training the DP
link), and then the callbacks from drm do the actual detection and
mode enumeration on every GETCONNECTOR ioctl and sysfs status read.
Is there any good reason that the driver doesn't maintain its own
internal idea of what's connected, update it on hotplug, and just
return it in response to userspace queries?  This would make
GETCONNECTOR fast and make it possible for the driver to tell what
changed in any given hotplug event.

For an example of why I think the current code is bad, try compiling
this (with -lX11 -lXrandr):

hose_x.c

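#include <stdio.h>
#include <stdlib.h>

#include <X11/Xlib.h>
#include <X11/extensions/Xrandr.h>

int main(int argc, char *argv[])
{
  Display *disp;
  XRRScreenResources *res;
  int count, i;

  disp = XOpenDisplay(NULL);

  if (!disp) {
    fprintf(stderr, "Cannot open display\n");
    return 1;
  }

  /* Number of queries to issue; defaults to 1. */
  count = (argc == 2 ? atoi(argv[1]) : 1);

  /* Each call asks the server for the current screen resources. */
  for (i = 0; i < count; i++) {
    res = XRRGetScreenResources(disp, DefaultRootWindow(disp));
    if (res)
      XRRFreeScreenResources(res);
  }

  XCloseDisplay(disp);
  return 0;
}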

and running hose_x <some large number>.  Then try to move the mouse.
That's *exactly* what my laptop feels like when it's badly hosed.

Or run: while true; do ./hose_x 1; sleep 0.5; done  and try to use the
computer.  That's pretty much what my laptop feels like when it's a
little bit hosed.

IMHO hose_x should take CPU time but shouldn't cause lag.  perf shows
almost all the time in i2c code.

--Andy

On Mon, Feb 8, 2010 at 7:25 AM, Andrew Lutomirski <luto at mit.edu> wrote:
> On Thu, Feb 4, 2010 at 1:11 PM, Jesse Barnes <jbarnes at virtuousgeek.org> wrote:
>> On Thu, 4 Feb 2010 11:41:48 -0500
>> Andrew Lutomirski <luto at mit.edu> wrote:
>>
>>> On Sun, Jan 31, 2010 at 9:54 PM, Andrew Lutomirski <luto at mit.edu>
>>> wrote:
>>> > On Sun, Jan 31, 2010 at 8:03 PM, ykzhao <yakui.zhao at intel.com>
>>> > wrote:
>>> >> On Sun, 2010-01-31 at 19:49 +0800, Andrew Lutomirski wrote:
>>> >>> On Sat, Jan 30, 2010 at 10:02 PM, Andrew Lutomirski
>>> >>> <luto at mit.edu> wrote:
>>> >>> > [I posted this bug earlier with a terrible description as
>>> >>> > "resume lagginess and other problems."  Here it is again with a
>>> >>> > better bug report.]
>>> >>> >
>>> >>> > I'm running 2.6.33-rc5 (plus some wireless-testing stuff, but
>>> >>> > I've seen this problem on a variety of 2.6.33-rc? kernels).
>>> >>> >  Every now and then, X starts to lag badly on my GM45 laptop.
>>> >>> >  When this happens, I usually see a bunch of events in
>>> >>> > udevmonitor.  Running with drm.debug=0x02 (and the patch below
>>> >>> > to keep the log under control), I see tons of messages like
>>> >>> > this:
>>> >>>
>>> >>> I triggered it again.  This time, the messages looked like
>>> >>> (drm.debug=3 from a different VT to avoid all the hotplug stuff
>>> >>> running off the screen, and running a different debugging hack --
>>> >>> see all the way at the bottom):
>>> >>>
>>> >>> [ 1324.285057] [drm:i915_driver_irq_handler], hotplug event
>>> >>> received, stat 0x28200000, mask 0x38000800
>>> >>
>>> >> From the stat value it seems that this is related to the HDMI
>>> >> hotplug.
>>> >> Will you please confirm whether the HDMI is plugged/unplugged in your
>>> >> test?
>>> >
>>> > This is a Lenovo X200s, and it doesn't have HDMI.  I have LVDS on
>>> > and everything else (i.e. VGA and the docking station, which has a
>>> > DP port) disconnected.
>>> >
>>>
>>> I don't think this is a hotplug bug.  I don't remember seeing it back
>>> in early January (i.e. before my laptop died and got its motherboard
>>> replaced).  The bug is present in 2.6.32.7 and in 2.6.33-rc6.
>>>
>>> I think it's a bad interaction between some kind of idle code and
>>> suspend/resume.  I can't trigger it without suspending and resuming at
>>> least once after reboot, and I can't make it go away completely once
>>> it starts triggering.
>>>
>>> Once the bug starts, it seems to manifest in one of two forms.
>>>
>>> Bug form 1 (the bad one): X lags so badly that I can hardly do
>>> anything.  The mouse seems to update only twice a second or so.
>>> Compositing gets so slow that I can't use any programs.  I can switch
>>> VTs and use the console, but switching back to X doesn't fix it.
>>> udevmonitor shows a huge flood of events (4/second, maybe).  Once this
>>> starts, it keeps happening for quite a while or until I kill X.
>>> Killing X seems to switch me to bug form 2.
>>>
>>> Bug form 2 (the less bad one): When X is idle, my mouse seems to skip
>>> once or twice a second.  When X is not idle (e.g. I'm dragging a
>>> window), everything is fine.  intel_gpu_top seems to suppress the bug
>>> and shows nothing useful.  udevmonitor shows a slow stream of hotplug
>>> events.
>>>
>>> In either case, clearing the high bits of PORT_HOTPLUG_EN using
>>> intel_reg_write (i.e. writing 0x320 to 0x61110) stops the hotplug
>>> events but *does not* fix the lag.  (It is more reliable in stopping
>>> the hotplug events if I patch the dp detect code to not change the
>>> high bits back.)
>>>
>>> These problems seem to start one minute or so after resuming.  They're
>>> bad enough that suspend/resume is almost unusable.
>>>
>>> Userspace is F12.
>>>
>>> If it helps at all, I started noticing this bug at the same time that
>>> I noticed that writing 1 to reset in sysfs breaks graphics.  (It used
>>> to work.)
>>
>> So if you use powersave=0 you don't see the lag?  You could try
>> increasing the idle timer timeout; it's 1000ms now, you could make it
>> 5000ms or so, but I don't think we do anything when transitioning
>> to/from idle that would take long enough to cause huge lag...
>>
>
> I tested a bunch more combinations.
>
> powersave=0 does not prevent the lag.
>
> The lag is present on 2.6.32.1, 2.6.32.7, and 2.6.33-rc6.  It seems
> easier to trigger on 2.6.33-rc6 and 2.6.32.7, but I'm not sure exactly
> what triggers it in the first place.
>
> I triggered it once without suspending and resuming.
>
> I watched 'top' once, and Xorg's CPU usage stayed at ~50% (i.e. one
> core), but if I dragged a window slowly, the lag stopped and CPU went
> *down*.  intel_gpu_top shows nothing that appears interesting.
>
> Manually clearing the low bit of PWRCTXA did not seem to stop the lag
> once it started (I figured that would be worth a try).
>
> There is never any lag on a framebuffer console, even if I do see
> hotplug events.
>
> Any more ideas?  This is making my laptop rather difficult to use.
>
> --Andy
>