[Intel-gfx] X hang with quick VT switches

Thu Dec 4 03:44:23 PST 2014

At Thu, 4 Dec 2014 11:21:47 +0000,
Chris Wilson wrote:
> 
> On Thu, Dec 04, 2014 at 11:53:05AM +0100, Takashi Iwai wrote:
> > At Wed, 3 Dec 2014 18:31:45 +0000,
> > Chris Wilson wrote:
> > > 
> > > On Wed, Dec 03, 2014 at 03:45:35PM +0100, Takashi Iwai wrote:
> > > > Hi,
> > > > 
> > > > while checking the reported bug about VT switch hang on openSUSE 13.2,
> > > > I also could reproduce a similar issue as reported: namely, X hangs
> > > > when repeatedly switching VT quickly.
> > > > 
> > > > For example, running the following on KDE results in the stall of X.
> > > > 
> > > > 	% for i in $(seq 1 100); do chvt 1; chvt 7; done
> > > > 
> > > > Looking at the sysrq-t output, it stalls at drm_read().  After
> > > > putting some debug prints in the event handling code, it shows:
> > > > 
> > > >  drm_queue_vblank_event event_space=4064
> > > >  send_vblank_event event_space=4064
> > > >  drm_poll ENTER event_space=4064
> > > >  drm_poll mask=0x41 event_space=4064
> > > >  drm_poll ENTER event_space=4064
> > > >  drm_poll mask=0x41 event_space=4064
> > > >  drm_read ENTER event_space=4064
> > > >  drm_read total=32 event_space=4096
> > > >  drm_poll ENTER event_space=4096
> > > >  drm_poll mask=0x0 event_space=4096
> > > >  drm_read ENTER event_space=4096
> > > >  drm_read ENTER event_space=4096
> > > >  drm_read ENTER event_space=4096
> > > > 
> > > > So, after a vblank event, two poll calls succeeded, followed by one
> > > > drm_read().  After that, there was one poll call without an event,
> > > > followed by three(!) drm_read() calls.  The last three drm_read()
> > > > calls never returned, thus X stalled.  So, this looks like a race or
> > > > a refcount issue somewhere.
> > > 
> > > The key question is how did you get 3 calls to drm_read that each didn't
> > > return? The only place where we call drm_read without first doing a poll
> > > is in the WakeupHandler with the drm fd flagged for reads. This is
> > > broken in ZaphodHeads as the drm fd is not O_NONBLOCK without
> > > 
> > > commit bd008e5b2953186fc0c6633a885ade95e7043800
> > > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > > Date:   Tue Oct 7 14:13:51 2014 +0100
> > > 
> > >     drm: Implement O_NONBLOCK support on /dev/dri/cardN
> > > 
> > > I assume that isn't the case as I expect you would have mentioned using
> > > ZaphodHeads.
> > 
> > I took a look back at drm_read() code again, and I found that the
> > function doesn't care about O_NONBLOCK at all.  (And there is a memory
> > leak, too.)
> > 
> > So I added the support for O_NONBLOCK, and the problem seems
> > resolved.
> > 
> > Although this is not the right "fix" (the caller side should be fixed),
> > it would be good to have anyway.  I'm going to send patches for review
> > to the dri-devel ML, as it's not i915 specific.
> 
> I disagree. drm has claimed to support O_NONBLOCK since its inception,
> but the implementation was buggy.

The nonblock read is obviously buggy.  If the current implementation
is intentional, then the nonblock flag is somehow misused...

> However, I don't think there is a case
> in non-ZaphodHeads where we use read() without first select/poll
> reporting that there is something to use (and the problem with
> ZaphodHeads is that we have two screens that share the same drm fd
> without clearing the select read flags... hmm)

In my case, I'm using a single screen, so this can't be the cause.
My rough guess is that this isn't about a missing poll but rather a
race between poll and read, or between two reads.  That would explain
why my patch worked.
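
To illustrate the kind of race I mean, here is a minimal userspace
sketch.  It is not the X server code: the helper names are made up, and
any pollable fd (a pipe, for instance) can stand in for the drm fd.  The
point is only that with O_NONBLOCK set, a read() that loses the race to
another reader returns EAGAIN instead of sleeping forever.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Switch an already-open fd to non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: read pending events without ever sleeping.
 * Returns bytes read, 0 if nothing is pending, -1 on a real error.
 */
static ssize_t read_pending_events(int fd, void *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;
	int ready;

	assert(len > 0);	/* a zero-length read would look like "no events" */

	/* Zero timeout: only ask whether data is ready right now. */
	ready = poll(&pfd, 1, 0);
	if (ready < 0)
		return errno == EINTR ? 0 : -1;
	if (ready == 0 || !(pfd.revents & POLLIN))
		return 0;

	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR);

	/* Another reader may have consumed the event between poll and read. */
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return n;
}
```

Without the O_NONBLOCK handling, the second of two back-to-back reads
after a single event would block, which is what the three pending
drm_read() calls in the log look like.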

Anyway, I need to trap the X stall and diagnose it, but I have to leave
my machine now.  I will check it tomorrow.

Meanwhile, it would be interesting to see whether this covers Maarten's
case, too...

thanks,

Takashi