[BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Thu Sep 12 08:43:29 PDT 2013

On Thu, Sep 12, 2013 at 05:36:43PM +0200, Daniel Vetter wrote:
> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
> removed. It was there to give the error handler a chance to sneak in
> and reset the hw/sw tracking when the gpu is dead. That hack goes back
> to the days when the locking around our error handler was somewhere
> between nonexistent and totally broken, nowadays we keep things from
> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
> whip up a patch to rip this out. I'll also check that our testsuite
> properly exercises this path (needs a bit of work on a quick look for
> better coverage).
> 
> The one in udl just looks like copypasta from i915, without any
> justification (at least I don't see any) for why it's there. Probably
> can die, too, since there isn't any gpu to reset on usb display-link
> devices ...

OK, awesome that. 2 down.

> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
> into it's own pagefault handler and then deadlock, the trylock just
> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
> fun userspace did and now have testcases for them. The right solution
> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
> holds locks and have slowpaths which drops locks, copies stuff into a
> temp allocation and then continues. At least that's how we've fixed
> all those inversions in i915-gem. I'm not volunteering to fix this ;-)

Yikes.. so how common is it? If I simply rip the set_need_resched() out
it will 'spin' on the fault a little longer until a 'natural' preemption
point -- if such a thing is every going to happen.

It would make that path take longer, but not be more or less broken.

So if its a rare path, I'll just rip the set_need_resched() out and you
DRM people can then fix up at your own pace.