[Intel-gfx] [PATCH 5/8] drm/i915: vma NULL pointer check

Tue Oct 13 04:43:56 PDT 2015

On Fri, Oct 09, 2015 at 12:59:44PM +0100, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote:
> > On 09/10/2015 08:48, Chris Wilson wrote:
> > >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> > >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> > >>iterating and do early exit instead of causing a NULL pointer exception.
> > >
> > >Wrong.
> > >-Chris
> > >
> > 
> > So the NULL pointer exception that I ran into multiple times during
> > several different test runs on the line that says "vma->pin_count"
> > was not because the vma pointer was NULL. Would you mind sharing
> > your explanation to how this might have happened? Is it because
> > we're not synchronizing and there's no protection against the driver
> > deallocating vmas while we're trying to capture them? If this all
> > ties into the aforementioned RCU-based solution then maybe we should
> > go with that then.
> 
> Correct. The driver is retiring requests whilst the hang check worker is
> running. And you chased a stale pointer, you could have equally chased a
> stale vma->obj, vma->ctx etc pointers.
> 
> What I have done in the past is to serialise the retirement and the
> hangcheck using a spinlock (as adding to the end of the list is safe),
> but there are still weak spots when walking the list of bound vma.
> What I would actually feel more comfortable with is to only record the
> request->batch_vma, at the cost of losing information about the
> currently bound buffers.
> 
> Or we could just stop_machine() whilst running the capture and have the
> machine unresponsive for a few 100ms. I don't think simply RCU the lists
> is enough (VM active_list, request list, bound list) as eventually we
> chase a pointer from obj itself (which means we need to RCU pretty much
> everything).

I discussed stop_machine with Chris a bit - it does avoid some of the
races, but not all of them (since the lists might be in wonky state right
when we stop_machine). And yes rcu means we need to rcu-free any object
that the error state looks at.

What we could do is protect list consistency with rcu (where possible, see
my other mail about possible problems). And use stop_machine to make sure
the underlying objects don't go poof (instead of kfree_rcu).

Not sure where the gap in that idea is yet ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch