[Intel-gfx] [PATCH 5/8] drm/i915: vma NULL pointer check

Fri Oct 9 04:59:44 PDT 2015

On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote:
> On 09/10/2015 08:48, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> >>iterating and do early exit instead of causing a NULL pointer exception.
> >
> >Wrong.
> >-Chris
> >
> 
> So the NULL pointer exception that I ran into multiple times during
> several different test runs on the line that says "vma->pin_count"
> was not because the vma pointer was NULL. Would you mind sharing
> your explanation to how this might have happened? Is it because
> we're not synchronizing and there's no protection against the driver
> deallocating vmas while we're trying to capture them? If this all
> ties into the aforementioned RCU-based solution then maybe we should
> go with that then.

Correct. The driver is retiring requests whilst the hang check worker is
running, so you chased a stale pointer. You could equally have chased a
stale vma->obj, vma->ctx, etc.

What I have done in the past is to serialise the retirement and the
hangcheck using a spinlock (as adding to the end of the list is safe),
but there are still weak spots when walking the list of bound vma.
What I would actually feel more comfortable with is to only record the
request->batch_vma, at the cost of losing information about the
currently bound buffers.
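
For illustration only, here is a minimal userspace sketch of serialising
retirement and capture with one lock. It is not i915 code: struct vma,
struct vm, vm_bind(), vm_retire() and vm_capture_pinned() are hypothetical
names, and a pthread mutex stands in for the spinlock. For simplicity the
sketch also takes the lock when adding to the list.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the real i915 structures. */
struct vma {
	int pin_count;
	struct vma *next;
};

struct vm {
	pthread_mutex_t lock;	/* plays the role of the spinlock */
	struct vma *bound;	/* singly linked list of bound vma */
};

static void vm_init(struct vm *vm)
{
	pthread_mutex_init(&vm->lock, NULL);
	vm->bound = NULL;
}

/* Bind: allocate a vma and link it in. Returns 0 on success, -1 on OOM. */
static int vm_bind(struct vm *vm, int pin_count)
{
	struct vma *vma = malloc(sizeof(*vma));

	if (!vma)
		return -1;
	vma->pin_count = pin_count;

	pthread_mutex_lock(&vm->lock);
	vma->next = vm->bound;
	vm->bound = vma;
	pthread_mutex_unlock(&vm->lock);
	return 0;
}

/* Retirement: unlink and free every unpinned vma. Returns the number freed. */
static int vm_retire(struct vm *vm)
{
	struct vma **link;
	int freed = 0;

	pthread_mutex_lock(&vm->lock);
	link = &vm->bound;
	while (*link) {
		struct vma *vma = *link;

		if (vma->pin_count == 0) {
			*link = vma->next;
			free(vma);
			freed++;
		} else {
			link = &vma->next;
		}
	}
	pthread_mutex_unlock(&vm->lock);
	return freed;
}

/*
 * Capture: walk the list under the same lock, so no vma can be freed
 * while it is being inspected. Returns the number of pinned vma seen.
 */
static int vm_capture_pinned(struct vm *vm)
{
	struct vma *vma;
	int pinned = 0;

	pthread_mutex_lock(&vm->lock);
	for (vma = vm->bound; vma; vma = vma->next) {
		assert(vma->pin_count >= 0);
		if (vma->pin_count > 0)
			pinned++;
	}
	pthread_mutex_unlock(&vm->lock);
	return pinned;
}
```

The lock only protects the list nodes themselves. Any pointer the capture
follows out of a node (an object or context, say) would need its own
protection, which is the weak spot described above.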

Or we could just stop_machine() whilst running the capture and have the
machine unresponsive for a few hundred ms. I don't think simply applying
RCU to the lists (VM active_list, request list, bound list) is enough, as
eventually we chase a pointer from obj itself (which means we would need
to RCU-protect pretty much everything).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre