[Intel-gfx] [PATCH 3/8] drm/i915: Cope with request list state change during error state capture

Wed Oct 14 05:45:32 PDT 2015

On Wed, Oct 14, 2015 at 12:46:51PM +0100, Tomas Elf wrote:
> On 13/10/2015 12:39, Daniel Vetter wrote:
> >On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote:
> >>On 09/10/2015 08:48, Chris Wilson wrote:
> >>>On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> >>>>Since we're not synchronizing the ring request list during error state capture
> >>>>the request list state might change between the time the corresponding error
> >>>>request list was allocated and dimensioned to the time when the ring request
> >>>>list is actually captured into the error state. If this happens, throw a
> >>>>WARNING and do early exit and be aware that the captured error state might not
> >>>>be fully reliable.
> >>>>
> >>>>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> >>>>---
> >>>>  drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>index 32c1799..cc75ca4 100644
> >>>>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
> >>>>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
> >>>>  			struct drm_i915_error_request *erq;
> >>>>
> >>>>+			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
> >>>
> >>>Request cannot be null, count can legitmately be more, the WARN on is
> >>>inappropriate. Again, I sent several patches over the past couple of
> >>>years to fix this.
> >>>-Chris
> >>>
> >>
> >>Ok, so having the driver request list change grow in between the point where
> >>we allocate and set up the error state request list to the same size as the
> >>driver request list (since that's what count being larger than the list size
> >>implies) is legitimate? Traversing into unallocated memory seems pretty
> >>dodgy to me but if you say so.
> >
> >We still need to handle it ofc, but just not WARN on this condition since
> >it can happen.
> >-Daniel
> >
> 
> With the RCU discussion ongoing I guess we should we just drop this patch? I
> agree that what I've been seeing looks like a side-effect of concurrent
> memory deallocation. Whatever solution you reach should make this patch
> pointless.

Depending upon the solution we pick we might still need it, since if we
only make sure we don't chase freed memory but still allow the list itself
to get modified (which would be the case with kfree_rcu), then you need
it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch