[Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

Mon Jan 27 14:45:22 CET 2014

On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > The previous check during error capture for whether the current VM
> > > > > should be scanned was gen < 7. That was more or less trying to
> > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > I meant to do because I was more interested in working backwards from
> > > > > hardware state. However, this is incorrect because it does not cover
> > > > > platforms that are gen7 or later but lack PPGTT. Examples would be
> > > > > BYT, which is gen7 but doesn't have PPGTT, BDW, or any platform from
> > > > > gen7 onward with the PPGTT module parameter invoked.
> > > > > 
> > > > > I am /assuming/ BYT was broken; I have not actually checked.
> > > > > 
> > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > the PPGTT info is in the error state).
> > > > > 
> > > > > I think Mika/Chris may have been looking at this too.
> > > > 
> > > > Sure, we are looking at identifying the guilty request/batch by using
> > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > think that search is now definitive, since we preallocate the request and
> > > > no longer do request coalescing on ENOMEM (i.e. there is a 1:1
> > > > relationship between seqno/batch/request).
> > > > 
> > > > That should also apply here and be much simpler.
> > > 
> > > How does that solve hangs which aren't caused by requests?
> > 
> > Was that an intentional rhetorical question?
> > 
> > The code you touch here only deals with requests - finding the current
> > batchbuffer if any.
> > -Chris
> > 
> 
> It wasn't rhetorical. I temporarily ignored that all batches are tied to
> a request.
> 
> So what's the plan now? Just looking at the callers, we seem to have a
> couple of callers that can't easily identify the bad request.

I was thinking along the lines of:

@@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
        }
 
        seqno = ring->get_seqno(ring, false);
-       list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
-               if (!is_active_vm(vm, ring))
+       list_for_each_entry(request, &ring->request_list, list) {
+               if (i915_seqno_passed(seqno, request->seqno))
                        continue;
 
-               found_active = true;
-
-               list_for_each_entry(vma, &vm->active_list, mm_list) {
-                       obj = vma->obj;
-                       if (obj->ring != ring)
-                               continue;
-
-                       if (i915_seqno_passed(seqno, obj->last_read_seqno))
-                               continue;
-
-                       if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
-                               continue;
-
-                       /* We need to copy these to an anonymous buffer as the simplest
-                        * method to avoid being overwritten by userspace.
-                        */
-                       return i915_error_object_create(dev_priv, obj, vm);
-               }
+               /* We need to copy these to an anonymous buffer as the simplest
+                * method to avoid being overwritten by userspace.
+                */
+               return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
        }
 
-       WARN_ON(!found_active);
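
To make the search in the diff above concrete, here is a minimal userspace
sketch of "find the first incomplete request". It is not kernel code:
struct sketch_request, its batch_id field, and first_incomplete_request()
are hypothetical stand-ins, and an array replaces ring->request_list. The
comparison in seqno_passed() is a wraparound-safe signed difference, which is
assumed here to be how i915_seqno_passed() behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe test: true if seq1 is at or after seq2. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical stand-in for a request: one seqno, one batch. */
struct sketch_request {
	uint32_t seqno;
	int batch_id;
};

/*
 * Walk the requests in submission order and return the first one whose
 * seqno the hardware has not yet passed. Returns NULL if every request
 * has completed.
 */
static const struct sketch_request *
first_incomplete_request(const struct sketch_request *reqs, size_t n,
			 uint32_t hw_seqno)
{
	assert(reqs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Already completed, so it cannot be the guilty one. */
		if (seqno_passed(hw_seqno, reqs[i].seqno))
			continue;

		return &reqs[i];
	}

	return NULL;
}
```

With requests at seqno 10, 11, 12 and the ring reporting seqno 10, this
returns the request with seqno 11. Its batch is the one that would be copied
into the error state, provided the 1:1 seqno/batch/request relationship holds.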

-- 
Chris Wilson, Intel Open Source Technology Centre