[Intel-gfx] [PATCH 6/9] drm/i915: driver based PASID handling

Fri Oct 9 00:28:37 PDT 2015

On Thu, Oct 08, 2015 at 11:46:08PM +0100, David Woodhouse wrote:
> On Thu, 2015-10-08 at 12:29 +0100, Tomas Elf wrote:
> > 
> > Could someone clarify what this means from the TDR point of view, 
> > please? When you say "context blew up" I'm guessing that you mean that 
> > some context caused the fault handler to get involved somehow?
> > 
> > Does this imply that the offending context will hang and the driver will 
> > have to detect this hang? If so, then yes - if we have the per-engine 
> > hang recovery mode as part of the upcoming TDR work in place then we 
> > could handle it by stepping over the offending batch buffer and moving 
> > on with a minimum of side-effects on the rest of the driver/GPU.
> 
> I don't think the context does hang.
> 
> I've made the page-request code artificially fail and report that it
> was an invalid page fault. The gem_svm_fault test seems to complete
> (albeit complaining that the test failed). Whereas if I just don't
> service the page-request at all, *then* the GPU hang is detected.
> 
> I haven't actually looked at precisely what *is* happening.

Hm, if this still works the same way as on older platforms, then reads
from the gpu that pagefault just return all 0 and writes go nowhere. That
would also explain the ever-increasing CS execution pointer, since a zero
dword decodes as MI_NOOP and the CS is busy churning through 48 bits'
worth of address space filled with them. I'd have hoped our hw would do
better than that with svm ...
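
To illustrate the zero-fill behaviour described above, here is a minimal toy
model in C, not actual driver or hardware code. The MI_INSTR, MI_NOOP and
MI_BATCH_BUFFER_END encodings follow the i915 definitions; struct toy_mem,
toy_read() and toy_cs_run() are hypothetical names made up for this sketch,
and the assumption that every unmapped read returns 0 is exactly the
"older platforms" behaviour guessed at above, not something verified for svm.

```c
#include <stddef.h>
#include <stdint.h>

/* i915 encodes MI commands as (opcode << 23) | flags, so MI_NOOP is 0. */
#define MI_INSTR(opcode, flags) (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                 MI_INSTR(0, 0)
#define MI_BATCH_BUFFER_END     MI_INSTR(0x0a, 0)

/* Hypothetical toy address space: only dwords [0, mapped_dwords) are mapped. */
struct toy_mem {
	const uint32_t *mapped;
	size_t mapped_dwords;
};

/* Assumption: a read that faults silently returns 0 instead of stopping. */
uint32_t toy_read(const struct toy_mem *m, size_t idx)
{
	return idx < m->mapped_dwords ? m->mapped[idx] : 0;
}

/*
 * Walk the command stream from dword 0 until MI_BATCH_BUFFER_END is fetched
 * or max_dwords have been consumed. Returns the final execution pointer.
 */
size_t toy_cs_run(const struct toy_mem *m, size_t max_dwords)
{
	size_t ptr = 0;

	while (ptr < max_dwords) {
		if (toy_read(m, ptr) == MI_BATCH_BUFFER_END)
			break;
		ptr++;	/* every other dword is treated as a no-op in this toy */
	}
	return ptr;
}
```

With a batch that runs off the end of its mapping without an
MI_BATCH_BUFFER_END, toy_cs_run() only stops at max_dwords, which is the toy
equivalent of the execution pointer increasing forever.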

If there's really no way to make the context hang when we complete the
fault as invalid, then I guess we'll have to hang it by not completing the
page request at all. Otherwise we'll have to roll our own fault detection
code right from the start.
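
The choice above can be written down as a small decision function. This is
only a sketch of the policy being discussed; enum toy_fault_action and
toy_pick_action() are hypothetical names, not existing i915 or IOMMU
interfaces, and whether an "invalid" reply actually stops the context is
the open question, so it is passed in as a parameter.

```c
#include <stdbool.h>

/* Hypothetical outcomes for one page request. */
enum toy_fault_action {
	TOY_COMPLETE_SUCCESS,	/* page made present, reply normally */
	TOY_COMPLETE_INVALID,	/* reply "invalid" and let the hw stop the context */
	TOY_LEAVE_PENDING,	/* never reply, rely on GPU hang detection */
};

/*
 * page_resolved: the fault could be serviced.
 * invalid_reply_stops_ctx: assumption about hw behaviour, unknown so far.
 */
enum toy_fault_action toy_pick_action(bool page_resolved,
				      bool invalid_reply_stops_ctx)
{
	if (page_resolved)
		return TOY_COMPLETE_SUCCESS;
	if (invalid_reply_stops_ctx)
		return TOY_COMPLETE_INVALID;
	return TOY_LEAVE_PENDING;
}
```

The third alternative, completing the fault anyway and detecting the bad
context in software, is deliberately left out of the sketch.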
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch