[Intel-gfx] [PATCH v2] drm/i915: Restore inhibiting the load of the default context

Francisco Jerez currojerez at riseup.net
Thu Dec 10 05:57:30 PST 2015


Chris Wilson <chris at chris-wilson.co.uk> writes:

> On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote:
>> Mika Kuoppala <mika.kuoppala at linux.intel.com> writes:
>> 
>> > Chris Wilson <chris at chris-wilson.co.uk> writes:
>> >
>> >> Following a GPU reset, we may leave the context in a poorly defined
>> >> state, and reloading from that context will leave the GPU flummoxed. For
>> >> secondary contexts, this will lead to that context being banned - but
>> >> currently it is also causing the default context to become banned,
>> >> leading to turmoil in the shared state.
>> >>
>> >> This is a regression from
>> >>
>> >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
>> >> Author: Ben Widawsky <benjamin.widawsky at intel.com>
>> >> Date:   Mon Mar 16 16:00:58 2015 +0000
>> >>
>> >>     drm/i915: Initialize all contexts
>> >>
>> >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
>> >> default context.
>> >>
>> 
>> AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed
>> justified.  Ben explained that it was needed to fix a pagefault in the
>> default context under certain conditions.  I don't know the details of
>> the original change (Ben CC'ed), but it seems like this would be trading
>> one bug for another?
>
> A bug in a feature that never worked and isn't enabled?
>  
>> Other than that this opens the door again to subtle state leaks between
>> processes, as I learned recently while implementing L3 partitioning in
>> Mesa.  Mesa now changes the L3 configuration in ways that will break
>> assumptions from processes that use the default context (the DDX).  The
>> DDX assumes, for instance, that the URB size is set according to the
>> hardware defaults, but it doesn't actually program the URB size itself,
>> so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the
>> default context for correct operation.  This commit breaks that
>> assumption and the kernel ABI.
>
> Wrong the ABI is the other way around and has been for 10 years. Note
> the kernel isn't also ensuring that the default context has the default
> hardware values either. The "golden renderstate" also doesn't set all
> defaults and is also a recent introduction.
>
> It changes the ABI back to what it was....

Sounds like a change fully unrelated to fixing the bug on GPU reset.

The ABI change done in 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b seemed
legitimate because it made the context state on switch more well-defined
than it used to be, IOW it was a backwards-compatible ABI change because
it couldn't possibly break userspace assumptions.  This change OTOH
breaks an assumption about the kernel ABI done by the DDX now.

>  
>> Mesa has a workaround in place to reduce the likelihood of such leaks,
>> but the solution is far from ideal because it relies on userspace
>> cooperation and had a measurable impact in performance (because it
>> requires userspace to assume the worst-case scenario that the following
>> batch is going to be from a different context with MI_RESTORE_INHIBIT
>> set, so we have to restore the hardware default L3 configuration at the
>> end of every batch even if that's actually not the case), for that
>> reason we would like to drop the userspace workaround in the future at
>> least on v4.1 kernels and newer.
>
> Mesa has workarounds for leaks between hardware contexts, i.e. state
> that is not stored in the context itself. Mesa also has to assume a
> clean slate when setting up the context for the first time, and doesn't
> use the default context itself.
>  

No, the workaround involves restoring state which *is* part of the
hardware context, it just won't get saved and restored with this commit
because of the MI_RESTORE_INHIBIT flag when switching to the DDX'
default context.

>> One more question below.
>> 
>> >> v2: Mark the global default context as uninitialized on GPU reset so
>> >> that the context-local workarounds are reloaded upon re-enabling.
>> >>
>> >> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> >
>> > Reviewed-by: Mika Kuoppala <mika.kuoppala at intel.com>
>> >
>> >> Cc: Michel Thierry <michel.thierry at intel.com>
>> >> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
>> >> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>> >> ---
>> >>  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
>> >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> >> index 43761c5bcaca..f024d5d2c746 100644
>> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> >> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
>> >>  			i915_gem_context_unreference(lctx);
>> >>  			ring->last_context = NULL;
>> >>  		}
>> >> +
>> >> +		/* Force the GPU state to be reinitialised on enabling */
>> >> +		if (ring->default_context)
>> >> +			ring->default_context->legacy_hw_ctx.initialized = false;
>> >>  	}
>> >>  }
>> >>  
>> >> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
>> >>  	if (ret)
>> >>  		goto unpin_out;
>> >>  
>> >> -	if (!to->legacy_hw_ctx.initialized) {
>> >> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
>> 
>> This hunk causes MI_RESTORE_INHIBIT to be set again for the default
>> context regardless of whether a reset happened or not, so it seems
>> unrelated to the rest of your change.  Maybe I'm understanding this
>> wrong but doesn't the !initialized check together with the hunk above
>> already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset,
>> which is what you wanted to achieve?
>
> This is the change to *restore* the kernel ABI. The flag in reset is to
> force the WA LRI to be re-emitted (not for gen6 ofc) as they will not be
> preserved.

Again you don't justify why changing the kernel ABI is required to fix
a GPU reset bug.

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/intel-gfx/attachments/20151210/7f17bb19/attachment.sig>


More information about the Intel-gfx mailing list