[Intel-gfx] [PATCH 2/3] drm/i915: Complete requests in nop_submit_request

Fri Nov 18 14:22:10 UTC 2016

On 18/11/2016 13:33, Chris Wilson wrote:
> On Fri, Nov 18, 2016 at 03:03:21PM +0200, Mika Kuoppala wrote:
>> Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com> writes:
>>
>>> On 18/11/2016 09:37, Chris Wilson wrote:
>>>> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
>>>> transfer onto execution timeline to actual hw submission") the
>>>> global seqno advance was deferred until the submit_request callback.
>>>> After wedging the GPU, we were installing a nop_submit_request handler
>>>> (to avoid waking up the dead hw) but I had missed converting this over
>>>> to the new scheme. Under the new scheme, we have to explicitly call
>>>> i915_gem_submit_request() from the submit_request handler to mark the
>>>> request as on the hardware. If we don't the request is always pending,
>>>> and any waiter will continue to wait indefinitely and hangcheck will not
>>>> be able to resolve the lockup.
>>>>
>>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
>>>> Testcase: igt/gem_eio/in-flight
>>>> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> ---
>>>>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>> index 7b9f5b99b0f3..7037a8b26903 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>>>>
>>>>  static void nop_submit_request(struct drm_i915_gem_request *request)
>>>>  {
>>>> +	i915_gem_request_submit(request);
>>>> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
>>>
>>> Slight deja vu but not sure - we don't have a way of marking these as
>>> failed so what happens in practice here? This as at the point of no
>>> return, no replay, or allowing the context to recover or something?
>
> This is past the point of no return. We failed to reset the hardware, so
> we need to catch all the inflight requests and signal them. Treating
> them in flight is nasty as they are a part of a giant web of
> dependencies, so I wanted to have them just complete quietly as they
> became ready (to be careful that we don't start dependent third party
> work before other third party work finishes).

Right, I wanted to confirm if I was reading it correctly. In this case:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko