[Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Jan 3 13:17:19 UTC 2017


On 03/01/2017 12:38, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 12:13, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:46, Chris Wilson wrote:
>>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>>>> whether or not the request completed successfully, or in our case if we
>>>>>>> detected a hang during the request (signaled via -EIO in
>>>>>>> SYNC_IOC_FILE_INFO).
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>>>> 		ring_hung = false;
>>>>>>> 	}
>>>>>>>
>>>>>>> -	if (ring_hung)
>>>>>>> +	if (ring_hung) {
>>>>>>> 		i915_gem_context_mark_guilty(request->ctx);
>>>>>>> -	else
>>>>>>> +		request->fence.status = -EIO;
>>>>>>> +	} else {
>>>>>>> 		i915_gem_context_mark_innocent(request->ctx);
>>>>>>> +	}
>>>>>>>
>>>>>>> 	if (!ring_hung)
>>>>>>> 		return;
>>>>>>>
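(For reference, not part of the patch: once a request has been exported
as a sync_file, userspace could read the error back along these lines
after the fence has signaled. This is only a sketch; fd handling and
error paths are elided.)

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* Returns 0 if the fence completed successfully, or the negative
 * error (e.g. -EIO after a hang) reported via SYNC_IOC_FILE_INFO. */
static int fence_status(int fence_fd)
{
	struct sync_file_info info;

	memset(&info, 0, sizeof(info));
	if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info))
		return -errno;

	return info.status < 0 ? info.status : 0;
}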
>>>>>>
>>>>>> Reading what happens later in this function, should we set the
>>>>>> status of all the other requests we are about to clear?
>>>>>>
>>>>>> However, one thing I don't understand is how this scheme interacts
>>>>>> with the current userspace. We will clear/no-op some of the
>>>>>> submitted requests since the state is corrupt. But how will
>>>>>> userspace notice this before it submits more requests?
>>>>>
>>>>> There is currently no mechanism for userspace to detect a
>>>>> hung request. (It can use the uevent for async notification of the
>>>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>>>> can track the number of hangs it caused, but the delay makes any
>>>>> roundtripping impractical (i.e. you have to synchronise before all
>>>>> rendering if you must detect the event immediately). Note also that we
>>>>> do not want to give out interprocess information (i.e. to allow one
>>>>> client to spy on another), which makes things harder to get right.
>>>>
>>>> So the idea is to clear already-submitted requests _if_ userspace is
>>>> synchronising before all rendering and looking at reset stats, to
>>>> make it theoretically possible to detect the corrupt state?
>>>
>>> No, I just don't see a way that userspace can detect the hang without
>>> testing after seeing the request signaled (either by waiting on the
>>> batch or by waiting on the fence), i.e. by being completely synchronous
>>> (or at least choosing its synchronous points very carefully, such as
>>> around IPC). It can either poll reset-count or use sync_file (which
>>> requires fence exporting).
>>>
>>> The current robustness interface is a basic query on whether any reset
>>> occurred within the context, not when.
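
(For completeness, the reset-count polling mentioned above looks roughly
like this with the existing GET_RESET_STATS ioctl; drm fd and context
setup are elided, so treat it as a sketch only.)

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Returns true if the given context has either caused a hang
 * (batch_active) or had innocent batches discarded because of one
 * (batch_pending). */
static bool context_saw_reset(int drm_fd, uint32_t ctx_id)
{
	struct drm_i915_reset_stats stats;

	memset(&stats, 0, sizeof(stats));
	stats.ctx_id = ctx_id;

	if (ioctl(drm_fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return false;

	return stats.batch_active || stats.batch_pending;
}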
>>
>> Why do we bother with clearing the submitted requests then?
>
> The same reason we ban processes from submitting new requests if they
> cause repeated hangs. If, before we ban that client, it has already
> submitted 1000 hanging requests, it has successfully locked the machine
> up for a couple of hours.

So we would need to gate the clearing on the transition to the banned
state, I think, because currently it is done unconditionally.
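
Something along these lines is what I have in mind, as a rough sketch
only (i915_gem_context_is_banned() is a stand-in name for whatever the
real check ends up being called; I have not tried this):

	if (ring_hung) {
		i915_gem_context_mark_guilty(request->ctx);
		request->fence.status = -EIO;
	} else {
		i915_gem_context_mark_innocent(request->ctx);
	}

	/* Only clear the already submitted requests once the context
	 * has actually been banned, not on every hang. */
	if (!ring_hung || !i915_gem_context_is_banned(request->ctx))
		return;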

Regards,

Tvrtko
