[Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Jan 3 12:34:16 UTC 2017
On 03/01/2017 12:13, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:46, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>> whether the request completed successfully, or, in our case, whether we
>>>>> detected a hang during the request (reported as -EIO via
>>>>> SYNC_IOC_FILE_INFO).
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>> ring_hung = false;
>>>>> }
>>>>>
>>>>> - if (ring_hung)
>>>>> + if (ring_hung) {
>>>>> i915_gem_context_mark_guilty(request->ctx);
>>>>> - else
>>>>> + request->fence.status = -EIO;
>>>>> + } else {
>>>>> i915_gem_context_mark_innocent(request->ctx);
>>>>> + }
>>>>>
>>>>> if (!ring_hung)
>>>>> return;
>>>>>
>>>>
>>>> Reading what happens later in this function, should we set the
>>>> status of all the other requests we are about to clear?
>>>>
>>>> However, one thing I don't understand is how this scheme interacts
>>>> with current userspace. We will clear/no-op some of the
>>>> submitted requests since the state is corrupt. But how will
>>>> userspace notice this before it submits more requests?
>>>
>>> There is currently no mechanism for userspace to detect a hung
>>> request. (It can use the uevent for async notification of the
>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>> can track the number of hangs it caused, but the delay makes any
>>> roundtripping impractical (i.e. you have to synchronise before all
>>> rendering if you must detect the event immediately). Note also that we
>>> do not want to give out interprocess information (i.e. to allow one
>>> client to spy on another), which makes things harder to get right.
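For reference, tracking the hangs a context caused maps onto the
existing I915_GET_RESET_STATS ioctl. A minimal userspace sketch (the
drm_fd and ctx_id plumbing is assumed, and the libdrm uapi headers
are on the include path):

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <drm/i915_drm.h>

  /* Returns >0 if this context has been blamed for (or caught up in)
   * a hang, 0 if clean, -1 on ioctl error. */
  static int context_hang_count(int drm_fd, uint32_t ctx_id)
  {
          struct drm_i915_reset_stats stats;

          memset(&stats, 0, sizeof(stats));
          stats.ctx_id = ctx_id;

          if (ioctl(drm_fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
                  return -1;

          /* batch_active: our batch was executing when the GPU hung;
           * batch_pending: our batch was queued behind the hang. */
          return (int)(stats.batch_active + stats.batch_pending);
  }

Note this only says that a reset happened, not which request it hit,
which is the point about the delay making round-tripping impractical.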
>>
>> So the idea is to clear already-submitted requests _if_ userspace is
>> synchronising before all rendering and checking the reset stats, to
>> make it theoretically possible to detect the corrupt state?
>
> No, I just don't see a way that userspace can detect the hang without
> testing after seeing the request signaled (either by waiting on the
> batch or by waiting on the fence), i.e. by being completely synchronous
> (or at least choosing its synchronisation points very carefully, such as
> around IPC). It can either poll reset-count or use sync_file (which
> requires fence exporting).
>
> The current robustness interface is a basic query on whether any reset
> occurred within the context, not when.
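For completeness, the sync_file route would look roughly like this
once the fence has been exported to an fd (a sketch against the
SYNC_IOC_FILE_INFO uAPI; the -EIO set by this patch is what surfaces
in the status field after the fence signals):

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/sync_file.h>

  /* Returns the fence status: 0 = not yet signaled, 1 = signaled OK,
   * negative errno (e.g. -EIO after a detected hang) = error. */
  static int sync_file_status(int sync_fd)
  {
          struct sync_file_info info;

          memset(&info, 0, sizeof(info));
          if (ioctl(sync_fd, SYNC_IOC_FILE_INFO, &info))
                  return -1;

          return info.status;
  }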
Why do we bother with clearing the submitted requests then?
Regards,
Tvrtko