[Intel-gfx] [RFC 7/9] drm/i915: Interrupt driven fences

Maarten Lankhorst maarten.lankhorst at linux.intel.com
Wed Aug 5 04:05:27 PDT 2015


On 05-08-15 at 10:05, Daniel Vetter wrote:
> On Mon, Aug 03, 2015 at 10:20:29AM +0100, Tvrtko Ursulin wrote:
>> On 07/27/2015 03:00 PM, Daniel Vetter wrote:
>>> On Mon, Jul 27, 2015 at 02:20:43PM +0100, Tvrtko Ursulin wrote:
>>>> On 07/17/2015 03:31 PM, John.C.Harrison at Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>
>>>>> The intended usage model for struct fence is that the signalled status should be
>>>>> set on demand rather than polled. That is, there should not be a need for a
>>>>> 'signaled' function to be called every time the status is queried. Instead,
>>>>> 'something' should be done to enable a signal callback from the hardware which
>>>>> will update the state directly. In the case of requests, this is the seqno
>>>>> update interrupt. The idea is that this callback will only be enabled on demand
>>>>> when something actually tries to wait on the fence.
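
A minimal sketch of the on-demand pattern being described, assuming the
struct fence API from <linux/fence.h> of this kernel; the request/ring
field and helper names below are illustrative, not this patch's actual
code:

/*
 * .enable_signaling is only called when someone actually waits on the
 * fence, so the seqno interrupt stays off until then instead of the
 * status being polled via .signaled.
 */
static bool i915_fence_enable_signaling(struct fence *fence)
{
	struct drm_i915_gem_request *req =
		container_of(fence, struct drm_i915_gem_request, fence);

	/* Already completed? Then no interrupt is needed at all. */
	if (i915_gem_request_completed(req, false))
		return false;

	/* Turn on the ring's user (seqno) interrupt; the IRQ handler
	 * will signal the fence once the seqno pops out. */
	return req->ring->irq_get(req->ring);
}
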
>>>>>
>>>>> This change removes the polling test and replaces it with the callback scheme.
>>>>> Each fence is added to a 'please poke me' list at the start of
>>>>> i915_add_request(). The interrupt handler then scans through the 'poke me' list
>>>>> when a new seqno pops out and signals any matching fence/request. The fence is
>>>>> then removed from the list so the entire request stack does not need to be
>>>>> scanned every time. Note that the fence is added to the list before the commands
>>>>> to generate the seqno interrupt are added to the ring. Thus the sequence is
>>>>> guaranteed to be race free if the interrupt is already enabled.
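
Roughly, that mechanism could look like the sketch below; the locking
and the ring-side list/lock names are assumptions, only
request->signal_list and request->cancelled are taken from the patch:

/* In i915_add_request(), before the seqno write / interrupt commands
 * are emitted to the ring, so a completion can never race past an
 * unlisted fence: */
spin_lock_irq(&ring->fence_lock);
list_add_tail(&request->signal_list, &ring->fence_signal_list);
spin_unlock_irq(&ring->fence_lock);

/* In the ring's user-interrupt handler, each time a new seqno lands: */
static void notify_ring(struct intel_engine_cs *ring)
{
	struct drm_i915_gem_request *req, *tmp;
	u32 seqno = ring->get_seqno(ring, false);

	spin_lock(&ring->fence_lock);
	list_for_each_entry_safe(req, tmp, &ring->fence_signal_list,
				 signal_list) {
		if (!i915_seqno_passed(seqno, req->seqno))
			continue;
		fence_signal(&req->fence);
		/* Drop it so the whole request stack is not rescanned
		 * on every interrupt. */
		list_del_init(&req->signal_list);
	}
	spin_unlock(&ring->fence_lock);
}
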
>>>>>
>>>>> Note that the interrupt is only enabled on demand (i.e. when __wait_request() is
>>>>> called). Thus there is still a potential race when enabling the interrupt as the
>>>>> request may already have completed. However, this is simply solved by calling
>>>>> the interrupt processing code immediately after enabling the interrupt and
>>>>> thereby checking for already completed requests.
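
The race and its fix might read like this in the waiter, again with
assumed names and reusing the notify_ring() sketch above:

/* Enable the seqno interrupt only now that someone is waiting. */
if (!ring->irq_get(ring))
	return -ENODEV;

/* The request may have completed before the interrupt was enabled, in
 * which case no interrupt will ever fire for it.  Run the notify path
 * once by hand to close that window. */
notify_ring(ring);
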
>>>>>
>>>>> Lastly, the ring clean up code has the possibility to cancel outstanding
>>>>> requests (e.g. because TDR has reset the ring). These requests will never get
>>>>> signalled and so must be removed from the signal list manually. This is done by
>>>>> setting a 'cancelled' flag and then calling the regular notify/retire code path
>>>>> rather than attempting to duplicate the list manipulation and clean up code in
>>>>> multiple places. This also avoids any race condition where the cancellation
>>>>> might occur after or during the arrival of the completion interrupt.
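
In the notify_ring() sketch above, the cancellation described here would
presumably amount to something like the fragment below (again an
assumption about this patch, not its actual code):

	/* A cancelled request is taken off the signal list without
	 * being signalled, whether or not its seqno ever arrives. */
	if (req->cancelled) {
		list_del_init(&req->signal_list);
		continue;
	}

Not calling fence_signal() here is exactly what the discussion below is
about: waiters on a cancelled request are never woken.
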
>>>>>
>>>>> v2: Updated to take advantage of the request unreference no longer requiring the
>>>>> mutex lock.
>>>>>
>>>>> For: VIZ-5190
>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>> ---
>>>> [snip]
>>>>
>>>>> @@ -1382,6 +1387,10 @@ static void i915_gem_request_retire(struct drm_i915_gem_request *request)
>>>>>  	list_del_init(&request->list);
>>>>>  	i915_gem_request_remove_from_client(request);
>>>>>
>>>>> +	/* In case the request is still in the signal pending list */
>>>>> +	if (!list_empty(&request->signal_list))
>>>>> +		request->cancelled = true;
>>>>> +
>>>> Another thing I did not see implemented is the sync_fence error state.
>>>>
>>>> This is more about the Android part, but it is related to this cancelled flag
>>>> so I am commenting here.
>>>>
>>>> I thought when TDR kicks in and we set request->cancelled to true, there
>>>> should be a code path which somehow makes sync_fence->status negative.
>>>>
>>>> As it is, because fence_signal will not be called on cancelled requests, I
>>>> thought waiters will wait until the timeout, rather than being woken up and
>>>> returning an error status.
>>>>
>>>> For this to work you would somehow need to make sync_fence->status go
>>>> negative. With normal fence completion it goes from 1 -> 0, via the
>>>> completion callback. I did not immediately see how to make it go negative
>>>> using the existing API.
>>> I think back when we did struct fence we decided that we wouldn't care yet
>>> about forwarding error state, since doing that across drivers when you have a
>>> chain of fences looked complicated. And no one had any clear idea about
>>> what kind of semantics we really want. If we want this we'd need to add
>>> it, but it's probably better to do that as a follow-up (the usual caveats about
>>> open-source userspace and demonstration vehicles apply and all that).
>> Hm, I am not sure, but it feels to me that not having an error state is a
>> problem. Without it userspace can just keep waiting on a fence even
>> though the driver has discarded that workload and never plans to resubmit
>> it. Am I missing something?
> Fences still must complete eventually; they simply won't tell you whether
> the GPU died before completing the fence or not. Which is in line with the GL
> spec for fences and ARB_robustness - inquiring about GPU hangs is done
> through sideband APIs.
Actually you can tell: set fence->status to a negative error code before signalling.
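
With the struct fence of this kernel, ->status is just an int in the
fence, so a minimal sketch of that (the i915 helper name and the
embedded fence field are assumptions) would be:

/* Hypothetical TDR cleanup helper: record the error before signalling,
 * so sync_fence waiters see a negative status instead of success. */
static void i915_gem_request_cancel_fence(struct drm_i915_gem_request *req)
{
	req->fence.status = -EIO;
	fence_signal(&req->fence);
}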

~Maarten

