[Intel-gfx] [RFC 7/9] drm/i915: Interrupt driven fences

Mon Aug 3 02:20:29 PDT 2015

On 07/27/2015 03:00 PM, Daniel Vetter wrote:
> On Mon, Jul 27, 2015 at 02:20:43PM +0100, Tvrtko Ursulin wrote:
>>
>> On 07/17/2015 03:31 PM, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> The intended usage model for struct fence is that the signalled status should be
>>> set on demand rather than polled. That is, there should not be a need for a
>>> 'signaled' function to be called everytime the status is queried. Instead,
>>> 'something' should be done to enable a signal callback from the hardware which
>>> will update the state directly. In the case of requests, this is the seqno
>>> update interrupt. The idea is that this callback will only be enabled on demand
>>> when something actually tries to wait on the fence.
>>>
>>> This change removes the polling test and replaces it with the callback scheme.
>>> Each fence is added to a 'please poke me' list at the start of
>>> i915_add_request(). The interrupt handler then scans through the 'poke me' list
>>> when a new seqno pops out and signals any matching fence/request. The fence is
>>> then removed from the list so the entire request stack does not need to be
>>> scanned every time. Note that the fence is added to the list before the commands
>>> to generate the seqno interrupt are added to the ring. Thus the sequence is
>>> guaranteed to be race free if the interrupt is already enabled.
>>>
>>> Note that the interrupt is only enabled on demand (i.e. when __wait_request() is
>>> called). Thus there is still a potential race when enabling the interrupt as the
>>> request may already have completed. However, this is simply solved by calling
>>> the interrupt processing code immediately after enabling the interrupt and
>>> thereby checking for already completed requests.
>>>
>>> Lastly, the ring clean up code has the possibility to cancel outstanding
>>> requests (e.g. because TDR has reset the ring). These requests will never get
>>> signalled and so must be removed from the signal list manually. This is done by
>>> setting a 'cancelled' flag and then calling the regular notify/retire code path
>>> rather than attempting to duplicate the list manipulatation and clean up code in
>>> multiple places. This also avoid any race condition where the cancellation
>>> request might occur after/during the completion interrupt actually arriving.
>>>
>>> v2: Updated to take advantage of the request unreference no longer requiring the
>>> mutex lock.
>>>
>>> For: VIZ-5190
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>
>> [snip]
>>
>>> @@ -1382,6 +1387,10 @@ static void i915_gem_request_retire(struct drm_i915_gem_request *request)
>>>   	list_del_init(&request->list);
>>>   	i915_gem_request_remove_from_client(request);
>>>
>>> +	/* In case the request is still in the signal pending list */
>>> +	if (!list_empty(&request->signal_list))
>>> +		request->cancelled = true;
>>> +
>>
>> Another thing I did not see implemented is the sync_fence error state.
>>
>> This is more about the Android part, but related to this canceled flag so I
>> am commenting here.
>>
>> I thought when TDR kicks in and we set request->cancelled to true, there
>> should be a code path which somehow makes sync_fence->status negative.
>>
>> As it is, because fence_signal will not be called on canceled, I thought
>> waiters will wait until timeout, rather than being woken up and return error
>> status.
>>
>> For this to work you would somehow need to make sync_fence->status go
>> negative. With normal fence completion it goes from 1 -> 0, via the
>> completion callback. I did not immediately see how to make it go negative
>> using the existing API.
>
> I think back when we did struct fence we decided that we won't care yet
> about forwarding error state since doing that across drivers if you have a
> chain of fences looked complicated. And no one had any clear idea about
> what kind of semantics we really want. If we want this we'd need to add
> it, but probably better to do that as a follow-up (usual caveat about
> open-source userspace and demonstration vehicles apply and all that).

Hm, I am not sure but it feels to me not having an error state is a 
problem. Without it userspace can just keep waiting and waiting upon a 
fence even though the driver has discarded that workload and never plans 
to resubmit it. Am I missing something?

Tvrtko