[Intel-gfx] [RFC 0/8] Force preemption

Thu Mar 22 17:41:57 UTC 2018

On 22/03/2018 16:01, Jeff McGee wrote:
> On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
>>
>> On 22/03/2018 14:34, Jeff McGee wrote:
>>> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
>>>> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
>>>>>
>>>>> On 21/03/2018 17:26, jeff.mcgee at intel.com wrote:
>>>>>> From: Jeff McGee <jeff.mcgee at intel.com>
>>>>>>
>>>>>> Force preemption uses engine reset to enforce a limit on the time
>>>>>> that a request targeted for preemption can block. This feature is
>>>>>> a requirement in automotive systems where the GPU may be shared by
>>>>>> clients of critically high priority and clients of low priority that
>>>>>> may not have been curated to be preemption friendly. There may be
>>>>>> more general applications of this feature. I'm sharing as an RFC to
>>>>>> stimulate that discussion and also to get any technical feedback
>>>>>> that I can before submitting to the product kernel that needs this.
>>>>>> I have developed the patches for ease of rebase, given that this is
>>>>>> for the moment considered a non-upstreamable feature. It would be
>>>>>> possible to refactor hangcheck to fully incorporate force preemption
>>>>>> as another tier of patience (or impatience) with the running request.
>>>>>
>>>>> Sorry if it was mentioned elsewhere and I missed it - but does this work
>>>>> only with stateless clients - or in other words, what would happen to
>>>>> stateful clients which would be force preempted? Or the answer is we
>>>>> don't care since they are misbehaving?
>>>>
>>>> They get notified of being guilty for causing a gpu reset; three strikes
>>>> and they are out (banned from using the gpu) using the current rules.
>>>> This is a very blunt hammer that requires the rest of the system to be
>>>> robust; one might argue time spent making the system robust would be
>>>> better served making sure that the timer never expired in the first place
>>>> thereby eliminating the need for a forced gpu reset.
>>>> -Chris
>>>
>>> Yes, for simplification the policy applied to force preempted contexts
>>> is the same as for hanging contexts. It is known that this feature
>>> should not be required in a fully curated system. It's a requirement
>>> if end users will be allowed to install 3rd party apps to run in the
>>> non-critical domain.
>>
>> My concern is whether it is safe to call this force _preemption_, while
>> it is not really expected to work as preemption from the point of
>> view of the preempted context. I may be missing some angle here, but I
>> think a better name would include words like maximum request
>> duration or something.
>>
>> I can see a difference between allowed maximum duration when there
>> is something else pending, and when it isn't, but I don't
>> immediately see that we should consider this distinction for any
>> real benefit?
>>
>> So should the feature just be "maximum request duration"? This would
>> perhaps make it just a special case of hangcheck, which ignores head
>> progress, or whatever we do in there.
>>
>> Regards,
>>
>> Tvrtko
> 
> I think you might be unclear about how this works. We're not starting a
> preemption to see if we can cleanly remove a request that has begun to
> exceed its normal time slice, i.e. hangcheck. This is about bounding
> the time that a normal preemption can take. So first start preemption
> in response to higher-priority request arrival, then wait for preemption
> to complete within a certain amount of time. If it does not, resort to
> reset.
> 
> So it's really "force the resolution of a preemption", shortened to
> "force preemption".

You are right, I veered off in my thinking and ended up with something 
different. :)
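
To check that I now have the mechanism straight, here is a minimal sketch
of the decision as described above. It is a toy model in Python, not
driver code: the Request type, the yield_after_ms field and the
resolve_preemption() name are all made up for illustration, and the
timeout value is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    name: str
    priority: int
    # Hypothetical: time the request needs to reach a preemption point
    # once preemption has been requested; None means it never yields.
    yield_after_ms: Optional[int]


def resolve_preemption(running: Request, incoming: Request,
                       timeout_ms: int) -> str:
    """Model of 'force the resolution of a preemption'."""
    # Preemption is only started when a higher-priority request arrives.
    if incoming.priority <= running.priority:
        return "no-preempt"
    # Normal case: the running request yields within the allowed window.
    if running.yield_after_ms is not None and running.yield_after_ms <= timeout_ms:
        return "preempted"
    # Otherwise the preemption did not complete in time: resort to reset.
    return "engine-reset"
```

The point of the model is that the timer only starts once a preemption
has been requested, which is what distinguishes this from a bound on
total request duration.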

I however still think the name is potentially misleading, since the 
request/context is not getting preempted. It is getting effectively 
killed (sooner or later, directly or indirectly).

Maybe that is OK for the specific use case where clients are merely 
broken and not malicious.

In a more general purpose system it would be a bit random when something 
would work, and when it wouldn't, depending on system setup and even 
timings.

Hm, maybe you don't even really benefit from the standard 
three-strikes-and-you-are-out policy, and for this specific use case you 
should just kill it straight away. If it couldn't be preempted once, why 
pay the penalty any more?

If you don't have it already, a solution which blacklists the offending 
process (in case it creates more contexts), or even its parent (if 
forking is applicable and the implementation feasible), could also be 
beneficial.
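
Roughly what I have in mind, again as a toy Python model rather than
anything resembling the driver: BanPolicy and its methods are
hypothetical names, and a plain per-process strike counter is my
simplification, not how the existing ban accounting actually works.

```python
from collections import defaultdict


class BanPolicy:
    """Counts forced resets per process and bans at a threshold."""

    def __init__(self, strikes_allowed: int):
        # 3 models "three strikes"; 1 models "kill it straight away".
        self.strikes_allowed = strikes_allowed
        # Keyed by process id rather than context, so that creating a
        # fresh context does not reset the count.
        self.strikes = defaultdict(int)
        self.banned = set()

    def record_forced_reset(self, pid: int) -> bool:
        """Record one forced reset; return True if pid is now banned."""
        self.strikes[pid] += 1
        if self.strikes[pid] >= self.strikes_allowed:
            self.banned.add(pid)
        return pid in self.banned

    def may_create_context(self, pid: int) -> bool:
        """A banned process is refused any further contexts."""
        return pid not in self.banned
```

With strikes_allowed=1 the first forced reset is also the last one the 
offender can cause, which is the "why pay the penalty any more" case.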

Regards,

Tvrtko