[Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context

Andrzej Hajda andrzej.hajda at intel.com
Thu Sep 29 09:49:46 UTC 2022


On 29.09.2022 10:22, Tvrtko Ursulin wrote:
> 
> On 28/09/2022 19:27, John Harrison wrote:
>> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>>> Hi Andrzej,
>>>>>>>>
>>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>>> Capturing error state is time consuming (up to 350ms on DG2), so it
>>>>>>>>> should be avoided if possible. Context reset triggered by context
>>>>>>>>> removal is a good example.
>>>>>>>>> With this patch multiple igt tests will not time out and should run
>>>>>>>>> faster.
>>>>>>>>>
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>>>>>> fine for me:
>>>>>>>>
>>>>>>>> Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>>>>>
>>>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>>>> the GuC folks? Daniele, John?
>>>>>>>>
>>>>>>>> Andi
>>>>>>>>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>>>       trace_intel_context_reset(ce);
>>>>>>>>>         if (likely(!intel_context_is_banned(ce))) {
>>>>>>>>> -        capture_error_state(guc, ce);
>>>>>>>>> +        if (!intel_context_is_exiting(ce))
>>>>>>>>> +            capture_error_state(guc, ce);
>>>>>
>>>>> I am not sure here - if we have a persistent context which caused a 
>>>>> GPU hang I'd expect we'd still want error capture.
>>>>>
>>>>> What causes the reset in the affected IGTs? Always preemption timeout?
>>>>>
>>>>>>>>> guc_context_replay(ce);
>>>>>>>
>>>>>>> You definitely don't want to replay requests of a context that is 
>>>>>>> going away.
>>>>>>
>>>>>> My intention was to just avoid the error capture, but that's even 
>>>>>> better - only the condition changes:
>>>>>> -        if (likely(!intel_context_is_banned(ce))) {
>>>>>> +        if (likely(intel_context_is_schedulable(ce))) {
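>>>>>>
>>>>>> For illustration, the whole handler would then look roughly like
>>>>>> this (a sketch only, from memory - the real function has an else
>>>>>> branch as well):
>>>>>>
>>>>>> static void guc_handle_context_reset(struct intel_guc *guc,
>>>>>>                                      struct intel_context *ce)
>>>>>> {
>>>>>>         trace_intel_context_reset(ce);
>>>>>>
>>>>>>         /* Neither capture nor replay a context on its way out. */
>>>>>>         if (likely(intel_context_is_schedulable(ce))) {
>>>>>>                 capture_error_state(guc, ce);
>>>>>>                 guc_context_replay(ce);
>>>>>>         }
>>>>>> }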
>>>>>
>>>>> Yes that helper was intended to be used for contexts which should 
>>>>> not be scheduled post exit or ban.
>>>>>
>>>>> Daniele - you say there are some misses in the GuC backend. Should 
>>>>> most, or even all, of the checks in intel_guc_submission.c be 
>>>>> converted to use intel_context_is_schedulable? My idea indeed was 
>>>>> that "ban" should be a level up from the backends. The backend 
>>>>> should only distinguish between "should I run this or not", and not 
>>>>> the reason.
>>>>
>>>> I think that all of them should be updated, but I'd like Matt B to 
>>>> confirm as he's more familiar with the code than me.
>>>
>>> Right, that sounds plausible to me as well.
>>>
>>> One thing I forgot to mention - the only place where the backend can 
>>> care about the difference between "schedulable" and "banned" is when 
>>> it picks the preempt timeout for non-schedulable contexts. The idea 
>>> is to apply the strict 1ms only to banned (so bad or naughty) 
>>> contexts, while the ones which are exiting cleanly get the full 
>>> preempt timeout as otherwise configured. This solves the ugly user 
>>> experience quirk where GPU resets/errors were logged upon exit/Ctrl-C 
>>> of a well behaving application (using non-persistent contexts). 
>>> Hopefully GuC can match that behaviour so customers stay happy.
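>>>
>>> In execlists terms the timeout selection boils down to something like
>>> this (paraphrased, not the exact upstream code):
>>>
>>>         /*
>>>          * Force a fast exit only for banned (bad) contexts; cleanly
>>>          * exiting ones keep the configured preempt timeout, so no
>>>          * engine reset or error gets logged for a well behaved app.
>>>          */
>>>         if (intel_context_is_banned(rq->context))
>>>                 timeout_ms = 1;
>>>         else
>>>                 timeout_ms = READ_ONCE(engine->props.preempt_timeout_ms);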
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
>> The whole revoke vs ban thing seems broken to me.
>>
>> First of all, if the user hits Ctrl+C we need to kill the context off 
>> immediately. That is a fundamental customer requirement. Render and 
>> compute engines have a 7.5s pre-emption timeout. The user should not 
>> have to wait 7.5s for a context to be removed from the system when 
>> they have explicitly killed it themselves. Even the regular timeout of 
>> 640ms is already borderline too long to wait. And note that there is an 
>> ongoing request/requirement to increase that to 1900ms.
>>
>> Under what circumstances would a user expect anything sensible to 
>> happen after a Ctrl+C in terms of things finishing their rendering and 
>> displaying nice pretty images? They killed the app. They want it dead. We 
>> should be getting it off the hardware as quickly as possible. If you 
>> are really concerned about resets causing collateral damage then maybe 
>> bump the termination timeout from 1ms up to 10ms, maybe at most 100ms. 
>> If an app is 'well behaved' then it should cleanly exit within 10ms. 
>> But if it is bad (which is almost certainly the case if the user is 
>> manually and explicitly killing it) then it needs to be killed because 
>> it is not going to gracefully exit.
> 
> Right... I had it like that initially (a lower timeout - I think 20ms 
> or so; the patch history on the mailing list would know for sure), but 
> then simplified it after review feedback to avoid adding another 
> timeout value.
> 
> So it's not at all about any expectation that something should actually 
> finish to any sort of completion/success. It is primarily about not 
> logging an error message when there is no error. Thing to keep in mind 
> is that error messages are a big deal in some cultures. In addition to 
> that, avoiding needless engine resets is a good thing as well.
> 
> Previously the execlists backend was over-eager and allowed only 1ms 
> for such contexts to exit. If the context was banned, sure - that means 
> it was a bad context which was causing many hangs already. But if the 
> context was a clean one, I argue there is no point in doing an engine reset.
> 
> So if you want, I think it is okay to re-introduce a secondary timeout.
> 
> Or if you have an idea on how to avoid the error messages / GPU resets 
> when "friendly" contexts exit in some other way, that is also something 
> to discuss.
> 
>> Secondly, the whole persistence thing is a total mess, completely 
>> broken and intended to be massively simplified. See the internal task 
>> for it. In short, the plan is that all contexts will be immediately 
>> killed when the last DRM file handle is closed. Persistence is only 
>> valid between the time the per context file handle is closed and the 
>> time the master DRM handle is closed, whereas non-persistent contexts 
>> get killed as soon as the per context handle is closed. There is 
>> absolutely no connection to heartbeats or other irrelevant operations.
> 
> The change we are discussing is not about persistence. As for 
> persistence itself - I am not sure it is completely broken, nor if, or 
> when, the internal task will result in anything being attempted. In 
> the meantime we have had unhappy customers for more than a year. So do 
> we tell them "please wait a few years more until some internal task 
> with no clear timeline or anyone assigned maybe gets looked at"?
> 
>> So in my view, the best option is to revert the ban vs revoke patch. 
>> It is creating bugs. It is making persistence more complex not 
>> simpler. It harms the user experience.
> 
> I am not aware of the bugs, even less so of them harming the user 
> experience!?
> 
> Bugs are limited to the GuC backend or in general? My CI runs were clean 
> so maybe test cases are lacking. Is it just a case of 
> s/intel_context_is_banned/intel_context_is_schedulable/ in there to fix it?
> 
> Again, the change was not about persistence. It is the opposite - 
> allowing non-persistent contexts to exit cleanly.
> 
>> If the original problem was simply that error captures were being done 
>> on Ctrl+C then the fix is simple. Don't capture for a banned context. 
>> There is no need for all the rest of the revoke patch.
> 
> Error capture was not part of the original story, so it may be a 
> completely orthogonal topic that we are discussing in this thread.

Wouldn't it be good then to separate these two issues: 
banned/exiting/schedulable handling, and error capture for exiting contexts?
This patch handles only the latter, and as I understand it there is no big 
controversy that we do not need to capture errors for exiting contexts.
If so, can we ack/merge this patch to make CI happy, and continue the 
discussion on the former?

Regards
Andrzej


> 
> Regards,
> 
> Tvrtko


