[Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
John Harrison
john.c.harrison at intel.com
Wed Sep 28 18:27:33 UTC 2022
On 9/28/2022 00:19, Tvrtko Ursulin wrote:
> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>> Hi Andrzej,
>>>>>>
>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>>>> it should
>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>> removal is a
>>>>>>> good example.
>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>> run faster.
>>>>>>>
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>>>> fine for me:
>>>>>>
>>>>>> Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>>>
>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>> the GuC folks? Daniele, John?
>>>>>>
>>>>>> Andi
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>> trace_intel_context_reset(ce);
>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>> - capture_error_state(guc, ce);
>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>> + capture_error_state(guc, ce);
>>>
>>> I am not sure here - if we have a persistent context which caused a
>>> GPU hang I'd expect we'd still want error capture.
>>>
>>> What causes the reset in the affected IGTs? Always preemption timeout?
>>>
>>>>>>> guc_context_replay(ce);
>>>>>
>>>>> You definitely don't want to replay requests of a context that is
>>>>> going away.
>>>>
>>>> My intention was to just avoid error capture, but that's even
>>>> better, only condition change:
>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>
>>> Yes that helper was intended to be used for contexts which should
>>> not be scheduled post exit or ban.
>>>
>>> Daniele - you say there are some misses in the GuC backend. Should
>>> most, or even all in intel_guc_submission.c be converted to use
>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>> be a level up from the backends. Backend should only distinguish
>>> between "should I run this or not", and not the reason.
>>
>> I think that all of them should be updated, but I'd like Matt B to
>> confirm as he's more familiar with the code than me.
>
> Right, that sounds plausible to me as well.
>
> One thing I forgot to mention - the only place where backend can care
> between "schedulable" and "banned" is when it picks the preempt
> timeout for non-schedulable contexts. This is to only apply the strict
> 1ms to banned (so bad or naught contexts), while the ones which are
> exiting cleanly get the full preempt timeout as otherwise configured.
> This solves the ugly user experience quirk where GPU resets/errors
> were logged upon exit/Ctrl-C of a well behaving application (using
> non-persistent contexts). Hopefully GuC can match that behaviour so
> customers stay happy.
>
> Regards,
>
> Tvrtko
The whole revoke vs ban thing seems broken to me.
First of all, if the user hits Ctrl+C we need to kill the context off
immediately. That is a fundamental customer requirement. Render and
compute engines have a 7.5s pre-emption timeout. The user should not
have to wait 7.5s for a context to be removed from the system when they
have explicitly killed it themselves. Even the regular timeout of 640ms
is borderline a long time to wait. And note that there is an ongoing
request/requirement to increase that to 1900ms.
Under what circumstances would a user expect anything sensible to happen
after a Ctrl+C in terms of things finishing their rendering and display
nice pretty images? They killed the app. They want it dead. We should be
getting it off the hardware as quickly as possible. If you are really
concerned about resets causing collateral damage then maybe bump the
termination timeout from 1ms up to 10ms, maybe at most 100ms. If an app
is 'well behaved' then it should cleanly exit within 10ms. But if it is
bad (which is almost certainly the case if the user is manually and
explicitly killing it) then it needs to be killed because it is not
going to gracefully exit.
Secondly, the whole persistence thing is a total mess, completely broken
and intended to be massively simplified. See the internal task for it.
In short, the plan is that all contexts will be immediately killed when
the last DRM file handle is closed. Persistence is only valid between
the time the per context file handle is closed and the time the master
DRM handle is closed. Whereas, non-persistent contexts get killed as
soon as the per context handle is closed. There is absolutely no
connection to heartbeats or other irrelevant operations.
So in my view, the best option is to revert the ban vs revoke patch. It
is creating bugs. It is making persistence more complex not simpler. It
harms the user experience.
If the original problem was simply that error captures were being done
on Ctrl+C then the fix is simple. Don't capture for a banned context.
There is no need for all the rest of the revoke patch.
John.
More information about the Intel-gfx
mailing list