[Intel-gfx] [PATCH] drm/i915: fix exiting context timeout calculation
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Dec 2 13:13:11 UTC 2022
On 02/12/2022 12:19, Andrzej Hajda wrote:
> On 02.12.2022 10:14, Tvrtko Ursulin wrote:
>>
>> On 01/12/2022 16:36, Andrzej Hajda wrote:
>>> On 01.12.2022 11:28, Tvrtko Ursulin wrote:
>>>>
>>>> On 01/12/2022 00:22, John Harrison wrote:
>>>>> On 11/29/2022 00:43, Tvrtko Ursulin wrote:
>>>>>> On 28/11/2022 16:52, Andrzej Hajda wrote:
>>>>>>> In case a context is exiting, preempt_timeout_ms is used for the
>>>>>>> timeout, but since the introduction of DRM_I915_PREEMPT_TIMEOUT_COMPUTE
>>>>>>> it has increased to 7.5 seconds. The heartbeat occurs earlier but is
>>>>>>> still 2.5s.
>>>>>>>
>>>>>>> Fixes: d7a8680ec9fb21 ("drm/i915: Improve long running compute
>>>>>>> w/a for GuC submission")
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2410
>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>>>>> ---
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am not sure what the expected solution is here, or whether my patch
>>>>>>> actually reverts the intentions of patch d7a8680ec9fb21. Feel free to
>>>>>>> propose something better.
>>>>>>> Another alternative would be to increase the timeouts in the IGT
>>>>>>> tests, but I am not sure that is a good direction.
>>>>>>
>>>>>> Is it the hack with the FIXME marker from 47daf84a8bfb ("drm/i915:
>>>>>> Make the heartbeat play nice with long pre-emption timeouts") that
>>>>>> actually breaks things? (If IGT modifies the preempt timeout the
>>>>>> heartbeat extension will not work as intended.)
>>>>>>
>>>>>> If so, I think we agreed during review that this was a weakness which
>>>>>> needs to be addressed, but I would need to re-read the old threads to
>>>>>> remember what the plan was. Regardless of what it was, it may be that
>>>>>> now is the time to continue with those improvements.
>>>>>>
>>>>> What is the actual issue? Just that closing contexts are taking
>>>>> forever to actually close? That would be the whole point of the
>>>>> 'context_is_exiting' patch. Which I still totally disagree with.
>>>>>
>>>>> If the context is being closed 'gracefully' and it is intended that
>>>>> it should be allowed time to pre-empt without being killed via an
>>>>> engine reset then the 7.5s delay is required. That is the
>>>>> officially agreed upon timeout to allow compute capable contexts to
>>>>> reach a pre-emption point before they should be killed. If an IGT
>>>>> is failing because it enforces a shorter timeout then the IGT needs
>>>>> to be updated to account for the fact that i915 has to support slow
>>>>> compute workloads.
>>>>>
>>>>> If the context is being closed 'forcefully' and should be killed
>>>>> immediately then you should be using the 'BANNED_PREEMPT_TIMEOUT'
>>>>> value, not the sysfs/config value.
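As a purely schematic illustration of that split (the constant name and value below are placeholders, not the driver's actual code):

#include <stdio.h>

/* BANNED_PREEMPT_TIMEOUT_MS is a placeholder name and value, not the
 * driver's constant. */
#define BANNED_PREEMPT_TIMEOUT_MS 1

static unsigned long revoke_timeout_ms(int forcefully_banned,
				       unsigned long sysfs_preempt_timeout_ms)
{
	/* A forcefully killed (banned) context is reset almost immediately... */
	if (forcefully_banned)
		return BANNED_PREEMPT_TIMEOUT_MS;

	/* ...while a gracefully exiting one gets the full agreed pre-emption
	 * window (7.5s on compute capable engines). */
	return sysfs_preempt_timeout_ms;
}

int main(void)
{
	printf("graceful: %lu ms, forceful: %lu ms\n",
	       revoke_timeout_ms(0, 7500), revoke_timeout_ms(1, 7500));
	return 0;
}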
>>>>>
>>>>> Regarding heartbeats...
>>>>>
>>>>> The heartbeat period is 2.5s. But there are up to five heartbeat
>>>>> periods between the heartbeat starting and it declaring a hang. The
>>>>> patch you mention also introduced a check on the pre-emption
>>>>> timeout when the last period starts. If the pre-emption timeout is
>>>>> longer than the heartbeat period then the last period is extended
>>>>> to guarantee that a full pre-emption time is granted before
>>>>> declaring the hang.
>>>>>
>>>>> Are you saying that a heartbeat timeout is occurring and killing
>>>>> the system? Or are you just worried that something doesn't align
>>>>> correctly?
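To make that timing concrete, here is a tiny userspace model of the calculation as described above (not the actual i915 code - the helper and names are made up, the defaults are the ones from the sysfs dumps further below):

#include <stdio.h>

/* Rough model only: defaults as reported in the engine dumps below. */
#define HEARTBEAT_INTERVAL_MS 2500
#define PREEMPT_TIMEOUT_MS    7500
#define HEARTBEAT_PERIODS     5	/* periods before a hang is declared */

static unsigned int hang_deadline_ms(unsigned int interval_ms,
				     unsigned int preempt_ms)
{
	unsigned int last = interval_ms;

	/* The final period is stretched so that a full pre-emption window
	 * fits before the hang is declared. */
	if (preempt_ms > last)
		last = preempt_ms;

	return (HEARTBEAT_PERIODS - 1) * interval_ms + last;
}

int main(void)
{
	/* With the defaults above: 4 * 2500 + 7500 = 17500 ms. */
	printf("hang declared after ~%u ms\n",
	       hang_deadline_ms(HEARTBEAT_INTERVAL_MS, PREEMPT_TIMEOUT_MS));
	return 0;
}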
>>>>
>>>> I leave this to Andrzej since I am not the one debugging this. I
>>>> just glanced over the IGT and saw that there's code in there which
>>>> sets both the preempt timeout and heartbeat interval to non-default
>>>> values. And then I remembered this:
>>>
>>> The test is gem_ctx_persistence@many-contexts. It does not modify sysfs
>>> timeouts, but it assumes 1 second is enough to wait for an exiting,
>>> non-preemptible context. It works with bcs, vcs and vecs, but fails on
>>> rcs since that has the timeout set to 7.5 seconds (btw it works with GuC
>>> submission enabled). It seemed somehow inconsistent to me, but if this is
>>> how it is supposed to work I will just adjust the test.
>>
>> This looks odd then. That test is using non-preemptable spinners and
>> AFAICT it keeps submitting them for 30s, across all engines, and then
>> it stops and waits for one second for all of them to exit.
>>
>> With the 7.5s preempt timeout I'd expect the test to fail both with GuC
>> and execlists.
>
> OK, my claim that it works with GuC was not verified well enough - it was
> based on just one test machine.
>
>>
>> What should happen is that every context is marked as "exiting" and is
>> revoked. On the next scheduling event they would all be dropped.
>>
>> So I think there are two questions. First, how did the increase of the
>> preempt timeout to 7.5s pass CI - is the failure sporadic, for instance?
>
> After some data mining in cibuglog for the last month I can say the
> results are mostly consistent per machine.
> On most machines it always passes.
> It always fails on shard-tgl*, shard-rkl-{1,2,3,4,6} (but on shard-rkl-5
> it always passes), fi-adl-ddr5, fi-kbl-soraka and fi-rkl-11600.
> On re-dg2-{11,12,15} the results are inconsistent - some passes, some fails.
>
>>
>> Second question - you are saying the test always passes with GuC - how
>> does GuC manage to revoke a non-preemptible spinner in less than one
>> second if the preempt timeout is 7.5s? Colour me confused.
>>
>> Anyway, those questions are secondary. The fix here, I think, is pretty
>> obviously for many_contexts() to fetch the preempt timeout from sysfs and
>> allow for that much time (plus a safety factor). Use the longest timeout
>> across all engines, since all of them are submitted to.
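A rough sketch of the idea (assuming the per-engine properties are exposed under /sys/class/drm/card0/engine/; the card index and the safety margin are left to the test):

#include <dirent.h>
#include <stdio.h>

/* Return the longest preempt_timeout_ms across all engines of card0. */
static unsigned long max_preempt_timeout_ms(void)
{
	const char *base = "/sys/class/drm/card0/engine";
	unsigned long max_ms = 0;
	struct dirent *de;
	DIR *dir = opendir(base);

	if (!dir)
		return 0;

	while ((de = readdir(dir))) {
		char path[512];
		unsigned long ms;
		FILE *f;

		if (de->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path), "%s/%s/preempt_timeout_ms",
			 base, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%lu", &ms) == 1 && ms > max_ms)
			max_ms = ms;
		fclose(f);
	}
	closedir(dir);

	return max_ms;
}

int main(void)
{
	/* The test would then wait for this long plus some safety margin,
	 * instead of the hardcoded one second. */
	printf("longest engine preempt timeout: %lu ms\n",
	       max_preempt_timeout_ms());
	return 0;
}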
>
> With the wait increased to 10 seconds the issue disappeared on the two RIL
> machines used for testing, but I will post the patch on try-bot to check
> the other machines as well.
>
> One more thing, just to be sure: as I understand it, a reset due to a
> stopped heartbeat should not happen with a 7.5s preemption timeout if the
> test does not adjust any timeouts? If so, then something is wrong anyway.
> See the sample logs from dg2 showing what happens: pass [1], fail [2].
> In both cases there are 22 "heartbeat * not ticking" log lines, all on
> engines with a 7.5s preempt_timeout_ms (rcs, ccs).
>
> [1]:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12449/re-dg2-12/igt@gem_ctx_persistence@many-contexts.html
<6> [278.652441] [IGT] gem_ctx_persistence: executing
<7> [278.663699] i915 0000:03:00.0: [drm:i915_gem_open [i915]]
<7> [278.664362] i915 0000:03:00.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x0000005c [0x0000005c]
<7> [278.664496] i915 0000:03:00.0: [drm:i915_gem_open [i915]]
<7> [278.664992] i915 0000:03:00.0: [drm:i915_gem_open [i915]]
<6> [278.670968] [IGT] gem_ctx_persistence: starting subtest many-contexts
<7> [278.671164] [drm:eb_lookup_vmas [i915]] EINVAL at eb_validate_vma:505
<7> [278.686769] i915 0000:03:00.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x000001dc [0x000001dc]
<6> [279.367025] i915 0000:03:00.0: [drm] Ignoring context reset notification of exiting context 0x100C on bcs0
<6> [279.368863] i915 0000:03:00.0: [drm] Ignoring context reset notification of exiting context 0x100D on vcs0
<6> [279.370813] i915 0000:03:00.0: [drm] Ignoring context reset notification of exiting context 0x100E on vcs1
<6> [279.373360] i915 0000:03:00.0: [drm] Ignoring context reset notification of exiting context 0x100F on vecs0
<6> [279.376086] i915 0000:03:00.0: [drm] Ignoring context reset notification of exiting context 0x1010 on vecs1
<7> [281.964427] heartbeat ccs3 heartbeat {seqno:e:190, prio:2147483646} not ticking
<7> [281.964449] heartbeat Awake? 2
<7> [281.964457] heartbeat Barriers?: no
<7> [281.964465] heartbeat Latency: 0us
<7> [281.964499] heartbeat Runtime: 292051ms
<7> [281.964507] heartbeat Forcewake: 0 domains, 0 active
<7> [281.964516] heartbeat Heartbeat: 3224 ms ago
<7> [281.964525] heartbeat Reset count: 0 (global 28)
<7> [281.964533] heartbeat Properties:
<7> [281.964539] heartbeat heartbeat_interval_ms: 2500 [default 2500]
<7> [281.964548] heartbeat max_busywait_duration_ns: 8000 [default 8000]
<7> [281.964556] heartbeat preempt_timeout_ms: 7500 [default 7500]
<7> [281.964564] heartbeat stop_timeout_ms: 100 [default 100]
<7> [281.964571] heartbeat timeslice_duration_ms: 1 [default 1]
<7> [281.964580] heartbeat Requests:
<7> [281.964663] heartbeat active in queueE e:190* prio=2147483646 @ 3224ms: [i915]
<7> [281.964677] heartbeat ring->start: 0xfecb0000
<7> [281.964685] heartbeat ring->head: 0x000008a0
<7> [281.964692] heartbeat ring->tail: 0x00000758
<7> [281.964698] heartbeat ring->emit: 0x00000760
<7> [281.964705] heartbeat ring->space: 0x00000100
This indeed looks super strange - the heartbeat is at max priority and yet reported as not ticking 3224 ms after it was first created. No idea. One for the GuC experts, if GuC is the only backend where this happens.

Then I looked at the failures on TGL you mention above:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_7078/shard-tglb3/igt@gem_ctx_persistence@many-contexts.html
<6> [331.908547] [IGT] gem_ctx_persistence: executing
<7> [331.910926] i915 0000:00:02.0: [drm:i915_gem_open [i915]]
<7> [331.911475] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x0000005c [0x0000005c]
<7> [331.911608] i915 0000:00:02.0: [drm:i915_gem_open [i915]]
<7> [331.911924] i915 0000:00:02.0: [drm:i915_gem_open [i915]]
<6> [331.915834] [IGT] gem_ctx_persistence: starting subtest many-contexts
...
<7> [335.249250] heartbeat rcs0 heartbeat {seqno:5:76, prio:2147483646} not ticking
<7> [335.249269] heartbeat Awake? 1338
<7> [335.249273] heartbeat Barriers?: no
<7> [335.249277] heartbeat Latency: 243us
<7> [335.249291] heartbeat Runtime: 10531ms
<7> [335.249294] heartbeat Forcewake: 0 domains, 0 active
<7> [335.249297] heartbeat Heartbeat: 3307 ms ago
<7> [335.249305] heartbeat Reset count: 0 (global 0)
<7> [335.249308] heartbeat Properties:
<7> [335.249310] heartbeat heartbeat_interval_ms: 2500 [default 2500]
<7> [335.249314] heartbeat max_busywait_duration_ns: 8000 [default 8000]
<7> [335.249318] heartbeat preempt_timeout_ms: 7500 [default 7500]
<7> [335.249322] heartbeat stop_timeout_ms: 100 [default 100]
<7> [335.249326] heartbeat timeslice_duration_ms: 1 [default 1]
<7> [335.249336] heartbeat Requests:
<7> [335.249387] heartbeat hungR 13a:2* prio=0 @ 3308ms: gem_ctx_persist<1773>
<7> [335.249393] heartbeat ring->start: 0x0149f000
<7> [335.249395] heartbeat ring->head: 0x00000000
<7> [335.249398] heartbeat ring->tail: 0x000000b0
<7> [335.249402] heartbeat ring->emit: 0x000000b8
<7> [335.249404] heartbeat ring->space: 0x00003f08
<7> [335.249407] heartbeat ring->hwsp: 0xfedc4000
Same thing. So either something is totally broken or I have totally forgotten how things are supposed to work.

There shouldn't be a "heartbeat stopped ticking" until 4x the heartbeat interval plus the preempt timeout have elapsed, and the dump nicely shows the current engine values for both, so it makes no sense.

Was there something left running before the test started? But drop_caches was done at test start...
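Just to spell out the expected numbers with the defaults shown in the dumps:

    4 * 2500 ms (heartbeat intervals) + 7500 ms (preempt timeout) = 17500 ms

yet both logs report the heartbeat as not ticking after only ~3.2-3.3 seconds.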
Regards,
Tvrtko
> [2]:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12457/re-dg2-12/igt@gem_ctx_persistence@many-contexts.html
>
> Regards
> Andrzej
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Regards
>>> Andrzej
>>>
>>>
>>>>
>>>> next_heartbeat():
>>>> ...
>>>>     /*
>>>>      * FIXME: The final period extension is disabled if the period has
>>>>      * been modified from the default. This is to prevent issues with
>>>>      * certain selftests which override the value and expect specific
>>>>      * behaviour. Once the selftests have been updated to either cope
>>>>      * with variable heartbeat periods (or to override the pre-emption
>>>>      * timeout as well, or just to add a selftest specific override of
>>>>      * the extension), the generic override can be removed.
>>>>      */
>>>>     if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>>>         delay == engine->defaults.heartbeat_interval_ms) {
>>>>
>>>> Which then wouldn't do the right thing with the last heartbeat pulse
>>>> extension, if the IGT were relying on that. Don't know, just pointing it
>>>> out so you can check and see whether this FIXME needs to be prioritised.
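For illustration, a standalone model of what that condition gates (simplified field names, not the real next_heartbeat()):

#include <stdio.h>

/* Standalone model of the final-pulse extension, not the real function. */
struct engine_props {
	unsigned long heartbeat_interval_ms;
	unsigned long preempt_timeout_ms;
};

static unsigned long final_pulse_delay_ms(const struct engine_props *props,
					  const struct engine_props *defaults)
{
	unsigned long delay = props->heartbeat_interval_ms;

	/* The extension only applies while the interval still has its default
	 * value, so a test which tweaks heartbeat_interval_ms silently loses
	 * the guarantee of a full pre-emption window on the last pulse. */
	if (delay == defaults->heartbeat_interval_ms &&
	    props->preempt_timeout_ms > delay)
		delay = props->preempt_timeout_ms;

	return delay;
}

int main(void)
{
	struct engine_props defaults = { 2500, 7500 };
	struct engine_props tweaked  = { 1000, 7500 };	/* interval modified */

	printf("default interval: final pulse extended to %lu ms\n",
	       final_pulse_delay_ms(&defaults, &defaults));
	printf("tweaked interval: final pulse stays at %lu ms\n",
	       final_pulse_delay_ms(&tweaked, &defaults));
	return 0;
}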
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>
>