[Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Oct 3 07:53:37 UTC 2022
On 30/09/2022 18:44, John Harrison wrote:
> On 9/30/2022 02:22, Tvrtko Ursulin wrote:
>> On 29/09/2022 17:21, John Harrison wrote:
>>> On 9/29/2022 00:42, Tvrtko Ursulin wrote:
>>>> On 29/09/2022 03:18, John.C.Harrison at Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>
>>>>> Compute workloads are inherently not pre-emptible for long periods on
>>>>> current hardware. As a workaround for this, the pre-emption timeout
>>>>> for compute capable engines was disabled. This is undesirable with GuC
>>>>> submission as it prevents per engine reset of hung contexts. Hence the
>>>>> next patch will re-enable the timeout but bumped up by an order of
>>>>> magnitude.
>>>>>
>>>>> However, the heartbeat might not respect that. Depending upon current
>>>>> activity, a pre-emption to the heartbeat pulse might not even be
>>>>> attempted until the last heartbeat period. Which means that only one
>>>>> period is granted for the pre-emption to occur. With the aforesaid
>>>>> bump, the pre-emption timeout could be significantly larger than this
>>>>> heartbeat period.
>>>>>
>>>>> So adjust the heartbeat code to take the pre-emption timeout into
>>>>> account. When it reaches the final (high priority) period, it now
>>>>> ensures the delay before hitting reset is bigger than the pre-emption
>>>>> timeout.
>>>>>
>>>>> v2: Fix for selftests which adjust the heartbeat period manually.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>> ---
>>>>> .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 19
>>>>> +++++++++++++++++++
>>>>> 1 file changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> index a3698f611f457..823a790a0e2ae 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> @@ -22,9 +22,28 @@
>>>>> static bool next_heartbeat(struct intel_engine_cs *engine)
>>>>> {
>>>>> + struct i915_request *rq;
>>>>> long delay;
>>>>> delay = READ_ONCE(engine->props.heartbeat_interval_ms);
>>>>> +
>>>>> + rq = engine->heartbeat.systole;
>>>>> +
>>>>> + if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>>>> + delay == engine->defaults.heartbeat_interval_ms) {
>>>>
>>>> Maybe I forgot but what is the reason for the check against the
>>>> default heartbeat interval?
>>> That's the 'v2: fix for selftests that manually adjust the
>>> heartbeat'. If something (or someone) has explicitly set an override
>>> of the heartbeat then it has to be assumed that they know what they
>>> are doing, and if things don't work any more that's their problem.
>>> But if we don't respect their override then they won't get the
>>> timings they expect and the selftest will fail.
>>
>> Isn't this a bit too strict for the non-selftest case? If the new
>> concept is extending the last pulse to guarantee preemption, then I
>> think we could allow tweaking of the heartbeat period. Like what if
>> user wants 1s, or 10s instead of 2.5s - why would that need to break
>> the improvement from this patch?
> Then the user is back to where they were before this patch.
>
>>
>> In what ways selftests fail? Are they trying to guess time to reset
>> based on the hearbeat period set? If so perhaps add a helper to query
>> it based on the last pulse extension.
>
> I don't recall. It was six months ago when I was actually working on
> this. And right now I do not have the time to go back and re-run all the
> testing and re-write a bunch of self tests with whole new helpers and
> algorithms and whatever else might be necessary to polish this to
> perfection. And in the meantime, all the existing issues are still
> present - there is no range checking on any of this stuff, it is very
> possible for a driver with default settings to break a legal workload
> because the heartbeat and pre-emption are fighting with each other, we
> don't even have per engine resets enabled, etc.
>
> Maybe it could be even better with a follow up patch. Feel free to do
> that. But as it stands, this patch set significantly improves the
> situation without making anything worse.
As we seem to be in agreement that the check against default heartbeat
is a hack with only purpose to work around assumptions made by
selftests, then please file a Jira about removing it (this hack). Then
work can be assigned to someone to clean it up. With that done I would
agree the series is indeed an improvement and it would have my ack.
Regards,
Tvrtko
More information about the Intel-gfx
mailing list