[Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Oct 3 07:53:37 UTC 2022


On 30/09/2022 18:44, John Harrison wrote:
> On 9/30/2022 02:22, Tvrtko Ursulin wrote:
>> On 29/09/2022 17:21, John Harrison wrote:
>>> On 9/29/2022 00:42, Tvrtko Ursulin wrote:
>>>> On 29/09/2022 03:18, John.C.Harrison at Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>
>>>>> Compute workloads are inherently not pre-emptible for long periods on
>>>>> current hardware. As a workaround for this, the pre-emption timeout
>>>>> for compute capable engines was disabled. This is undesirable with GuC
>>>>> submission as it prevents per engine reset of hung contexts. Hence the
>>>>> next patch will re-enable the timeout but bumped up by an order of
>>>>> magnitude.
>>>>>
>>>>> However, the heartbeat might not respect that. Depending upon current
>>>>> activity, a pre-emption to the heartbeat pulse might not even be
>>>>> attempted until the last heartbeat period. Which means that only one
>>>>> period is granted for the pre-emption to occur. With the aforesaid
>>>>> bump, the pre-emption timeout could be significantly larger than this
>>>>> heartbeat period.
>>>>>
>>>>> So adjust the heartbeat code to take the pre-emption timeout into
>>>>> account. When it reaches the final (high priority) period, it now
>>>>> ensures the delay before hitting reset is bigger than the pre-emption
>>>>> timeout.
>>>>>
>>>>> v2: Fix for selftests which adjust the heartbeat period manually.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>> ---
>>>>>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 19 
>>>>> +++++++++++++++++++
>>>>>   1 file changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> index a3698f611f457..823a790a0e2ae 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> @@ -22,9 +22,28 @@
>>>>>     static bool next_heartbeat(struct intel_engine_cs *engine)
>>>>>   {
>>>>> +    struct i915_request *rq;
>>>>>       long delay;
>>>>>         delay = READ_ONCE(engine->props.heartbeat_interval_ms);
>>>>> +
>>>>> +    rq = engine->heartbeat.systole;
>>>>> +
>>>>> +    if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>>>> +        delay == engine->defaults.heartbeat_interval_ms) {
>>>>
>>>> Maybe I forgot but what is the reason for the check against the 
>>>> default heartbeat interval?
>>> That's the 'v2: fix for selftests that manually adjust the 
>>> heartbeat'. If something (or someone) has explicitly set an override 
>>> of the heartbeat then it has to be assumed that they know what they 
>>> are doing, and if things don't work any more that's their problem. 
>>> But if we don't respect their override then they won't get the 
>>> timings they expect and the selftest will fail.
>>
>> Isn't this a bit too strict for the non-selftest case? If the new 
>> concept is extending the last pulse to guarantee preemption, then I 
>> think we could allow tweaking of the heartbeat period. Like what if 
>> user wants 1s, or 10s instead of 2.5s - why would that need to break 
>> the improvement from this patch?
> Then the user is back to where they were before this patch.
> 
>>
>> In what ways selftests fail? Are they trying to guess time to reset 
>> based on the hearbeat period set? If so perhaps add a helper to query 
>> it based on the last pulse extension.
> 
> I don't recall. It was six months ago when I was actually working on 
> this. And right now I do not have the time to go back and re-run all the 
> testing and re-write a bunch of self tests with whole new helpers and 
> algorithms and whatever else might be necessary to polish this to 
> perfection. And in the meantime, all the existing issues are still 
> present - there is no range checking on any of this stuff, it is very 
> possible for a driver with default settings to break a legal workload 
> because the heartbeat and pre-emption are fighting with each other, we 
> don't even have per engine resets enabled, etc.
> 
> Maybe it could be even better with a follow up patch. Feel free to do 
> that. But as it stands, this patch set significantly improves the 
> situation without making anything worse.

As we seem to be in agreement that the check against default heartbeat 
is a hack with only purpose to work around assumptions made by 
selftests, then please file a Jira about removing it (this hack). Then 
work can be assigned to someone to clean it up. With that done I would 
agree the series is indeed an improvement and it would have my ack.

Regards,

Tvrtko


More information about the Intel-gfx mailing list