[PATCH 1/1] drm/xe: Don't short circuit TDR on jobs not started
John Harrison
john.c.harrison at intel.com
Fri Oct 25 23:21:28 UTC 2024
On 10/25/2024 12:59, Matthew Brost wrote:
> On Fri, Oct 25, 2024 at 01:32:33PM -0600, Zanoni, Paulo R wrote:
>> On Wed, 2024-10-23 at 17:41 +0000, Matthew Brost wrote:
>>> On Wed, Oct 23, 2024 at 10:47:05AM -0600, Zanoni, Paulo R wrote:
>>>> On Tue, 2024-10-22 at 16:27 -0700, Matthew Brost wrote:
>>>>> Short circuiting the TDR on jobs that have not started is an
>>>>> optimization which is not required. On LNL we are facing an issue
>>>>> where jobs do not get scheduled by the GuC for an unknown reason.
>>>>> Removing this optimization allows jobs to get scheduled after the TDR
>>>>> fires once, which is a big improvement. Remove this optimization for
>>>>> now while root causing the job scheduling issue on LNL.
>>>> I just tested it and it seems to do what it promises. Thanks! Having a
>>>> 5 second hiccup is still horribly bad, but it is - checks math notes -
>>>> infinitely better than waiting forever for a syncobj that will never be
>>>> signaled.
>>>>
>>>> This patch will *tremendously* help Mesa CI, since we can reproduce
>>>> this bug all the time with Vulkan CTS tests.
>>>>
>>>> Suggestions:
>>>>
>>>> - Can we get a message in dmesg every time this hiccup happens? We're
>>>> not sure if it's happening on real workloads on people's machines, so
>>>> maybe having some sort of indication "oops, we just unstuck the batch
>>>> you submitted 300 frames ago!" would help.
>>>>
>>> We will add a 'notice' level message if this occurs.
>> I may be wrong, but from what I understand, the 'notice' level is
>> something that will *not* show up in people's dmesg if they are using
>> distros' default config. This message signals that a bug is happening,
>> so we need to make sure it appears in dmesg by default. The whole point
>> is to be able to figure out if this is happening in the wild. Can we
>> promote this to KERN_WARNING?
>>
> I'm honestly not sure what shows up where. 'notice' is the same level as
> our job timeout message, though. If we need to raise this level, the job
> timeout message should also be raised. To be safe, I'll roll both of
> these changes out in a series - I wanted to refactor my latest rev of
> this patch anyway.
>
> Matt
These are two different things, though. One (job timeout) is reporting a
bug in user code; the other (job not started) is a temporary hack to work
around a problem on a specific platform. Yes?

My understanding is that bad userland is not supposed to generate kernel
warnings. A warning (or error) is reserved for something that is not the
user's fault but is a problem with the system. So the regular job timeout
should not be at warning level. The temporary hack, on the other hand,
could be (if I understand the change correctly).
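For illustration, a minimal sketch of that split using the generic
drm_print.h helpers (the exact xe helper, wording, and fields such as
q->guc->id are assumptions here; Matt's follow-up series may well differ):

    /* User bug: an app hung the GPU - informational, notice level */
    drm_notice(&xe->drm, "Timedout job: guc_id=%d\n", q->guc->id);

    /* Platform workaround kicked in - a system problem, warning level */
    drm_warn(&xe->drm, "Job not started before timeout, guc_id=%d\n",
             q->guc->id);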
John.
>
>>>> - Since we don't know how long until the real fix, can this be tagged
>>>> for stable? If it turns out this requires a special GuC, it would be
>>>> even more valuable to have this in stable, since those fixes tend to
>>>> take longer to propagate to people's machines.
>>> I don't see any reason why this can't be backported; I'll include the required tags.
>>>
>>> Matt
>>>
>>>> Thanks a lot!
>>>>
>>>>> Cc: Paulo Zanoni <paulo.r.zanoni at intel.com>
>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/xe/xe_guc_submit.c | 4 ----
>>>>> 1 file changed, 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> index 0b81972ff651..25ab675e9c7d 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> @@ -1052,10 +1052,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>>> exec_queue_killed_or_banned_or_wedged(q) ||
>>>>> exec_queue_destroyed(q);
>>>>>
>>>>> - /* Job hasn't started, can't be timed out */
>>>>> - if (!skip_timeout_check && !xe_sched_job_started(job))
>>>>> - goto rearm;
>>>>> -
>>>>> /*
>>>>> * If devcoredump not captured and GuC capture for the job is not ready
>>>>> * do manual capture first and decide later if we need to use it
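For reference, the notice/warning discussed above could slot in right
where the short circuit used to be. A sketch only, assuming the xe
variable and q->guc->id are available in guc_exec_queue_timedout_job()
as in the existing code; the actual helper, level, and message text are
up to the follow-up series:

    /*
     * Sketch: with the short circuit removed, flag jobs the GuC never
     * started so the LNL scheduling issue is visible in dmesg.
     */
    if (!xe_sched_job_started(job))
        drm_warn(&xe->drm,
                 "Job timed out before starting, guc_id=%d\n",
                 q->guc->id);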