[PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out

Nilawar, Badal badal.nilawar at intel.com
Fri Oct 4 10:06:49 UTC 2024



On 04-10-2024 12:20, Matthew Brost wrote:
> On Thu, Oct 03, 2024 at 03:24:12PM +0530, Nilawar, Badal wrote:
>>
>>
>> On 02-10-2024 19:34, Matthew Brost wrote:
>>> On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
>>>>
>>>>
>>>> On 28-09-2024 02:57, Matthew Brost wrote:
>>>>> On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
>>>>>> It is observed that for a GuC CT request the G2H IRQ is triggered and the
>>>>>> g2h_worker is queued, but it doesn't get the opportunity to execute before
>>>>>> the timeout occurs. To address this, the g2h_worker is flushed.
>>>>>>
>>>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>>>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
>>>>>>     1 file changed, 11 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> index 4b95f75b1546..4a5d7f85d1a0 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>>>>>     	}
>>>>>>     	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * It is observed that for the above GuC CT request the G2H IRQ is triggered
>>>>>
>>>>> Where is this observed? 1 second is a long time to wait for a worker...
>>>>
>>>> Please see this log.
>>>>
>>>
>>> Logs are good, but explaining the test case is also helpful so I don't
>>> have to reverse engineer things. Having platform information would be
>>> helpful too. So what is the test case here, and on what platform?
>>
>> Sorry, my bad, I should have added the issue id to the commit message:
>> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
>>
>> This issue is reported on LNL for the xe_gt_freq@freq_reset_multiple test
>> and the xe_pm@* tests during the resume flow.
>>
>>>> [  176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0]
>>>> [  176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0]
>>>> [  176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
>>>
>>> This looks like we are doing a GT reset and that is causing problems.
>>> This patch is likely papering over an issue with our GT flows, so it
>>> doesn't seem correct to me. Let's try to figure out what is going wrong
>>> in the reset flow.
>>
>> This is seen for the SLPC query after "reset done" as well.
>>
>>>> [  176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query    //GuC PC
>>>> check request
>>>> [  176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]  // IRQ
>>>> [  177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
>>>> fence 1311, action 3003  //Timeout
>>>> [  177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
>>>> failed: -ETIME
>>>> [  177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
>>>
>>> Here this is almost 1 second after 'trying reset', and I'm not sure how
>>> that could happen looking at the upstream source code.
>>> 'xe_uc_reset_prepare' is called between 'trying reset' and 'reset
>>> queued', but that doesn't wait anywhere; rather, it resolves to the
>>> function below:
>>>
>>> 1769 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> 1770 {
>>> 1771         int ret;
>>> 1772
>>> 1773         /*
>>> 1774          * Using an atomic here rather than submission_state.lock as this
>>> 1775          * function can be called while holding the CT lock (engine reset
>>> 1776          * failure). submission_state.lock needs the CT lock to resubmit jobs.
>>> 1777          * Atomic is not ideal, but it works to prevent against concurrent reset
>>> 1778          * and releasing any TDRs waiting on guc->submission_state.stopped.
>>> 1779          */
>>> 1780         ret = atomic_fetch_or(1, &guc->submission_state.stopped);
>>> 1781         smp_wmb();
>>> 1782         wake_up_all(&guc->ct.wq);
>>> 1783
>>> 1784         return ret;
>>> 1785 }
>>
>> And the CT is not disabled yet, so the SLPC query will go through.
>>
> 
> I agree CT should not be disabled at this point.
> 
>>>
>>> Is this log from an internal repo or something? This looks like some
>>> sort of circular dependency: a GT reset starts, the G2H handler
>>> doesn't get queued because the CT channel is disabled, the G2H times
>>> out, and the reset stalls waiting for the timeout.
>>
>> This log was captured on LNL, with debug prints added, by running
>> xe_gt_freq@freq_reset_multiple.
>>
>> If the CT channel were disabled then we would not see "G2H fence (1311)
>> not found!".
>>
>> During the xe pm resume flow this is seen in guc_pc_start->pc_init_freqs().
>>
> 
> Ok, I was spitballing ideas - if this is upstream then the CT should be
> alive, but somehow the worker that processes the CT appears to be getting
> stalled.
> 
> Also, the time gap between 'trying reset' and 'reset queued' is very
> suspect.
> 
> This patch doesn't look like the solution. Can you look into the hints
> I've given here?

Ok, I will look into it.
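
For reference, the flush-and-short-re-wait variant you suggested would look
roughly like this (sketch only, not tested; the names ct->g2h_worker,
ct->g2h_fence_wq and g2h_fence.done are taken from the patch above, and
msecs_to_jiffies() is used so the retry interval does not depend on CONFIG_HZ):

	if (!ret) {
		/*
		 * The G2H IRQ fired and queued the worker, but the worker has
		 * not run yet. Flush it and re-check the fence with a short
		 * timeout; a successful flush should already have signalled
		 * g2h_fence.done.
		 */
		flush_work(&ct->g2h_worker);
		ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done,
					 msecs_to_jiffies(1));
	}

Note that msecs_to_jiffies(1) rounds up to at least one jiffy, so with
CONFIG_HZ=250 the retry is ~4 ms rather than 1 ms, which still seems acceptable.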

Badal

> 
> Matt
>   
>>
>>>
>>>> [  177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  178.632672] xe 0000:00:02.0: [drm] GT0: reset started
>>>> [  178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0] // Worker ran
>>>> [  178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
>>>>
>>>>>
>>>>>> +	 * and the g2h_worker is queued, but it doesn't get the opportunity to
>>>>>> +	 * execute before the timeout occurs. To address this, flush the g2h_worker.
>>>>>> +	 */
>>>>>> +	if (!ret) {
>>>>>> +		flush_work(&ct->g2h_worker);
>>>>>> +		ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>>>
>>>>> If this is needed I wouldn't wait 1 second; if the flush worked,
>>>>> 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
>>>>
>>>> In the config HZ is set to 250, which is 4 ms I think.
>>>>
>>>
>>> HZ should always be one second [1].
>>>
>>> [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
>>>
>>>> CONFIG_HZ_250=y
>>>> # CONFIG_HZ_300 is not set
>>>> # CONFIG_HZ_1000 is not set
>>>> CONFIG_HZ=250
>>>>
>>>
>>> I'm a little confused about how this Kconfig works [2], but I don't think
>>> it actually changes the length of time HZ represents; rather, it changes
>>> how many jiffies are in one second.
>>>
>>> [2] https://lwn.net/Articles/56378/
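
Right - so a timeout of HZ jiffies is always roughly one second of real time,
whatever CONFIG_HZ is set to; only the jiffy granularity changes. A quick
illustration (the variable names below are just for the example):

	/* With CONFIG_HZ=250: 1 jiffy = 4 ms, and HZ = 250 jiffies = ~1 s */
	unsigned long one_second = HZ;
	/* msecs_to_jiffies() rounds up, so this is 1 jiffy (4 ms) at HZ=250 */
	unsigned long one_ms = msecs_to_jiffies(1);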
>>
>> Oh ok, thanks for the clarification.
>>
>> Regards,
>> Badal
>>
>>>
>>> Matt
>>>
>>>> Regards,
>>>> Badal
>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>> +	}
>>>>>> +
>>>>>>     	if (!ret) {
>>>>>>     		xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
>>>>>>     			  g2h_fence.seqno, action[0]);
>>>>>> -- 
>>>>>> 2.34.1
>>>>>>
>>>>
>>


