[PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Nilawar, Badal
badal.nilawar at intel.com
Thu Oct 3 09:54:12 UTC 2024
On 02-10-2024 19:34, Matthew Brost wrote:
> On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
>>
>>
>> On 28-09-2024 02:57, Matthew Brost wrote:
>>> On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
>>>> It is observed that for a GuC CT request the G2H IRQ is triggered and
>>>> the g2h_worker is queued, but it doesn't get an opportunity to execute
>>>> before the timeout occurs. To address this, the g2h_worker is flushed.
>>>>
>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
>>>> 1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 4b95f75b1546..4a5d7f85d1a0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>>> }
>>>> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>> +
>>>> + /*
>>>> + * It is observed that for the above GuC CT request the G2H IRQ is triggered
>>>
>>> Where is this observed? 1 second is a long time to wait for a worker...
>>
>> Please see this log.
>>
>
> Logs are good, but explaining the test case is also helpful so I don't
> have to reverse engineer things. Having platform information would be
> helpful too. So what is the test case here, and on what platform?
Sorry, my bad, I should have added the issue id in the commit message:
https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620.
This issue is reported on LNL for the xe_gt_freq@freq_reset_multiple test
and for xe_pm@* tests during the resume flow.
>
>> [ 176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0]
>> [ 176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0]
>> [ 176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
>
> This looks like we are doing a GT reset and that is causing problems.
> This patch is likely papering over an issue with our GT flows, so it
> doesn't seem correct to me. Let's try to figure out what is going wrong
> in the reset flow.
This is seen for the SLPC query after "reset done" as well.
>> [ 176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query //GuC PC
>> check request
>> [ 176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0] // IRQ
>> [ 177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
>> fence 1311, action 3003 //Timeout
>> [ 177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
>> failed: -ETIME
>> [ 177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
>
> Here this is almost 1 second after 'trying reset', and I'm unsure how
> that could happen looking at the upstream source code.
> 'xe_uc_reset_prepare' is called between 'trying reset' and 'reset
> queued', but that doesn't wait anywhere; rather, it resolves to the
> below function:
>
> 1769 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> 1770 {
> 1771 int ret;
> 1772
> 1773 /*
> 1774 * Using an atomic here rather than submission_state.lock as this
> 1775 * function can be called while holding the CT lock (engine reset
> 1776 * failure). submission_state.lock needs the CT lock to resubmit jobs.
> 1777 * Atomic is not ideal, but it works to prevent against concurrent reset
> 1778 * and releasing any TDRs waiting on guc->submission_state.stopped.
> 1779 */
> 1780 ret = atomic_fetch_or(1, &guc->submission_state.stopped);
> 1781 smp_wmb();
> 1782 wake_up_all(&guc->ct.wq);
> 1783
> 1784 return ret;
> 1785 }
And the CT is not disabled yet, so the SLPC query will go through.
>
> Is this log from an internal repo or something? This looks like some
> sort of circular dependency where a GT reset starts and the G2H handler
> doesn't get queued because the CT channel is disabled, the G2H times
> out, and the reset stalls waiting for the timeout.
This log was captured on LNL, with debug prints added, by running
xe_gt_freq@freq_reset_multiple.
If the CT channel were disabled, we would not see "G2H fence (1311) not
found!".
During the xe pm resume flow this is seen in guc_pc_start()->pc_init_freqs().
>
>> [ 177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 178.632672] xe 0000:00:02.0: [drm] GT0: reset started
>> [ 178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0] // Worker ran
>> [ 178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
>>
>>>
>>>> + * and the g2h_worker is queued, but it doesn't get an opportunity
>>>> + * to execute before the timeout occurs. To address this, the
>>>> + * g2h_worker is flushed.
>>>> + */
>>>> + if (!ret) {
>>>> + flush_work(&ct->g2h_worker);
>>>> + ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>
>>> If this is needed, I wouldn't wait 1 second; if the flush worked,
>>> 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
>>
>> In the config HZ is set to 250, which is 4 ms I think.
>>
>
> HZ should always be one second [1].
>
> [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
>
>> CONFIG_HZ_250=y
>> # CONFIG_HZ_300 is not set
>> # CONFIG_HZ_1000 is not set
>> CONFIG_HZ=250
>>
>
> I'm a little confused about how this Kconfig works [2], but I don't
> think it actually changes the duration that HZ represents; rather, it
> changes how many jiffies are in one second.
>
> [2] https://lwn.net/Articles/56378/
Oh ok, thanks for the clarification.
Regards,
Badal
>
> Matt
>
>> Regards,
>> Badal
>>
>>>
>>> Matt
>>>
>>>> + }
>>>> +
>>>> if (!ret) {
>>>> xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
>>>> g2h_fence.seqno, action[0]);
>>>> --
>>>> 2.34.1
>>>>
>>