[PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Nilawar, Badal
badal.nilawar at intel.com
Thu Oct 3 09:54:12 UTC 2024
On 02-10-2024 19:34, Matthew Brost wrote:
> On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
>>
>>
>> On 28-09-2024 02:57, Matthew Brost wrote:
>>> On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
>>>> It is observed that for a GuC CT request the G2H IRQ is triggered and
>>>> the g2h_worker is queued, but it doesn't get an opportunity to execute
>>>> before the timeout occurs. To address this, the g2h_worker is flushed.
>>>>
>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
>>>> 1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 4b95f75b1546..4a5d7f85d1a0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>>> }
>>>> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>> +
>>>> + /*
>>>> + * It is observed that for the above GuC CT request the G2H IRQ is triggered
>>>
>>> Where is this observed? 1 second is a long time to wait for a worker...
>>
>> Please see this log.
>>
>
> Logs are good, but explaining the test case is also helpful so I don't
> have to reverse engineer things. Having platform information would be
> helpful too. So what is the test case here, and on what platform?
Sorry, my bad, I should have added the issue id in the commit message:
https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620.
This issue is reported on LNL for the xe_gt_freq@freq_reset_multiple test
and for xe_pm@* tests during the resume flow.
>
>> [ 176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0]
>> [ 176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0]
>> [ 176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
>
> This looks like we are doing a GT reset and that is causing problems.
> This patch is likely papering over an issue with our GT flows, so it
> doesn't seem correct to me. Let's try to figure out what is going wrong
> in the reset flow.
This is seen for the SLPC query after "reset done" as well.
>> [ 176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query //GuC PC
>> check request
>> [ 176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0] // IRQ
>> [ 177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
>> fence 1311, action 3003 //Timeout
>> [ 177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
>> failed: -ETIME
>> [ 177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
>
> Here this is almost 1 second after 'trying reset', and I'm unsure how
> that could happen looking at the upstream source code.
> 'xe_uc_reset_prepare' is called between 'trying reset' and 'reset
> queued', but that doesn't wait anywhere; rather, it resolves to the
> below function:
>
> 1769 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> 1770 {
> 1771 int ret;
> 1772
> 1773 /*
> 1774 * Using an atomic here rather than submission_state.lock as this
> 1775 * function can be called while holding the CT lock (engine reset
> 1776 * failure). submission_state.lock needs the CT lock to resubmit jobs.
> 1777 * Atomic is not ideal, but it works to prevent against concurrent reset
> 1778 * and releasing any TDRs waiting on guc->submission_state.stopped.
> 1779 */
> 1780 ret = atomic_fetch_or(1, &guc->submission_state.stopped);
> 1781 smp_wmb();
> 1782 wake_up_all(&guc->ct.wq);
> 1783
> 1784 return ret;
> 1785 }
And the CT is not disabled yet, so the SLPC query will go through.
>
> Is this log from an internal repo or something? This looks like some
> sort of circular dependency where a GT reset starts and the G2H handler
> doesn't get queued because the CT channel is disabled, the G2H times
> out, and the reset stalls waiting for the timeout.
This log was captured on LNL, with debug prints added, by running
xe_gt_freq@freq_reset_multiple.
If the CT channel were disabled, we would not see "G2H fence (1311) not
found!".
During the xe pm resume flow this is seen in guc_pc_start()->pc_init_freqs().
>
>> [ 177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>> GuC PC status query
>> [ 177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>> GT[0]
>> [ 178.632672] xe 0000:00:02.0: [drm] GT0: reset started
>> [ 178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>> running GT[0] // Worker ran
>> [ 178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
>>
>>>
>>>> + * and the g2h_worker is queued, but it doesn't get an opportunity
>>>> + * to execute before the timeout occurs. To address this, the
>>>> + * g2h_worker is flushed.
>>>> + */
>>>> + if (!ret) {
>>>> + flush_work(&ct->g2h_worker);
>>>> + ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>
>>> If this is needed, I wouldn't wait 1 second; if the flush worked,
>>> 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
>>
>> In the config HZ is set to 250, which is 4 ms I think.
>>
>
> HZ should always be one second [1].
>
> [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
>
>> CONFIG_HZ_250=y
>> # CONFIG_HZ_300 is not set
>> # CONFIG_HZ_1000 is not set
>> CONFIG_HZ=250
>>
>
> I'm a little confused about how this Kconfig works [2], but I don't
> think it actually changes the duration that HZ represents; rather, it
> changes how many jiffies are in one second.
>
> [2] https://lwn.net/Articles/56378/
Oh ok, thanks for the clarification.
Regards,
Badal
>
> Matt
>
>> Regards,
>> Badal
>>
>>>
>>> Matt
>>>
>>>> + }
>>>> +
>>>> if (!ret) {
>>>> xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
>>>> g2h_fence.seqno, action[0]);
>>>> --
>>>> 2.34.1
>>>>
>>