[PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Thu Dec 19 16:59:20 UTC 2019

After check the code , in KFD side , should be simple just add the check 
in stop_cpsch code . For kiq, there is no return for WREG32 , so no easy 
way to check the return value . Maybe we can add kiq_status in struct 
amdgpu_kiq  to indicate the kiq is hang or not ,  in hdq_destroy 
function check this  kiq_status after acquire_queue , finish the destroy 
function is kiq is hang for SRIOV only .

Any comments ?

shaoyun.liu

On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:

> I see, thanks for the detail information.
> Normally when CP is hang, the hiq access to unmap the queue will 
> failed before driver call to the hqd_destroy. I think driver should 
> add the code to check the return value  and directly finish the 
> pre_reset in this case . If the hiq does not hang but kiq hang. We can 
> use the same logic in hqd_destroy function,  return in first access 
> failure instead go further.  With this change we probably can move the 
> pre_reset function back to normal .
> Felix, do you have any concerns or comments for the change.
>
> Regards
> Shaoyun.liu
> ------------------------------------------------------------------------
> *From:* Liu, Monk <Monk.Liu at amd.com>
> *Sent:* December 19, 2019 1:13:24 AM
> *To:* Liu, Shaoyun <Shaoyun.Liu at amd.com>; 
> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> >>> I would like to check why we need a special sequences for sriov on 
> this pre_reset. If possible, make it the same as bare metal mode 
> sounds better solution.
>
> Because before VF FLR calling function would lead to register access 
> through KIQ,  which will not complete because KIQ/GFX already hang by 
> that time
>
> >>> I don't remember any register access by amdkfd_pre_reset call,   
> please let me know if this assumption is wrong .
>
> Please check “void pm_uninit(struct packet_manager *pm)” which is 
> invoked inside of amdkfd_pre_reset() :
>
> It will call uninitialized() in kfd_kernel_queue.c file
>
> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>
> And finally it calls “static int kgd_hqd_destroy(…)” in 
> amdgpu_amdkfd_gfx_v10.c
>
> 539 {
>
> 540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
>
> 541     enum hqd_dequeue_request_type type;
>
> 542     unsigned long end_jiffies;
>
> 543     uint32_t temp;
>
> 544     struct v10_compute_mqd *m = get_mqd(mqd);
>
> 545
>
> 546 #if 0
>
> 547     unsigned long flags;
>
> 548     int retry;
>
> 549 #endif
>
> 550
>
> 551     acquire_queue(kgd, pipe_id, queue_id); //this introduce 
> register access via KIQ
>
> 552
>
> 553     if (m->cp_hqd_vmid == 0)
>
> 554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); 
> //this introduce register access via KIQ
>
> 555
>
> 556     switch (reset_type) {
>
> 557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>
> 558         type = DRAIN_PIPE;
>
> 559         break;
>
> 560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>
> 561         type = RESET_WAVES;
>
> 562         break;
>
> 563     default:
>
> 564         type = DRAIN_PIPE;
>
> 565         break;
>
> 566     }
>
> 624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), 
> type); //this introduce register access via KIQ
>
> 625
>
> 626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
>
> 627     while (true) {
>
> 628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); 
> //this introduce register access via KIQ
>
> 629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>
> 630             break;
>
> 631         if (time_after(jiffies, end_jiffies)) {
>
> 632             pr_err("cp queue preemption time out.\n");
>
> 633             release_queue(kgd);
>
> 634             return -ETIME;
>
> 635         }
>
> 636         usleep_range(500, 1000);
>
> 637     }
>
> 638
>
> 639     release_queue(kgd);
>
> 640     return 0;
>
> If we use the sequence from bare-metal, all above highlighted register 
> access will not work because KIQ/GFX already died by that time which 
> means the amdkfd_pre_reset() is actually  not working as expected.
>
> _____________________________________
>
> Monk Liu|GPU Virtualization Team |AMD
>
> sig-cloud-gpu
>
> *From:* Liu, Shaoyun <Shaoyun.Liu at amd.com>
> *Sent:* Thursday, December 19, 2019 12:30 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> I don't remember any register access by amdkfd_pre_reset call,   
> please let me know if this assumption is wrong .
> This function will use hiq to access CP, in case CP already hang, we 
> might not able to get the response from hw and will got a timeout. I 
> think kfd internal should handle this. Felix already have some 
> comments on that.
> I would like to check why we need a special sequences for sriov on 
> this pre_reset. If possible, make it the same as bare metal mode 
> sounds better solution.
>
> Regards
> Shaoyun.liu
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>
> *Sent:* December 18, 2019 10:52:47 PM
> *To:* Liu, Shaoyun <Shaoyun.Liu at amd.com <mailto:Shaoyun.Liu at amd.com>>; 
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
> <amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> Oh, by the way
>
> >>> Do we know the root cause why this function would ruin MEC ?
>
> Only we call this function right after VF FLR will ruin MEC and lead 
> to following KIQ ring test fail , and on bare-metal it is called 
> before gpu rest , so that's why on bare-metal we don't have this issue
>
> But the reason we cannot call it before VF FLR on SRIOV case was 
> already stated in this thread
>
> Thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, December 19, 2019 11:49 AM
> To: shaoyunl <shaoyun.liu at amd.com <mailto:shaoyun.liu at amd.com>>; 
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
> SRIOV
>
> Hi Shaoyun
>
> >>> Do we know the root cause why this function would ruin MEC ? From 
> the logic, I think this function should be called before FLR since we 
> need to disable the user queue submission first.
> Right now I don't know which detail step lead to KIQ ring test fail, I 
> totally agree with you that this func should be called before VF FLR, 
> but we cannot do it and the reason is described in The comment:
>
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
>
>
> >>> I remembered the function should use hiq to communicate with HW , 
> shouldn't use kiq to access HW registerm,  has this been changed ?
> Tis function use WREG32/RREG32 to do register access, like all other 
> functions in KMD,  and WREG32/RREG32 will let KIQ to do the register 
> access If we are under dynamic SRIOV  mode (means we are SRIOV VF and 
> isn't under full exclusive mode)
>
> You see that if you call this func before EVENT_5 (event 5 triggers VF 
> FLR) then it will run under dynamic mode and KIQ will handle the 
> register access, which is not an option Since ME/MEC probably already 
> hang ( if we are testing quark on gfx/compute rings)
>
> Do you have a good suggestion ?
>
> thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org 
> <mailto:amd-gfx-bounces at lists.freedesktop.org>> On Behalf Of shaoyunl
> Sent: Tuesday, December 17, 2019 11:38 PM
> To: amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
> SRIOV
>
> I think amdkfd side depends on this call to stop the user queue, 
> without this call, the user queue can submit to HW during the reset 
> which could cause hang again ...
> Do we know the root cause why this function would ruin MEC ? From the 
> logic, I think this function should be called before FLR since we need 
> to disable the user queue submission first.
> I remembered the function should use hiq to communicate with HW , 
> shouldn't use kiq to access HW registerm,  has this been changed ?
>
>
> Regards
> shaoyun.liu
>
>
> On 2019-12-17 5:19 a.m., Monk Liu wrote:
> > issues:
> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
> >
> > fix:
> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> > there is a limitation to block this sequence:
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
> >
> > so the best way right now is to simply remove it.
> >
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> >   1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 605cef6..ae962b9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
> >        if (r)
> >                return r;
> >
> > -     amdgpu_amdkfd_pre_reset(adev);
> > -
> >        /* Resume IP prior to SMC */
> >        r = amdgpu_device_ip_reinit_early_sriov(adev);
> >        if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&reserved=0 
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639048992279&sdata=38z3sISWX26bZPplKeHvD0xIPCRbPAW%2BgKv2cXqetXc%3D&reserved=0>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639049012267&sdata=se3rrEVIDZa677riVu5MAf95y%2BxndiDw5BULScsxFBc%3D&reserved=0
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191219/9cde60b1/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 12243 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191219/9cde60b1/attachment-0001.png>