[PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
shaoyunl
shaoyun.liu at amd.com
Thu Dec 19 16:59:20 UTC 2019
After check the code , in KFD side , should be simple just add the check
in stop_cpsch code . For kiq, there is no return for WREG32 , so no easy
way to check the return value . Maybe we can add kiq_status in struct
amdgpu_kiq to indicate the kiq is hang or not , in hdq_destroy
function check this kiq_status after acquire_queue , finish the destroy
function is kiq is hang for SRIOV only .
Any comments ?
shaoyun.liu
On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:
> I see, thanks for the detail information.
> Normally when CP is hang, the hiq access to unmap the queue will
> failed before driver call to the hqd_destroy. I think driver should
> add the code to check the return value and directly finish the
> pre_reset in this case . If the hiq does not hang but kiq hang. We can
> use the same logic in hqd_destroy function, return in first access
> failure instead go further. With this change we probably can move the
> pre_reset function back to normal .
> Felix, do you have any concerns or comments for the change.
>
> Regards
> Shaoyun.liu
> ------------------------------------------------------------------------
> *From:* Liu, Monk <Monk.Liu at amd.com>
> *Sent:* December 19, 2019 1:13:24 AM
> *To:* Liu, Shaoyun <Shaoyun.Liu at amd.com>;
> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR
> of SRIOV
>
> >>> I would like to check why we need a special sequences for sriov on
> this pre_reset. If possible, make it the same as bare metal mode
> sounds better solution.
>
> Because before VF FLR calling function would lead to register access
> through KIQ, which will not complete because KIQ/GFX already hang by
> that time
>
> >>> I don't remember any register access by amdkfd_pre_reset call,
> please let me know if this assumption is wrong .
>
> Please check “void pm_uninit(struct packet_manager *pm)” which is
> invoked inside of amdkfd_pre_reset() :
>
> It will call uninitialized() in kfd_kernel_queue.c file
>
> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>
> And finally it calls “static int kgd_hqd_destroy(…)” in
> amdgpu_amdkfd_gfx_v10.c
>
> 539 {
>
> 540 struct amdgpu_device *adev = get_amdgpu_device(kgd);
>
> 541 enum hqd_dequeue_request_type type;
>
> 542 unsigned long end_jiffies;
>
> 543 uint32_t temp;
>
> 544 struct v10_compute_mqd *m = get_mqd(mqd);
>
> 545
>
> 546 #if 0
>
> 547 unsigned long flags;
>
> 548 int retry;
>
> 549 #endif
>
> 550
>
> 551 acquire_queue(kgd, pipe_id, queue_id); //this introduce
> register access via KIQ
>
> 552
>
> 553 if (m->cp_hqd_vmid == 0)
>
> 554 WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0);
> //this introduce register access via KIQ
>
> 555
>
> 556 switch (reset_type) {
>
> 557 case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>
> 558 type = DRAIN_PIPE;
>
> 559 break;
>
> 560 case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>
> 561 type = RESET_WAVES;
>
> 562 break;
>
> 563 default:
>
> 564 type = DRAIN_PIPE;
>
> 565 break;
>
> 566 }
>
> 624 WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
> type); //this introduce register access via KIQ
>
> 625
>
> 626 end_jiffies = (utimeout * HZ / 1000) + jiffies;
>
> 627 while (true) {
>
> 628 temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
> //this introduce register access via KIQ
>
> 629 if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>
> 630 break;
>
> 631 if (time_after(jiffies, end_jiffies)) {
>
> 632 pr_err("cp queue preemption time out.\n");
>
> 633 release_queue(kgd);
>
> 634 return -ETIME;
>
> 635 }
>
> 636 usleep_range(500, 1000);
>
> 637 }
>
> 638
>
> 639 release_queue(kgd);
>
> 640 return 0;
>
> If we use the sequence from bare-metal, all above highlighted register
> access will not work because KIQ/GFX already died by that time which
> means the amdkfd_pre_reset() is actually not working as expected.
>
> _____________________________________
>
> Monk Liu|GPU Virtualization Team |AMD
>
> sig-cloud-gpu
>
> *From:* Liu, Shaoyun <Shaoyun.Liu at amd.com>
> *Sent:* Thursday, December 19, 2019 12:30 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR
> of SRIOV
>
> I don't remember any register access by amdkfd_pre_reset call,
> please let me know if this assumption is wrong .
> This function will use hiq to access CP, in case CP already hang, we
> might not able to get the response from hw and will got a timeout. I
> think kfd internal should handle this. Felix already have some
> comments on that.
> I would like to check why we need a special sequences for sriov on
> this pre_reset. If possible, make it the same as bare metal mode
> sounds better solution.
>
> Regards
> Shaoyun.liu
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>
> *Sent:* December 18, 2019 10:52:47 PM
> *To:* Liu, Shaoyun <Shaoyun.Liu at amd.com <mailto:Shaoyun.Liu at amd.com>>;
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> <amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR
> of SRIOV
>
> Oh, by the way
>
> >>> Do we know the root cause why this function would ruin MEC ?
>
> Only we call this function right after VF FLR will ruin MEC and lead
> to following KIQ ring test fail , and on bare-metal it is called
> before gpu rest , so that's why on bare-metal we don't have this issue
>
> But the reason we cannot call it before VF FLR on SRIOV case was
> already stated in this thread
>
> Thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, December 19, 2019 11:49 AM
> To: shaoyunl <shaoyun.liu at amd.com <mailto:shaoyun.liu at amd.com>>;
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of
> SRIOV
>
> Hi Shaoyun
>
> >>> Do we know the root cause why this function would ruin MEC ? From
> the logic, I think this function should be called before FLR since we
> need to disable the user queue submission first.
> Right now I don't know which detail step lead to KIQ ring test fail, I
> totally agree with you that this func should be called before VF FLR,
> but we cannot do it and the reason is described in The comment:
>
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
>
>
> >>> I remembered the function should use hiq to communicate with HW ,
> shouldn't use kiq to access HW registerm, has this been changed ?
> Tis function use WREG32/RREG32 to do register access, like all other
> functions in KMD, and WREG32/RREG32 will let KIQ to do the register
> access If we are under dynamic SRIOV mode (means we are SRIOV VF and
> isn't under full exclusive mode)
>
> You see that if you call this func before EVENT_5 (event 5 triggers VF
> FLR) then it will run under dynamic mode and KIQ will handle the
> register access, which is not an option Since ME/MEC probably already
> hang ( if we are testing quark on gfx/compute rings)
>
> Do you have a good suggestion ?
>
> thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org
> <mailto:amd-gfx-bounces at lists.freedesktop.org>> On Behalf Of shaoyunl
> Sent: Tuesday, December 17, 2019 11:38 PM
> To: amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of
> SRIOV
>
> I think amdkfd side depends on this call to stop the user queue,
> without this call, the user queue can submit to HW during the reset
> which could cause hang again ...
> Do we know the root cause why this function would ruin MEC ? From the
> logic, I think this function should be called before FLR since we need
> to disable the user queue submission first.
> I remembered the function should use hiq to communicate with HW ,
> shouldn't use kiq to access HW registerm, has this been changed ?
>
>
> Regards
> shaoyun.liu
>
>
> On 2019-12-17 5:19 a.m., Monk Liu wrote:
> > issues:
> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
> >
> > fix:
> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> > there is a limitation to block this sequence:
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
> >
> > so the best way right now is to simply remove it.
> >
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 605cef6..ae962b9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct
> amdgpu_device *adev,
> > if (r)
> > return r;
> >
> > - amdgpu_amdkfd_pre_reset(adev);
> > -
> > /* Resume IP prior to SMC */
> > r = amdgpu_device_ip_reinit_early_sriov(adev);
> > if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&reserved=0
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639048992279&sdata=38z3sISWX26bZPplKeHvD0xIPCRbPAW%2BgKv2cXqetXc%3D&reserved=0>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639049012267&sdata=se3rrEVIDZa677riVu5MAf95y%2BxndiDw5BULScsxFBc%3D&reserved=0
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191219/9cde60b1/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 12243 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191219/9cde60b1/attachment-0001.png>
More information about the amd-gfx
mailing list