<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>After checking the code: on the KFD side, it should be simple to
      just add the check in the stop_cpsch code. For KIQ, WREG32 has no
      return value, so there is no easy way to check whether the access
      succeeded. Maybe we can add a kiq_status field to struct amdgpu_kiq
      to indicate whether the KIQ is hung, then check kiq_status in the
      hqd_destroy function after acquire_queue and finish the destroy
      function early if the KIQ is hung (for SRIOV only). <br>
    </p>
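    <p>A minimal sketch of the kiq_status idea, for discussion only. The
      types and helpers below (sketch_kiq, sketch_device,
      sketch_hqd_destroy) are hypothetical stand-ins, not the real amdgpu
      structures, and the -EIO return code is an assumption since no
      error code has been chosen yet.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu definitions. */
struct sketch_kiq {
    bool hung;                     /* plays the role of the proposed kiq_status */
};

struct sketch_device {
    bool is_sriov_vf;              /* assumption: set when running as an SRIOV VF */
    struct sketch_kiq kiq;
    unsigned int dequeue_requests; /* counts simulated dequeue request writes */
};

/* Stand-ins for acquire_queue()/release_queue(); they do nothing here. */
static void sketch_acquire_queue(struct sketch_device *dev) { (void)dev; }
static void sketch_release_queue(struct sketch_device *dev) { (void)dev; }

static int sketch_hqd_destroy(struct sketch_device *dev)
{
    assert(dev != NULL);
    sketch_acquire_queue(dev);

    /* Proposed check: on SRIOV only, finish early when the KIQ is marked hung. */
    if (dev->is_sriov_vf && dev->kiq.hung) {
        sketch_release_queue(dev);
        return -EIO;               /* assumed error code, not from the driver */
    }

    dev->dequeue_requests++;       /* stands in for the dequeue request write */
    sketch_release_queue(dev);
    return 0;
}
```

    <p>With this shape, a VF whose KIQ is marked hung never reaches the
      dequeue request write, while bare metal takes the normal path
      regardless of the flag.</p>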
    <p> Any comments ? <br>
    </p>
    <p><br>
    </p>
    <p>shaoyun.liu<br>
    </p>
    <p><br>
    </p>
    <p>On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:<br>
    </p>
    <blockquote type="cite" cite="mid:DM6PR12MB32410F2E41C7AC550FCAB829F4520@DM6PR12MB3241.namprd12.prod.outlook.com">
      
      I see, thanks for the detailed information.<br>
      Normally when the CP is hung, the HIQ access to unmap the queue
      will fail before the driver calls hqd_destroy. I think the driver
      should add code to check the return value and directly finish
      the pre_reset in this case. If the HIQ is not hung but the KIQ
      is, we can use the same logic in the hqd_destroy function: return
      on the first access failure instead of going further. With this
      change we can probably move the pre_reset function back to its
      normal place.
      <br>
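      <p>The "return on the first failure" flow described above could be
        sketched roughly as follows. Everything here is a hypothetical
        model (sketch_hw, sketch_unmap_queues, sketch_hqd_destroy_one,
        sketch_pre_reset are invented names), and propagating the error
        code to the caller is an assumption; whether pre_reset should
        instead report success after skipping the hardware access is
        left open.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hardware model; not a driver structure. */
struct sketch_hw {
    bool hiq_hung;                  /* HIQ cannot service the unmap request */
    bool kiq_hung;                  /* KIQ cannot service register access */
    unsigned int hqd_destroy_calls; /* how many destroy attempts were made */
};

/* Models the unmap request sent through the HIQ. */
static int sketch_unmap_queues(struct sketch_hw *hw)
{
    return hw->hiq_hung ? -ETIME : 0;
}

/* Models one hqd_destroy call whose register access goes through the KIQ. */
static int sketch_hqd_destroy_one(struct sketch_hw *hw)
{
    hw->hqd_destroy_calls++;
    return hw->kiq_hung ? -ETIME : 0;
}

/* Stops at the first failing access instead of touching the hardware again. */
static int sketch_pre_reset(struct sketch_hw *hw, unsigned int nqueues)
{
    int r;

    assert(hw != NULL);
    r = sketch_unmap_queues(hw);
    if (r)
        return r;                   /* HIQ hung: skip hqd_destroy entirely */

    for (unsigned int i = 0; i < nqueues; i++) {
        r = sketch_hqd_destroy_one(hw);
        if (r)
            return r;               /* KIQ hung: do not try further queues */
    }
    return 0;
}
```

      <p>In this model a hung HIQ costs one failed access, and a hung
        KIQ costs one failed destroy attempt rather than one per
        queue.</p>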
      Felix, do you have any concerns or comments on the change?<br>
      <br>
      Regards<br>
      Shaoyun.liu<br>
      <hr style="display:inline-block;width:98%" tabindex="-1">
      <div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Liu,
          Monk <a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com">&lt;Monk.Liu@amd.com&gt;</a><br>
          <b>Sent:</b> December 19, 2019 1:13:24 AM<br>
          <b>To:</b> Liu, Shaoyun <a class="moz-txt-link-rfc2396E" href="mailto:Shaoyun.Liu@amd.com">&lt;Shaoyun.Liu@amd.com&gt;</a>;
          <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>
          <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx@lists.freedesktop.org">&lt;amd-gfx@lists.freedesktop.org&gt;</a><br>
          <b>Subject:</b> RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test
          fail in TDR of SRIOV</font>
        <div> </div>
      </div>
      <style>
<!--
@font-face
        {font-family:"Cambria Math"}
@font-face
        {font-family:DengXian}
@font-face
        {font-family:Calibri}
@font-face
        {font-family:DengXian}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
        {color:blue;
        text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
        {color:purple;
        text-decoration:underline}
p.x_msonormal0, li.x_msonormal0, div.x_msonormal0
        {margin-right:0in;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif}
span.x_EmailStyle19
        {font-family:"Calibri",sans-serif;
        color:windowtext}
.x_MsoChpDefault
        {font-size:10.0pt}
@page WordSection1
        {margin:1.0in 1.25in 1.0in 1.25in}
div.x_WordSection1
        {}
-->
</style>
      <div link="blue" vlink="purple" lang="EN-US">
        <div class="x_WordSection1">
          <p class="x_MsoNormal">>>> I would like to check why
            we need a special sequence for SRIOV in this pre_reset. If
            possible, making it the same as bare-metal mode sounds like
            a better solution.</p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">Because calling the function before VF
            FLR leads to register access through the KIQ, which will not
            complete because KIQ/GFX has already hung by that time.</p>
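          <p class="x_MsoNormal">A rough model of that failure, for
            illustration only. It assumes, as explained further down this
            thread, that WREG32/RREG32 are routed through the KIQ when the
            driver runs as an SRIOV VF outside full exclusive mode. All
            names (sketch_adev, sketch_uses_kiq, sketch_wreg32) are
            hypothetical and do not mirror the real macros.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_REGS 8u

/* Hypothetical device model; not the real struct amdgpu_device. */
struct sketch_adev {
    bool is_sriov_vf;                /* running as an SRIOV virtual function */
    bool exclusive_mode;             /* VF currently owns the GPU exclusively */
    bool kiq_hung;                   /* KIQ no longer processes requests */
    uint32_t regs[SKETCH_NUM_REGS];  /* simulated register file */
    unsigned int stuck_kiq_requests; /* writes handed to a hung KIQ */
};

/* Assumed routing rule: VF outside exclusive mode goes through the KIQ. */
static bool sketch_uses_kiq(const struct sketch_adev *adev)
{
    return adev->is_sriov_vf && !adev->exclusive_mode;
}

/* Like WREG32, this returns nothing, so the caller cannot see a failure. */
static void sketch_wreg32(struct sketch_adev *adev, uint32_t reg, uint32_t val)
{
    assert(reg < SKETCH_NUM_REGS);

    if (sketch_uses_kiq(adev) && adev->kiq_hung) {
        adev->stuck_kiq_requests++;  /* request queued, never completes */
        return;
    }
    adev->regs[reg] = val;           /* direct MMIO or a healthy KIQ */
}
```

          <p class="x_MsoNormal">In this model the same write lands when
            the VF is in exclusive mode or on bare metal, and is silently
            lost when it is routed to a hung KIQ.</p>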
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">>>> I don't remember any
            register access in the amdkfd_pre_reset call; please let me
            know if this assumption is wrong.</p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">Please check “void pm_uninit(struct
            packet_manager *pm)”, which is invoked inside
            amdkfd_pre_reset():
          </p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">It will call uninitialize() in the
            kfd_kernel_queue.c file</p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">And then go to the path of
            “kq->mqd_mgr->destroy_mqd(…)”</p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">And finally it calls “static int <span style="background:yellow">
              kgd_hqd_destroy</span>(…)” in amdgpu_amdkfd_gfx_v10.c</p>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal"> </p>
          <pre>{
    struct amdgpu_device *adev = get_amdgpu_device(kgd);
    enum hqd_dequeue_request_type type;
    unsigned long end_jiffies;
    uint32_t temp;
    struct v10_compute_mqd *m = get_mqd(mqd);

#if 0
    unsigned long flags;
    int retry;
#endif

    acquire_queue(kgd, pipe_id, queue_id); <span style="background:yellow">// this introduces register access via KIQ</span>

    if (m-&gt;cp_hqd_vmid == 0)
        WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); <span style="background:yellow">// this introduces register access via KIQ</span>

    switch (reset_type) {
    case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
        type = DRAIN_PIPE;
        break;
    case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
        type = RESET_WAVES;
        break;
    default:
        type = DRAIN_PIPE;
        break;
    }

    /* ... part of the function is omitted here, as in the original excerpt ... */

    WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), type); <span style="background:yellow">// this introduces register access via KIQ</span>

    end_jiffies = (utimeout * HZ / 1000) + jiffies;
    while (true) {
        temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); <span style="background:yellow">// this introduces register access via KIQ</span>
        if (!(temp &amp; CP_HQD_ACTIVE__ACTIVE_MASK))
            break;
        if (time_after(jiffies, end_jiffies)) {
            pr_err("cp queue preemption time out.\n");
            release_queue(kgd);
            return -ETIME;
        }
        usleep_range(500, 1000);
    }

    release_queue(kgd);
    return 0;</pre>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">If we use the bare-metal sequence, none
            of the <span style="background:yellow">
              highlighted</span> register accesses above will work,
            because KIQ/GFX has already died by that time, which means
            amdkfd_pre_reset() does not actually work as expected.</p>
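          <p class="x_MsoNormal">The wait loop at the end of the excerpt
            can be modeled in user space roughly as below. This is only a
            sketch: it bounds the wait by a poll count instead of jiffies,
            and it models a queue whose active bit never clears, which is
            the simpler case where the read itself still returns. The
            names sketch_fake_hqd, sketch_read_active and
            sketch_wait_for_idle are hypothetical.</p>

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ACTIVE_MASK 0x1u

/* Hypothetical fake queue: goes idle after a fixed number of reads. */
struct sketch_fake_hqd {
    unsigned int reads_until_idle; /* UINT_MAX models a queue that never idles */
    unsigned int reads;            /* number of status reads performed so far */
};

/* Stands in for reading the "queue active" status register. */
static uint32_t sketch_read_active(struct sketch_fake_hqd *hqd)
{
    hqd->reads++;
    return hqd->reads > hqd->reads_until_idle ? 0u : SKETCH_ACTIVE_MASK;
}

/* Polls until the active bit clears or the poll budget is used up. */
static int sketch_wait_for_idle(struct sketch_fake_hqd *hqd, unsigned int max_polls)
{
    assert(hqd != NULL);

    for (unsigned int i = 0; i < max_polls; i++) {
        if (!(sketch_read_active(hqd) & SKETCH_ACTIVE_MASK))
            return 0;              /* queue reported idle */
    }
    return -ETIME;                 /* same error code the excerpt returns */
}
```

          <p class="x_MsoNormal">A queue that idles after three reads
            returns 0 on the fourth poll; one that never idles consumes
            the whole budget and returns -ETIME.</p>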
          <p class="x_MsoNormal"> </p>
          <div>
            <p class="x_MsoNormal">_____________________________________</p>
            <p class="x_MsoNormal"><span style="font-size:12.0pt;
                color:black; background:white">Monk Liu|GPU
                Virtualization Team |</span><span style="font-size:12.0pt; color:#C82613; border:none
                windowtext 1.0pt; padding:0in; background:white">AMD</span></p>
          </div>
          <p class="x_MsoNormal"> </p>
          <div>
            <div style="border:none; border-top:solid #E1E1E1 1.0pt;
              padding:3.0pt 0in 0in 0in">
              <p class="x_MsoNormal"><b>From:</b> Liu, Shaoyun
                <a class="moz-txt-link-rfc2396E" href="mailto:Shaoyun.Liu@amd.com">&lt;Shaoyun.Liu@amd.com&gt;</a> <br>
                <b>Sent:</b> Thursday, December 19, 2019 12:30 PM<br>
                <b>To:</b> Liu, Monk <a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com">&lt;Monk.Liu@amd.com&gt;</a>;
                <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
                <b>Subject:</b> Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring
                test fail in TDR of SRIOV</p>
            </div>
          </div>
          <p class="x_MsoNormal"> </p>
          <p class="x_MsoNormal">I don't remember any register access in
            the amdkfd_pre_reset call; please let me know if this
            assumption is wrong.
            <br>
            This function uses the HIQ to access the CP. If the CP is
            already hung, we might not be able to get a response from the
            HW and will get a timeout. I think KFD should handle this
            internally. Felix already has some comments on that.
            <br>
            I would like to check why we need a special sequence for
            SRIOV in this pre_reset. If possible, making it the same as
            bare-metal mode sounds like a better solution.
            <br>
            <br>
            Regards<br>
            Shaoyun.liu</p>
          <div class="x_MsoNormal" style="text-align:center" align="center">
            <hr width="98%" size="3" align="center">
          </div>
          <div id="x_divRplyFwdMsg">
            <p class="x_MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> Liu, Monk &lt;<a href="mailto:Monk.Liu@amd.com" moz-do-not-send="true">Monk.Liu@amd.com</a>&gt;<br>
                <b>Sent:</b> December 18, 2019 10:52:47 PM<br>
                <b>To:</b> Liu, Shaoyun &lt;<a href="mailto:Shaoyun.Liu@amd.com" moz-do-not-send="true">Shaoyun.Liu@amd.com</a>&gt;;
                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                &lt;<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>&gt;<br>
                <b>Subject:</b> RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring
                test fail in TDR of SRIOV</span>
            </p>
            <div>
              <p class="x_MsoNormal"> </p>
            </div>
          </div>
          <div>
            <div>
              <p class="x_MsoNormal">Oh, by the way<br>
                <br>
                >>> Do we know the root cause why this function
                would ruin MEC ?<br>
                <br>
                It only ruins the MEC, and leads to the subsequent KIQ
                ring test failure, when we call this function right after
                VF FLR. On bare-metal it is called before the GPU reset,
                which is why we don't have this issue on bare-metal.
                <br>
                <br>
                But the reason we cannot call it before VF FLR in the
                SRIOV case was already stated in this thread.
                <br>
                <br>
                Thanks<br>
                _____________________________________<br>
                Monk Liu|GPU Virtualization Team |AMD<br>
                <br>
                <br>
                -----Original Message-----<br>
                From: Liu, Monk <br>
                Sent: Thursday, December 19, 2019 11:49 AM<br>
                To: shaoyunl &lt;<a href="mailto:shaoyun.liu@amd.com" moz-do-not-send="true">shaoyun.liu@amd.com</a>&gt;; <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">
                  amd-gfx@lists.freedesktop.org</a><br>
                Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test
                fail in TDR of SRIOV<br>
                <br>
                Hi Shaoyun<br>
                <br>
                >>> Do we know the root cause why this function
                would ruin MEC ? From the logic, I think this function
                should be called before FLR since we need to disable the
                user queue submission first.<br>
                Right now I don't know which specific step leads to the
                KIQ ring test failure. I totally agree with you that this
                function should be called before VF FLR, but we cannot do
                that, and the reason is described in the comment:<br>
                <br>
                > if we do pre_reset() before VF FLR, it would go KIQ
                way to do register <br>
                > access and stuck there, because KIQ probably won't
                work by that time <br>
                > (e.g. you already made GFX hang)<br>
                <br>
                <br>
                >>> I remembered the function should use hiq to
                communicate with HW, shouldn't use kiq to access HW
                register, has this been changed?<br>
                This function uses WREG32/RREG32 to do register access,
                like all other functions in KMD, and WREG32/RREG32 will
                let the KIQ do the register access if we are in dynamic
                SRIOV mode (meaning we are an SRIOV VF and are not in
                full exclusive mode).<br>
                <br>
                So if you call this function before EVENT_5 (event 5
                triggers VF FLR), it will run in dynamic mode and the KIQ
                will handle the register access. That is not an option,
                since ME/MEC has probably already hung (if we are
                testing quark on gfx/compute rings).<br>
                <br>
                Do you have a good suggestion ?<br>
                <br>
                thanks<br>
                _____________________________________<br>
                Monk Liu|GPU Virtualization Team |AMD<br>
                <br>
                <br>
                -----Original Message-----<br>
                From: amd-gfx &lt;<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>&gt;
                On Behalf Of shaoyunl<br>
                Sent: Tuesday, December 17, 2019 11:38 PM<br>
                To: <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test
                fail in TDR of SRIOV<br>
                <br>
                I think the amdkfd side depends on this call to stop the
                user queues. Without this call, the user queues can submit
                to the HW during the reset, which could cause a hang again ...<br>
                Do we know the root cause of why this function would ruin
                the MEC? Logically, I think this function should be
                called before FLR, since we need to disable user queue
                submission first.<br>
                As I remember, the function should use the HIQ to
                communicate with the HW and shouldn't use the KIQ to access
                HW registers. Has this been changed?<br>
                <br>
                <br>
                Regards<br>
                shaoyun.liu<br>
                <br>
                <br>
                On 2019-12-17 5:19 a.m., Monk Liu wrote:<br>
                > issues:<br>
                > MEC is ruined by the amdkfd_pre_reset after VF FLR
                done<br>
                ><br>
                > fix:<br>
                > amdkfd_pre_reset() would ruin MEC after hypervisor
                finished the VF <br>
                > FLR, the correct sequence is do amdkfd_pre_reset
                before VF FLR but <br>
                > there is a limitation to block this sequence:<br>
                > if we do pre_reset() before VF FLR, it would go KIQ
                way to do register <br>
                > access and stuck there, because KIQ probably won't
                work by that time <br>
                > (e.g. you already made GFX hang)<br>
                ><br>
                > so the best way right now is to simply remove it.<br>
                ><br>
                > Signed-off-by: Monk Liu &lt;<a href="mailto:Monk.Liu@amd.com" moz-do-not-send="true">Monk.Liu@amd.com</a>&gt;<br>
                > ---<br>
                >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --<br>
                >   1 file changed, 2 deletions(-)<br>
                ><br>
                > diff --git
                a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                > index 605cef6..ae962b9 100644<br>
                > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                > @@ -3672,8 +3672,6 @@ static int
                amdgpu_device_reset_sriov(struct amdgpu_device *adev,<br>
                >        if (r)<br>
                >                return r;<br>
                >   <br>
                > -     amdgpu_amdkfd_pre_reset(adev);<br>
                > -<br>
                >        /* Resume IP prior to SMC */<br>
                >        r =
                amdgpu_device_ip_reinit_early_sriov(adev);<br>
                >        if (r)<br>
                _______________________________________________<br>
                amd-gfx mailing list<br>
                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></p>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>