<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Thanks Ma, this was very helpful as I am still not able to set up
      an XGMI hive with the latest FW and VBIOS.</p>
    <p>I traced the workqueue subsystem (full log attached).
      Specifically, here is the life cycle of our 2 work items executing
      amdgpu_device_xgmi_reset_func below.</p>
    <p>You were right to note that they both run on the same CPU (32),
      but they are executed by different threads. Also, as you can see
      from the workqueue_execute_start/end timestamps, they actually ran
      in parallel and not one after another, even though they were
      assigned to the same CPU. That is because of thread preemption:
      there is at least psp_v11_0_mode1_reset->msleep(500), which yields
      the CPU and hence allows the second work item to run. I am also
      sure that on a preemptive kernel one reset work item would be
      preempted at some point anyway and let the other run. Now, you had
      issues with BACO reset, while the test I ran on your system is
      mode1 reset, so I assumed that maybe BACO has some non-preemptible
      busy wait which doesn't give the second work item's thread a
      chance to run on that CPU before the first one finishes. But from
      looking at the code I see smu_v11_0_baco_enter->msleep(10), so
      even in that case the first reset work item should yield the CPU
      after BACO ENTER is sent to the SMU and let the other reset work
      item do the same for the second card. So I don't see how there is
      serial execution even in this case?</p>
    <p>P.S. How does your solution handle the case where the XGMI hive
      is bigger than the number of CPUs on the system? Assuming that
      what you say is correct and there is serial execution on the same
      CPU, if the hive is bigger than the number of CPUs you will
      eventually get back to sending reset work to a CPU already
      executing BACO ENTER (or EXIT) for another device, and you will
      get the serialization problem anyway. <br>
    </p>
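    <p>To make the concern concrete, here is a minimal userspace sketch
      (not driver code) that hands out one reset work item per hive node
      to CPUs in order, wrapping around when it runs out of CPUs. The
      wrap-around and the names max_work_items_per_cpu, num_nodes and
      num_cpus are my assumptions for illustration only; the patch
      itself advances with cpumask_next(cpu, cpu_online_mask). Once the
      hive has more nodes than there are CPUs, at least one CPU ends up
      with two or more work items.</p>

```c
#include <stdlib.h>

/*
 * Hypothetical illustration only: distribute one reset work item per
 * XGMI node across CPUs in round-robin order and return the largest
 * number of work items that lands on any single CPU.
 * Returns -1 on invalid input or allocation failure.
 */
int max_work_items_per_cpu(int num_nodes, int num_cpus)
{
	int *per_cpu;
	int cpu = 0;
	int max = 0;
	int i;

	if (num_nodes < 0 || num_cpus <= 0)
		return -1;

	per_cpu = calloc((size_t)num_cpus, sizeof(*per_cpu));
	if (!per_cpu)
		return -1;

	for (i = 0; i < num_nodes; i++) {
		per_cpu[cpu]++;
		if (per_cpu[cpu] > max)
			max = per_cpu[cpu];
		cpu = (cpu + 1) % num_cpus;	/* assumed wrap-around */
	}

	free(per_cpu);
	return max;
}
```

    <p>For example, max_work_items_per_cpu(2, 64) returns 1, while
      max_work_items_per_cpu(8, 4) returns 2, i.e. some CPU would have
      to run two reset work items.</p>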
    <p>             cat-3002  [032] d... 33153.791829:
      workqueue_queue_work: work struct=00000000e43c1ebb
      function=amdgpu_device_xgmi_reset_func [amdgpu]
      workqueue=0000000080331d91 req_cpu=8192 cpu=32<br>
                   cat-3002  [032] d... 33153.791829:
      workqueue_activate_work: work struct 00000000e43c1ebb<br>
                   cat-3002  [032] dN.. 33153.791831:
      workqueue_queue_work: work struct=00000000e67113aa
      function=amdgpu_device_xgmi_reset_func [amdgpu]
      workqueue=0000000080331d91 req_cpu=8192 cpu=32<br>
                   cat-3002  [032] dN.. 33153.791832:
      workqueue_activate_work: work struct 00000000e67113aa<br>
         kworker/32:1H-551   [032] .... 33153.791834:
      workqueue_execute_start: work struct 00000000e43c1ebb: function
      amdgpu_device_xgmi_reset_func [amdgpu]<br>
         kworker/32:0H-175   [032] .... 33153.792087:
      workqueue_execute_start: work struct 00000000e67113aa: function
      amdgpu_device_xgmi_reset_func [amdgpu]<br>
         kworker/32:1H-551   [032] .... 33154.310948:
      workqueue_execute_end: work struct 00000000e43c1ebb<br>
         kworker/32:0H-175   [032] .... 33154.311043:
      workqueue_execute_end: work struct 00000000e67113aa</p>
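    <p>As a quick sanity check of the timestamps above, here is a small
      userspace sketch (the struct and function names are made up for
      illustration, they are not kernel API) that treats each
      workqueue_execute_start/end pair as a time window and checks
      whether the two windows overlap, and for how long.</p>

```c
#include <stdbool.h>

/* One work item's execution window, in seconds, as printed by ftrace. */
struct exec_window {
	double start;	/* workqueue_execute_start timestamp */
	double end;	/* workqueue_execute_end timestamp */
};

/* Two windows overlap if each one starts before the other one ends. */
bool windows_overlap(struct exec_window a, struct exec_window b)
{
	return a.start < b.end && b.start < a.end;
}

/* Length of the shared interval in seconds, or 0.0 if there is none. */
double overlap_seconds(struct exec_window a, struct exec_window b)
{
	double start = a.start > b.start ? a.start : b.start;
	double end = a.end < b.end ? a.end : b.end;

	return end > start ? end - start : 0.0;
}
```

    <p>With the values from the trace, { 33153.791834, 33154.310948 }
      for work struct 00000000e43c1ebb and { 33153.792087, 33154.311043 }
      for 00000000e67113aa, windows_overlap() is true and
      overlap_seconds() is about 0.519 s, i.e. the two items ran
      concurrently for almost their whole duration.</p>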
    <p>Andrey<br>
    </p>
    <p><br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 12/3/19 5:06 AM, Ma, Le wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:MN2PR12MB42855B198BB4064A0D311845F6420@MN2PR12MB4285.namprd12.prod.outlook.com">
      
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@DengXian";
        panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
        {mso-style-priority:99;
        mso-style-link:"Plain Text Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:14.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
p.msonormal0, li.msonormal0, div.msonormal0
        {mso-style-name:msonormal;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
span.PlainTextChar
        {mso-style-name:"Plain Text Char";
        mso-style-priority:99;
        mso-style-link:"Plain Text";
        font-family:"Calibri",sans-serif;}
p.msipheadera92e061b, li.msipheadera92e061b, div.msipheadera92e061b
        {mso-style-name:msipheadera92e061b;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
span.EmailStyle21
        {mso-style-type:personal;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.EmailStyle22
        {mso-style-type:personal;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.EmailStyle23
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
      <div class="WordSection1">
        <p class="msipheadera92e061b" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#0078D7">[AMD
            Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">Hi Andrey,<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">You can try the
            XGMI system below:<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">              IP:
            10.67.69.53<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">              U/P:
            jenkins/0<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">The original
            drm-next kernel is installed.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">Regards,<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext">Ma Le<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> Grodzovsky, Andrey
                <a class="moz-txt-link-rfc2396E" href="mailto:Andrey.Grodzovsky@amd.com"><Andrey.Grodzovsky@amd.com></a>
                <br>
                <b>Sent:</b> Tuesday, December 3, 2019 6:05 AM<br>
                <b>To:</b> Ma, Le <a class="moz-txt-link-rfc2396E" href="mailto:Le.Ma@amd.com"><Le.Ma@amd.com></a>;
                <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
                <b>Cc:</b> Chen, Guchun <a class="moz-txt-link-rfc2396E" href="mailto:Guchun.Chen@amd.com"><Guchun.Chen@amd.com></a>;
                Zhou1, Tao <a class="moz-txt-link-rfc2396E" href="mailto:Tao.Zhou1@amd.com"><Tao.Zhou1@amd.com></a>; Deucher, Alexander
                <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a>; Li, Dennis
                <a class="moz-txt-link-rfc2396E" href="mailto:Dennis.Li@amd.com"><Dennis.Li@amd.com></a>; Zhang, Hawking
                <a class="moz-txt-link-rfc2396E" href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a><br>
                <b>Subject:</b> Re: [PATCH 07/10] drm/amdgpu: add
                concurrent baco reset support for XGMI<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On 12/2/19 6:42 AM, Ma, Le wrote:<o:p></o:p></p>
        </div>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="MsoNormal"><span style="color:windowtext"> </span><o:p></o:p></p>
          <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"> </span><o:p></o:p></p>
          <p class="MsoNormal"><span style="font-size:12.0pt;color:windowtext"> </span><o:p></o:p></p>
          <div>
            <div style="border:none;border-top:solid #E1E1E1
              1.0pt;padding:3.0pt 0in 0in 0in">
              <p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> Grodzovsky, Andrey
                  <a href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true"><Andrey.Grodzovsky@amd.com></a>
                  <br>
                  <b>Sent:</b> Saturday, November 30, 2019 12:22 AM<br>
                  <b>To:</b> Ma, Le <a href="mailto:Le.Ma@amd.com" moz-do-not-send="true"><Le.Ma@amd.com></a>; <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">
                    amd-gfx@lists.freedesktop.org</a><br>
                  <b>Cc:</b> Chen, Guchun <a href="mailto:Guchun.Chen@amd.com" moz-do-not-send="true"><Guchun.Chen@amd.com></a>;
                  Zhou1, Tao
                  <a href="mailto:Tao.Zhou1@amd.com" moz-do-not-send="true"><Tao.Zhou1@amd.com></a>;
                  Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">
                    <Alexander.Deucher@amd.com></a>; Li, Dennis <a href="mailto:Dennis.Li@amd.com" moz-do-not-send="true"><Dennis.Li@amd.com></a>;
                  Zhang, Hawking
                  <a href="mailto:Hawking.Zhang@amd.com" moz-do-not-send="true"><Hawking.Zhang@amd.com></a><br>
                  <b>Subject:</b> Re: [PATCH 07/10] drm/amdgpu: add
                  concurrent baco reset support for XGMI</span><o:p></o:p></p>
            </div>
          </div>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p> <o:p></o:p></p>
          <div>
            <p class="MsoNormal">On 11/28/19 4:00 AM, Ma, Le wrote:<o:p></o:p></p>
          </div>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">-----Original Message-----<br>
              From: Grodzovsky, Andrey <a href="mailto:Andrey.Grodzovsky@amd.com" moz-do-not-send="true"><Andrey.Grodzovsky@amd.com></a>
              <br>
              Sent: Wednesday, November 27, 2019 11:46 PM<br>
              To: Ma, Le <a href="mailto:Le.Ma@amd.com" moz-do-not-send="true"><Le.Ma@amd.com></a>; <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">
                amd-gfx@lists.freedesktop.org</a><br>
              Cc: Chen, Guchun <a href="mailto:Guchun.Chen@amd.com" moz-do-not-send="true"><Guchun.Chen@amd.com></a>;
              Zhou1, Tao
              <a href="mailto:Tao.Zhou1@amd.com" moz-do-not-send="true"><Tao.Zhou1@amd.com></a>;
              Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">
                <Alexander.Deucher@amd.com></a>; Li, Dennis <a href="mailto:Dennis.Li@amd.com" moz-do-not-send="true"><Dennis.Li@amd.com></a>;
              Zhang, Hawking
              <a href="mailto:Hawking.Zhang@amd.com" moz-do-not-send="true"><Hawking.Zhang@amd.com></a><br>
              Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
              reset support for XGMI<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">On 11/27/19 4:15 AM, Le Ma wrote:<o:p></o:p></p>
            <p class="MsoPlainText">> Currently each XGMI node reset
              wq does not run in parrallel because
              <o:p></o:p></p>
            <p class="MsoPlainText">> same work item bound to same
              cpu runs in sequence. So change to bound
              <o:p></o:p></p>
            <p class="MsoPlainText">> the xgmi_reset_work item to
              different cpus.<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">It's not the same work item, see
              more below<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> XGMI requires all nodes enter
              into baco within very close proximity
              <o:p></o:p></p>
            <p class="MsoPlainText">> before any node exit baco. So
              schedule the xgmi_reset_work wq twice
              <o:p></o:p></p>
            <p class="MsoPlainText">> for enter/exit baco
              respectively.<o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> The default reset code path and
              methods do not change for vega20 production:<o:p></o:p></p>
            <p class="MsoPlainText">>    - baco reset without
              xgmi/ras<o:p></o:p></p>
            <p class="MsoPlainText">>    - psp reset with xgmi/ras<o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> To enable baco for XGMI/RAS
              case, both 2 conditions below are needed:<o:p></o:p></p>
            <p class="MsoPlainText">>    - amdgpu_ras_enable=2<o:p></o:p></p>
            <p class="MsoPlainText">>    - baco-supported smu
              firmware<o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> The case that PSP reset and
              baco reset coexist within an XGMI hive is
              <o:p></o:p></p>
            <p class="MsoPlainText">> not in the consideration.<o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> Change-Id:
              I9c08cf90134f940b42e20d2129ff87fba761c532<o:p></o:p></p>
            <p class="MsoPlainText">> Signed-off-by: Le Ma <<a href="mailto:le.ma@amd.com" moz-do-not-send="true"><span style="color:windowtext;text-decoration:none">le.ma@amd.com</span></a>><o:p></o:p></p>
            <p class="MsoPlainText">> ---<o:p></o:p></p>
            <p class="MsoPlainText">>  
              drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 +<o:p></o:p></p>
            <p class="MsoPlainText">>  
              drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 78
              ++++++++++++++++++++++++++----<o:p></o:p></p>
            <p class="MsoPlainText">>   2 files changed, 70
              insertions(+), 10 deletions(-)<o:p></o:p></p>
            <p class="MsoPlainText">> <o:p></o:p></p>
            <p class="MsoPlainText">> diff --git
              a/drivers/gpu/drm/amd/amdgpu/amdgpu.h <o:p></o:p></p>
            <p class="MsoPlainText">>
              b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<o:p></o:p></p>
            <p class="MsoPlainText">> index d120fe5..08929e6 100644<o:p></o:p></p>
            <p class="MsoPlainText">> ---
              a/drivers/gpu/drm/amd/amdgpu/amdgpu.h<o:p></o:p></p>
            <p class="MsoPlainText">> +++
              b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<o:p></o:p></p>
            <p class="MsoPlainText">> @@ -998,6 +998,8 @@ struct
              amdgpu_device {<o:p></o:p></p>
            <p class="MsoPlainText">>         
              int                                           pstate;<o:p></o:p></p>
            <p class="MsoPlainText">>          /* enable runtime pm
              on the device */<o:p></o:p></p>
            <p class="MsoPlainText">>         
              bool                            runpm;<o:p></o:p></p>
            <p class="MsoPlainText">> +<o:p></o:p></p>
            <p class="MsoPlainText">> +     
              bool                                        in_baco;<o:p></o:p></p>
            <p class="MsoPlainText">>   };<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">>   static inline struct
              amdgpu_device *amdgpu_ttm_adev(struct
              <o:p></o:p></p>
            <p class="MsoPlainText">> ttm_bo_device *bdev) diff --git
              <o:p></o:p></p>
            <p class="MsoPlainText">>
              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <o:p></o:p></p>
            <p class="MsoPlainText">>
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<o:p></o:p></p>
            <p class="MsoPlainText">> index bd387bb..71abfe9 100644<o:p></o:p></p>
            <p class="MsoPlainText">> ---
              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<o:p></o:p></p>
            <p class="MsoPlainText">> +++
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<o:p></o:p></p>
            <p class="MsoPlainText">> @@ -2654,7 +2654,13 @@ static
              void amdgpu_device_xgmi_reset_func(struct work_struct
              *__work)<o:p></o:p></p>
            <p class="MsoPlainText">>          struct amdgpu_device
              *adev =<o:p></o:p></p>
            <p class="MsoPlainText">>                     
              container_of(__work, struct amdgpu_device,
              xgmi_reset_work);<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">> -       adev->asic_reset_res
              =  amdgpu_asic_reset(adev);<o:p></o:p></p>
            <p class="MsoPlainText">> +      if
              (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)<o:p></o:p></p>
            <p class="MsoPlainText">> +                 
              adev->asic_reset_res = (adev->in_baco == false) ?<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
                          amdgpu_device_baco_enter(adev->ddev) :<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
                          amdgpu_device_baco_exit(adev->ddev);<o:p></o:p></p>
            <p class="MsoPlainText">> +      else<o:p></o:p></p>
            <p class="MsoPlainText">> +                 
              adev->asic_reset_res = amdgpu_asic_reset(adev);<o:p></o:p></p>
            <p class="MsoPlainText">> +<o:p></o:p></p>
            <p class="MsoPlainText">>          if
              (adev->asic_reset_res)<o:p></o:p></p>
            <p class="MsoPlainText">>                     
              DRM_WARN("ASIC reset failed with error, %d for drm dev,
              %s",<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                              adev->asic_reset_res,
              adev->ddev->unique); @@ -3796,6 +3802,7 @@
              <o:p></o:p></p>
            <p class="MsoPlainText">> static int
              amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,<o:p></o:p></p>
            <p class="MsoPlainText">>          struct amdgpu_device
              *tmp_adev = NULL;<o:p></o:p></p>
            <p class="MsoPlainText">>          bool need_full_reset =
              *need_full_reset_arg, vram_lost = false;<o:p></o:p></p>
            <p class="MsoPlainText">>          int r = 0;<o:p></o:p></p>
            <p class="MsoPlainText">> +      int cpu =
              smp_processor_id();<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">>          /*<o:p></o:p></p>
            <p class="MsoPlainText">>           * ASIC reset has to
              be done on all HGMI hive nodes ASAP @@
              <o:p></o:p></p>
            <p class="MsoPlainText">> -3803,21 +3810,24 @@ static int
              amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,<o:p></o:p></p>
            <p class="MsoPlainText">>           */<o:p></o:p></p>
            <p class="MsoPlainText">>          if (need_full_reset) {<o:p></o:p></p>
            <p class="MsoPlainText">>                     
              list_for_each_entry(tmp_adev, device_list_handle,
              gmc.xgmi.head) {<o:p></o:p></p>
            <p class="MsoPlainText">> -                              
              /* For XGMI run all resets in parallel to speed up the
              process */<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              /*<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              * For XGMI run all resets in parallel to speed up the<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              * process by scheduling the highpri wq on different<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              * cpus. For XGMI with baco reset, all nodes must enter<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              * baco within close proximity before anyone exit.<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              */<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                             if
              (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              -                                           if
              (!queue_work(system_highpri_wq,
              &tmp_adev->xgmi_reset_work))<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">Note that
              tmp_adev->xgmi_reset_work (the work item) is per device
              in the XGMI hive, so these are not the same work item. I
              don't see why you need to explicitly queue them on
              different CPUs; they should run in parallel already.<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">Andrey<o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">[Le]:
                It’s also beyond my understanding why the 2 node reset
                work items scheduled to the same CPU do not run in
                parallel. But in my experiments, the 2nd work item
                always runs after the 1st work item has finished. Based
                on this result, I changed the code to queue them on
                different CPUs to make sure that cases with more XGMI
                nodes run in parallel, because BACO requires all nodes
                to enter BACO within very close proximity.
              </span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864"> </span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">The
                experiment code is as follows for your reference. While
                the card0 worker is running, the card1 worker is not
                observed to run.</span><o:p></o:p></p>
          </blockquote>
          <p> <o:p></o:p></p>
          <p>The code below will only test that they don't run
            concurrently, but this doesn't mean they don't run on
            different CPUs and threads. I don't have an XGMI setup at
            hand to test this theory, but what if there is some locking
            dependency between them that serializes their execution?
            Can you just add a one-line print inside <span style="color:#203864">
              amdgpu_device_xgmi_reset_func </span>that prints the CPU id,
            thread name/id and card number?<o:p></o:p></p>
          <p>Andrey<o:p></o:p></p>
          <p><span style="color:#203864">[Le]: I checked that if
              queue_work() is used directly several times, the same CPU
              thread is used, and the per-CPU worker executes the items
              one by one. Our goal here is to make xgmi_reset_func run
              concurrently for the XGMI BACO case. That’s why I schedule
              them on different CPUs to run in parallel. I can share
              the XGMI system with you if you’d like to verify more.</span><o:p></o:p></p>
        </blockquote>
        <p><o:p> </o:p></p>
        <p>I tried today to set up an XGMI 2P system to test this but
          wasn't able to load with the XGMI bridge in place (maybe a
          faulty bridge). So yes, maybe leave me your setup as it was
          before your changes (the original code) so I can try to
          collect some kernel traces that show CPU id and thread id to
          check this. It's just so weird that system_highpri_wq, which
          is documented to be multi-CPU and multi-threaded, wouldn't
          queue those work items to different CPUs and worker threads.<o:p></o:p></p>
        <p>Andrey<o:p></o:p></p>
        <p><o:p> </o:p></p>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoPlainText"><span style="color:#203864"> </span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+atomic_t
                card0_in_baco = ATOMIC_INIT(0);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+atomic_t
                card1_in_baco = ATOMIC_INIT(0);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">static
                void amdgpu_device_xgmi_reset_func(struct work_struct
                *__work)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">{</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">       
                struct amdgpu_device *adev =</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">               
                container_of(__work, struct amdgpu_device,
                xgmi_reset_work);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864"> </span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                printk("lema1: card 0x%x goes into reset wq\n",
                adev->pdev->bus->number);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                if (adev->pdev->bus->number == 0x7) {</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                atomic_set(&card1_in_baco, 1);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                printk("lema1: card1 in baco from card1 view\n");</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                }</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">       
                if (amdgpu_asic_reset_method(adev) ==
                AMD_RESET_METHOD_BACO)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">               adev->asic_reset_res
                = (adev->in_baco == false) ?</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">                               
                amdgpu_device_baco_enter(adev->ddev) :</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">@@
                -2664,6 +2673,23 @@ static void
                amdgpu_device_xgmi_reset_func(struct work_struct
                *__work)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">       
                if (adev->asic_reset_res)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">               
                DRM_WARN("ASIC reset failed with error, %d for drm dev,
                %s",</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">                        
                adev->asic_reset_res, adev->ddev->unique);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                if (adev->pdev->bus->number == 0x4) {</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                atomic_set(&card0_in_baco, 1);</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+       
                       printk("lema1: card0 in baco from card0 view\n");</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                while (true)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+                      
                if (!!atomic_read(&card1_in_baco))</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+                              
                break;</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                printk("lema1: card1 in baco from card0 view\n");</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+  
                    }</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                if (adev->pdev->bus->number == 0x7) {</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                while (true)</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+                      
                if (!!atomic_read(&card0_in_baco))</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+                              
                break;</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+              
                printk("lema1: card0 in baco from card1 view\n");</span><o:p></o:p></p>
            <p class="MsoPlainText"><span style="color:#203864">+      
                }</span><o:p></o:p></p>
            <p class="MsoPlainText"> <o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          if
              (!queue_work_on(cpu, system_highpri_wq,<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
                 &tmp_adev->xgmi_reset_work))<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                                     r =
              -EALREADY;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          cpu =
              cpumask_next(cpu, cpu_online_mask);<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                             } else<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                         r =
              amdgpu_asic_reset(tmp_adev);<o:p></o:p></p>
            <p class="MsoPlainText">> -<o:p></o:p></p>
            <p class="MsoPlainText">> -                              
              if (r) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              -                                          
              DRM_ERROR("ASIC reset failed with error, %d for drm dev,
              %s",<o:p></o:p></p>
            <p class="MsoPlainText">>
              -                                                       r,
              tmp_adev->ddev->unique);<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              if (r)<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                         break;<o:p></o:p></p>
            <p class="MsoPlainText">> -                              
              }<o:p></o:p></p>
            <p class="MsoPlainText">>                      }<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">> -                   /* For XGMI
              wait for all PSP resets to complete before proceed */<o:p></o:p></p>
            <p class="MsoPlainText">> +                  /* For XGMI
              wait for all work to complete before proceeding */<o:p></o:p></p>
            <p class="MsoPlainText">>                      if (!r) {<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                            
              list_for_each_entry(tmp_adev, device_list_handle,<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                                    
                  gmc.xgmi.head) {<o:p></o:p></p>
            <p class="MsoPlainText">> @@ -3826,11 +3836,59 @@ static
              int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                                     r =
              tmp_adev->asic_reset_res;<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                                     if
              (r)<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                                                
              break;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                     
              if(AMD_RESET_METHOD_BACO ==<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                        
              amdgpu_asic_reset_method(tmp_adev))<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
              tmp_adev->in_baco = true;<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                                         }<o:p></o:p></p>
            <p class="MsoPlainText">>  
                                             }<o:p></o:p></p>
            <p class="MsoPlainText">>                      }<o:p></o:p></p>
            <p class="MsoPlainText">> -       }<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">> +                  /*<o:p></o:p></p>
            <p class="MsoPlainText">> +                  * For XGMI
              with BACO reset, the BACO phase must be exited by scheduling<o:p></o:p></p>
            <p class="MsoPlainText">> +                  *
              xgmi_reset_work one more time. PSP reset skips this phase.<o:p></o:p></p>
            <p class="MsoPlainText">> +                  * This assumes
              that PSP reset and BACO reset do not<o:p></o:p></p>
            <p class="MsoPlainText">> +                  * coexist
              within an XGMI hive.<o:p></o:p></p>
            <p class="MsoPlainText">> +                  */<o:p></o:p></p>
            <p class="MsoPlainText">> +<o:p></o:p></p>
            <p class="MsoPlainText">> +                  if (!r) {<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              cpu = smp_processor_id();<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              list_for_each_entry(tmp_adev, device_list_handle,<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                         
              gmc.xgmi.head) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          if
              (tmp_adev->gmc.xgmi.num_physical_nodes > 1<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                              &&
              AMD_RESET_METHOD_BACO ==<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                             
              amdgpu_asic_reset_method(tmp_adev)) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                      if
              (!queue_work_on(cpu,<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
              system_highpri_wq,<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                     
                          &tmp_adev->xgmi_reset_work))<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
              r = -EALREADY;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                      if
              (r)<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
              break;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                      cpu
              = cpumask_next(cpu, cpu_online_mask);<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          }<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              }<o:p></o:p></p>
            <p class="MsoPlainText">> +                  }<o:p></o:p></p>
            <p class="MsoPlainText">> +<o:p></o:p></p>
            <p class="MsoPlainText">> +                  if (!r) {<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              list_for_each_entry(tmp_adev, device_list_handle,<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                         
              gmc.xgmi.head) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          if
              (tmp_adev->gmc.xgmi.num_physical_nodes > 1<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                              &&
              AMD_RESET_METHOD_BACO ==<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                             
              amdgpu_asic_reset_method(tmp_adev)) {<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                     
              flush_work(&tmp_adev->xgmi_reset_work);<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                      r =
              tmp_adev->asic_reset_res;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                      if
              (r)<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                                 
              break;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                                     
              tmp_adev->in_baco = false;<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          }<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              }<o:p></o:p></p>
            <p class="MsoPlainText">> +                  }<o:p></o:p></p>
            <p class="MsoPlainText">> +<o:p></o:p></p>
            <p class="MsoPlainText">> +                  if (r) {<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              DRM_ERROR("ASIC reset failed with error, %d for drm dev,
              %s",<o:p></o:p></p>
            <p class="MsoPlainText">>
              +                                          r,
              tmp_adev->ddev->unique);<o:p></o:p></p>
            <p class="MsoPlainText">> +                             
              goto end;<o:p></o:p></p>
            <p class="MsoPlainText">> +                  }<o:p></o:p></p>
            <p class="MsoPlainText">> +      }<o:p></o:p></p>
            <p class="MsoPlainText">>   <o:p></o:p></p>
            <p class="MsoPlainText">>         
              list_for_each_entry(tmp_adev, device_list_handle,
              gmc.xgmi.head) {<o:p></o:p></p>
            <p class="MsoPlainText">>                      if
              (need_full_reset) {<o:p></o:p></p>
          </blockquote>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>