[PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Wed Dec 4 16:05:32 UTC 2019


On 12/4/19 2:09 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* Wednesday, December 4, 2019 2:44 AM
> *To:* Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org; Zhou1, 
> Tao <Tao.Zhou1 at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhang, 
> Hawking <Hawking.Zhang at amd.com>
> *Cc:* Chen, Guchun <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> Thanks Ma, this was very helpful, as I am still not able to set up an
> XGMI hive with the latest FW and VBIOS.
>
> I traced the workqueue subsystem (full log attached). Specifically,
> here is the life cycle of our 2 work items executing
> amdgpu_device_xgmi_reset_func below.
>
> [Le]: Thanks Andrey for the deep debug. Your feedback gave me a more 
> profound understanding on this case. My comments split as below.
>
> You were right to note that they both run on the same CPU (32), but they
> are executed by different threads. Also, as you can see from the
> workqueue_execute_start/end timestamps, they actually ran in parallel and
> not one after another even while being assigned to the same CPU, and that
> is because of thread preemption (there is at least
> psp_v11_0_mode1_reset->msleep(500)), which yields the CPU and hence
> allows the second work to run. I am also sure that on a preemptive kernel
> one reset work would be preempted at some point anyway and let the
> other run.
>
> [Le]: Yes, from the trace log, the xgmi_reset_func items are assigned
> to different work threads bound to the same CPU. And you are right
> that CPU preemption will happen when msleep is called, which yields the
> CPU and allows the second work to run. That’s a great finding😊. But it’s
> not a *real* parallel run to me, because the second work can only preempt
> and run when the first work goes to sleep. I made an experiment here to
> change this single msleep to udelay, and then the second work item only
> ran after the first item finished, i.e. serial execution.
>

I would expect that in a kernel compiled with preemption support a running 
thread would be interrupted to let others run even when it does not 
voluntarily yield the CPU, so this is strange.
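
A minimal, hypothetical test module (not amdgpu code; the names and the CPU 
number are made up) along these lines could show whether two work items 
queued to the same CPU on system_highpri_wq get picked up by different 
kworker threads once the first one sleeps:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>
#include <linux/sched.h>

static struct work_struct work_a, work_b;

static void test_work_func(struct work_struct *work)
{
	pr_info("work %p start: cpu %d, worker %s (pid %d)\n",
		work, raw_smp_processor_id(), current->comm, current->pid);
	msleep(500);	/* sleep like psp_v11_0_mode1_reset() does */
	pr_info("work %p end: cpu %d, worker %s (pid %d)\n",
		work, raw_smp_processor_id(), current->comm, current->pid);
}

static int __init wq_test_init(void)
{
	INIT_WORK(&work_a, test_work_func);
	INIT_WORK(&work_b, test_work_func);
	/* queue both items to the same CPU on the high-priority workqueue */
	queue_work_on(0, system_highpri_wq, &work_a);
	queue_work_on(0, system_highpri_wq, &work_b);
	return 0;
}

static void __exit wq_test_exit(void)
{
	flush_work(&work_a);
	flush_work(&work_b);
}

module_init(wq_test_init);
module_exit(wq_test_exit);
MODULE_LICENSE("GPL");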


> Now, you had issues with BACO reset while the test I ran on your system
> used mode1 reset, so I assumed that maybe BACO has some non-preemptible
> busy wait which doesn't give the second work item's thread a chance to
> run on that CPU before the first one finished - but from looking at the
> code I see smu_v11_0_baco_enter->msleep(10), so even in that case the
> first reset work item was supposed to yield the CPU after BACO ENTER was
> sent to the SMU and let the other reset work do the same for the second
> card, so I don't see how there is serial execution even in this case?
>
> [Le]: VG20 uses the old powerplay framework
> (smu_v11_0_baco_enter->msleep(10) is in the swSMU framework), so there is
> no msleep and no CPU preemption. BACO reset has two phases, enter and
> exit. We expect all the XGMI nodes to enter BACO simultaneously instead
> of one after another in serial execution, and then exit BACO
> simultaneously.
>

Well, we can always add something like the diff below to force each XGMI 
reset work to let others run before going into BACO exit. We can also 
guarantee that all of the reset works will execute BACO ENTER before 
proceeding to BACO EXIT by using some kind of semaphore barrier along the 
lines of this - 
https://stackoverflow.com/questions/47522174/reusable-barrier-implementation-using-posix-semaphores 
(a rough sketch of such a barrier follows the diff). This would also solve 
the #XGMI_NODES > #CPUs use case.


diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 48649f5..3e91e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -531,6 +531,8 @@ static int soc15_asic_baco_reset(struct amdgpu_device *adev)
                 if (pp_funcs->set_asic_baco_state(pp_handle, 1))
                         return -EIO;

+               yield();
+
                 /* exit BACO state */
                 if (pp_funcs->set_asic_baco_state(pp_handle, 0))
                         return -EIO;
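
And for the barrier idea mentioned above, a rough sketch (not the amdgpu 
implementation; the struct, the function names and where it would be 
initialized per hive are hypothetical) could look like this, using an 
atomic counter and a wait queue instead of POSIX semaphores:

#include <linux/atomic.h>
#include <linux/wait.h>

struct baco_barrier {
	atomic_t remaining;		/* nodes that have not yet arrived */
	wait_queue_head_t wq;
};

static void baco_barrier_init(struct baco_barrier *b, int nodes)
{
	atomic_set(&b->remaining, nodes);
	init_waitqueue_head(&b->wq);
}

/*
 * Called by each XGMI reset work after sending BACO ENTER. The last node
 * to arrive wakes everyone up; nobody proceeds to BACO EXIT until all
 * nodes in the hive have entered BACO.
 */
static void baco_barrier_wait(struct baco_barrier *b)
{
	if (atomic_dec_and_test(&b->remaining))
		wake_up_all(&b->wq);
	else
		wait_event(b->wq, atomic_read(&b->remaining) == 0);
}

Each reset work would then call baco_barrier_wait() between the two 
set_asic_baco_state() calls instead of (or in addition to) the yield() 
above, which also removes any dependency on how the work items are spread 
across CPUs.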


> P.S. How does your solution solve the case where the XGMI hive is bigger
> than the number of CPUs on the system? Assuming that what you say is
> correct and there is serial execution when on the same CPU, if the hive
> is bigger than the number of CPUs you will eventually get back to sending
> reset work to a CPU already executing BACO ENTER (or EXIT) for another
> device and will get the serialization problem anyway.
>
> [Le]: Yeah, I also considered the situation where the XGMI hive is
> bigger than the CPU count. I think it’s an extreme situation and should
> not exist. However, assuming it does exist, many work items scattered
> across several CPUs will be executed faster than when bound to one and
> the same CPU, won’t they?
>

AFAIK it's enough for even a single node in the hive to fail to enter the 
BACO state on time for the entire hive reset procedure to fail, no?

Anyway - I see our discussion blocks your entire patch set - I think you 
can go ahead and commit your way (I think you got an RB from Hawking), and 
I will then look and see if I can implement my method; if it works we can 
just revert your patch.

Andrey


>              cat-3002  [032] d... 33153.791829: workqueue_queue_work: work struct=00000000e43c1ebb function=amdgpu_device_xgmi_reset_func [amdgpu] workqueue=0000000080331d91 req_cpu=8192 cpu=32
>              cat-3002  [032] d... 33153.791829: workqueue_activate_work: work struct 00000000e43c1ebb
>              cat-3002  [032] dN.. 33153.791831: workqueue_queue_work: work struct=00000000e67113aa function=amdgpu_device_xgmi_reset_func [amdgpu] workqueue=0000000080331d91 req_cpu=8192 cpu=32
>              cat-3002  [032] dN.. 33153.791832: workqueue_activate_work: work struct 00000000e67113aa
>    kworker/32:1H-551   [032] .... 33153.791834: workqueue_execute_start: work struct 00000000e43c1ebb: function amdgpu_device_xgmi_reset_func [amdgpu]
>    kworker/32:0H-175   [032] .... 33153.792087: workqueue_execute_start: work struct 00000000e67113aa: function amdgpu_device_xgmi_reset_func [amdgpu]
>    kworker/32:1H-551   [032] .... 33154.310948: workqueue_execute_end: work struct 00000000e43c1ebb
>    kworker/32:0H-175   [032] .... 33154.311043: workqueue_execute_end: work struct 00000000e67113aa
>
> Andrey
>
> On 12/3/19 5:06 AM, Ma, Le wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     Hi Andrey,
>
>     You can try the XGMI system below:
>
>     IP: 10.67.69.53
>
>     U/P: jenkins/0
>
>     The original drm-next kernel is installed.
>
>     Regards,
>
>     Ma Le
>
>     *From:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>     *Sent:* Tuesday, December 3, 2019 6:05 AM
>     *To:* Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org
>     *Cc:* Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>;
>     Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>;
>     Zhang, Hawking <Hawking.Zhang at amd.com>
>     *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset
>     support for XGMI
>
>     On 12/2/19 6:42 AM, Ma, Le wrote:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         *From:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>         *Sent:* Saturday, November 30, 2019 12:22 AM
>         *To:* Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org
>         *Cc:* Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>;
>         Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>;
>         Zhang, Hawking <Hawking.Zhang at amd.com>
>         *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
>         reset support for XGMI
>
>         On 11/28/19 4:00 AM, Ma, Le wrote:
>
>             -----Original Message-----
>             From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>             Sent: Wednesday, November 27, 2019 11:46 PM
>             To: Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org
>             Cc: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>;
>             Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>;
>             Zhang, Hawking <Hawking.Zhang at amd.com>
>             Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
>             reset support for XGMI
>
>             On 11/27/19 4:15 AM, Le Ma wrote:
>
>             > Currently each XGMI node reset wq does not run in parallel
>             > because the same work item bound to the same cpu runs in
>             > sequence. So change to bind the xgmi_reset_work item to
>             > different cpus.
>
>             It's not the same work item, see more below
>
>             >
>
>             > XGMI requires all nodes to enter baco within very close
>             > proximity before any node exits baco. So schedule the
>             > xgmi_reset_work wq twice for enter/exit baco respectively.
>
>             >
>
>             > The default reset code path and methods do not change for
>             > vega20 production:
>             >    - baco reset without xgmi/ras
>             >    - psp reset with xgmi/ras
>             >
>             > To enable baco for the XGMI/RAS case, both conditions below
>             > are needed:
>             >    - amdgpu_ras_enable=2
>             >    - baco-supported smu firmware
>             >
>             > The case where PSP reset and baco reset coexist within an
>             > XGMI hive is not under consideration.
>
>             >
>
>             > Change-Id: I9c08cf90134f940b42e20d2129ff87fba761c532
>
>             > Signed-off-by: Le Ma <le.ma at amd.com>
>
>             > ---
>
>             > drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 +
>
>             > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 78
>             ++++++++++++++++++++++++++----
>
>             >   2 files changed, 70 insertions(+), 10 deletions(-)
>
>             >
>
>             > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>
>             > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>
>             > index d120fe5..08929e6 100644
>
>             > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>
>             > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>
>             > @@ -998,6 +998,8 @@ struct amdgpu_device {
>
>             > int                                           pstate;
>
>             >          /* enable runtime pm on the device */
>
>             > bool                            runpm;
>
>             > +
>
>             > + bool                                        in_baco;
>
>             >   };
>
>             >
>
>             >   static inline struct amdgpu_device
>             *amdgpu_ttm_adev(struct
>
>             > ttm_bo_device *bdev) diff --git
>
>             > a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
>             > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
>             > index bd387bb..71abfe9 100644
>
>             > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
>             > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
>             > @@ -2654,7 +2654,13 @@ static void
>             amdgpu_device_xgmi_reset_func(struct work_struct *__work)
>
>             >          struct amdgpu_device *adev =
>
>             > container_of(__work, struct amdgpu_device, xgmi_reset_work);
>
>             >
>
>             > - adev->asic_reset_res =  amdgpu_asic_reset(adev);
>
>             > +      if (amdgpu_asic_reset_method(adev) ==
>             AMD_RESET_METHOD_BACO)
>
>             > + adev->asic_reset_res = (adev->in_baco == false) ?
>
>             > + amdgpu_device_baco_enter(adev->ddev) :
>
>             > + amdgpu_device_baco_exit(adev->ddev);
>
>             > +      else
>
>             > + adev->asic_reset_res = amdgpu_asic_reset(adev);
>
>             > +
>
>             >          if (adev->asic_reset_res)
>
>             > DRM_WARN("ASIC reset failed with error, %d for drm dev, %s",
>
>             >                                 adev->asic_reset_res,
>             adev->ddev->unique); @@ -3796,6 +3802,7 @@
>
>             > static int amdgpu_do_asic_reset(struct amdgpu_hive_info
>             *hive,
>
>             >          struct amdgpu_device *tmp_adev = NULL;
>
>             >          bool need_full_reset = *need_full_reset_arg,
>             vram_lost = false;
>
>             >          int r = 0;
>
>             > +      int cpu = smp_processor_id();
>
>             >
>
>             >          /*
>
>             >           * ASIC reset has to be done on all HGMI hive
>             nodes ASAP @@
>
>             > -3803,21 +3810,24 @@ static int
>             amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>
>             >           */
>
>             >          if (need_full_reset) {
>
>             > list_for_each_entry(tmp_adev, device_list_handle,
>             gmc.xgmi.head) {
>
>             > -                               /* For XGMI run all
>             resets in parallel to speed up the process */
>
>             > +                              /*
>
>             > +                              * For XGMI run all resets
>             in parallel to speed up the
>
>             > +                              * process by scheduling
>             the highpri wq on different
>
>             > +                              * cpus. For XGMI with
>             baco reset, all nodes must enter
>
>             > +                              * baco within close
>             proximity before anyone exit.
>
>             > +                              */
>
>             >                                if
>             (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
>
>             > -                                           if
>             (!queue_work(system_highpri_wq, &tmp_adev->xgmi_reset_work))
>
>             Note that tmp_adev->xgmi_reset_work (the work item) is per
>             device in the XGMI hive and not the same work item. So I
>             don't see why you need to explicitly queue them on different
>             CPUs; they should run in parallel already.
>
>             Andrey
>
>             [Le]: It’s also beyond my understanding that the 2 node
>             reset work items scheduled to the same cpu do not run in
>             parallel. But from the experiment results on my side, the
>             2nd work item always ran after the 1st work item finished.
>             Based on this result, I changed to queue them on different
>             CPUs to make sure the case with more XGMI nodes runs in
>             parallel, because baco requires all nodes to enter baco
>             within very close proximity.
>
>             The experiment code is as follows for your reference.
>             When the card0 worker is running, the card1 worker is not
>             observed to run.
>
>         The code below will only test that they don't run
>         concurrently - but this doesn't mean they don't run on
>         different CPUs and threads. I don't have an XGMI setup at hand
>         to test this theory, but what if there is some locking
>         dependency between them that serializes their execution? Can
>         you just add a one-line print inside
>         amdgpu_device_xgmi_reset_func that prints the CPU id, thread
>         name/id and card number?
>
>         Andrey
>
>         [Le]: I checked that if queue_work() is used directly several
>         times, the same CPU worker thread will be used. And the worker
>         per CPU will execute the items one by one. Our goal here is to
>         make xgmi_reset_func run concurrently for the XGMI BACO case.
>         That’s why I schedule them on different CPUs to run in parallel.
>         And I can share the XGMI system with you if you’d like to verify
>         further.
>
>     I tried today to set up an XGMI 2P system to test this but wasn't
>     able to load with the XGMI bridge in place (maybe a faulty bridge) -
>     so yea - maybe leave me your setup before your changes (the original
>     code) so I can try to enable some kernel traces that show the CPU id
>     and thread id to check this. It's just so weird that
>     system_highpri_wq, which is documented to be multi-cpu and
>     multi-threaded, wouldn't queue those work items to different cpus
>     and worker threads.
>
>     Andrey
>
>             +atomic_t card0_in_baco = ATOMIC_INIT(0);
>
>             +atomic_t card1_in_baco = ATOMIC_INIT(0);
>
>             +
>
>             static void amdgpu_device_xgmi_reset_func(struct
>             work_struct *__work)
>
>             {
>
>             struct amdgpu_device *adev =
>
>             container_of(__work, struct amdgpu_device, xgmi_reset_work);
>
>             + printk("lema1: card 0x%x goes into reset wq\n",
>             adev->pdev->bus->number);
>
>             + if (adev->pdev->bus->number == 0x7) {
>
>             + atomic_set(&card1_in_baco, 1);
>
>             + printk("lema1: card1 in baco from card1 view\n");
>
>             + }
>
>             +
>
>             if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)
>
>                            adev->asic_reset_res = (adev->in_baco ==
>             false) ?
>
>             amdgpu_device_baco_enter(adev->ddev) :
>
>             @@ -2664,6 +2673,23 @@ static void
>             amdgpu_device_xgmi_reset_func(struct work_struct *__work)
>
>             if (adev->asic_reset_res)
>
>             DRM_WARN("ASIC reset failed with error, %d for drm dev, %s",
>
>             adev->asic_reset_res, adev->ddev->unique);
>
>             +
>
>             + if (adev->pdev->bus->number == 0x4) {
>
>             + atomic_set(&card0_in_baco, 1);
>
>             +        printk("lema1: card0 in baco from card0 view\n");
>
>             +
>
>             + while (true)
>
>             + if (!!atomic_read(&card1_in_baco))
>
>             + break;
>
>             + printk("lema1: card1 in baco from card0 view\n");
>
>             +     }
>
>             +
>
>             + if (adev->pdev->bus->number == 0x7) {
>
>             + while (true)
>
>             + if (!!atomic_read(&card0_in_baco))
>
>             + break;
>
>             + printk("lema1: card0 in baco from card1 view\n");
>
>             + }
>
>             > +                                          if
>             (!queue_work_on(cpu, system_highpri_wq,
>
>             > +    &tmp_adev->xgmi_reset_work))
>
>             >                                                        r
>             = -EALREADY;
>
>             > +                                          cpu =
>             cpumask_next(cpu, cpu_online_mask);
>
>             >                                } else
>
>             >                                            r =
>             amdgpu_asic_reset(tmp_adev);
>
>             > -
>
>             > -                               if (r) {
>
>             > - DRM_ERROR("ASIC reset failed with error, %d for drm
>             dev, %s",
>
>             > - r, tmp_adev->ddev->unique);
>
>             > +                              if (r)
>
>             >                                            break;
>
>             > -                               }
>
>             >                      }
>
>             >
>
>             > -                   /* For XGMI wait for all PSP resets
>             to complete before proceed */
>
>             > +                  /* For XGMI wait for all work to
>             complete before proceed */
>
>             >                      if (!r) {
>
>             > list_for_each_entry(tmp_adev, device_list_handle,
>
>             >     gmc.xgmi.head) {
>
>             > @@ -3826,11 +3836,59 @@ static int
>             amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>
>             >                                                        r
>             = tmp_adev->asic_reset_res;
>
>             > if (r)
>
>             > break;
>
>             > + if(AMD_RESET_METHOD_BACO ==
>
>             > +    amdgpu_asic_reset_method(tmp_adev))
>
>             > + tmp_adev->in_baco = true;
>
>             >                                            }
>
>             >                                }
>
>             >                      }
>
>             > -       }
>
>             >
>
>             > +                  /*
>
>             > +                  * For XGMI with baco reset, need exit
>             baco phase by scheduling
>
>             > +                  * xgmi_reset_work one more time. PSP
>             reset skips this phase.
>
>             > +                  * Not assume the situation that PSP
>             reset and baco reset
>
>             > +                  * coexist within an XGMI hive.
>
>             > +                  */
>
>             > +
>
>             > +                  if (!r) {
>
>             > +                              cpu = smp_processor_id();
>
>             > + list_for_each_entry(tmp_adev, device_list_handle,
>
>             > +     gmc.xgmi.head) {
>
>             > +                                          if
>             (tmp_adev->gmc.xgmi.num_physical_nodes > 1
>
>             > + && AMD_RESET_METHOD_BACO ==
>
>             > + amdgpu_asic_reset_method(tmp_adev)) {
>
>             > + if (!queue_work_on(cpu,
>
>             > + system_highpri_wq,
>
>             > +             &tmp_adev->xgmi_reset_work))
>
>             > + r = -EALREADY;
>
>             > + if (r)
>
>             > + break;
>
>             > + cpu = cpumask_next(cpu, cpu_online_mask);
>
>             > +                                          }
>
>             > +                              }
>
>             > +                  }
>
>             > +
>
>             > +                  if (!r) {
>
>             > + list_for_each_entry(tmp_adev, device_list_handle,
>
>             > +     gmc.xgmi.head) {
>
>             > +                                          if
>             (tmp_adev->gmc.xgmi.num_physical_nodes > 1
>
>             > + && AMD_RESET_METHOD_BACO ==
>
>             > + amdgpu_asic_reset_method(tmp_adev)) {
>
>             > + flush_work(&tmp_adev->xgmi_reset_work);
>
>             > +                                                      r
>             = tmp_adev->asic_reset_res;
>
>             > + if (r)
>
>             > + break;
>
>             > + tmp_adev->in_baco = false;
>
>             > +                                          }
>
>             > +                              }
>
>             > +                  }
>
>             > +
>
>             > +                  if (r) {
>
>             > +                              DRM_ERROR("ASIC reset
>             failed with error, %d for drm dev, %s",
>
>             > +                                          r,
>             tmp_adev->ddev->unique);
>
>             > +                              goto end;
>
>             > +                  }
>
>             > +      }
>
>             >
>
>             > list_for_each_entry(tmp_adev, device_list_handle,
>             gmc.xgmi.head) {
>
>             >                      if (need_full_reset) {
>