[PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI

Wed Dec 11 14:04:45 UTC 2019

Great! I will update the patches to also use the barrier in PSP MODE 1 
reset case and resend the patches for formal review.

Andrey

On 12/11/19 7:18 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> I tried your new patches to run BACO for about 10 loops and the result 
> looks positive, without observing enter/exit baco message failure again.
>
> The time interval between BACO entries or exits in my environment was 
> almost less than 10 us: max 36us, min 2us. I think it’s safe enough 
> according to the sample data we collected in both sides.
>
> And it looks not necessary to continue using system_highpri_wq any 
> more because we require all the nodes enter or exit at the same time, 
> while do not mind how long the time interval is b/t enter and exit. 
> The system_unbound_wq can satisfy our requirement here since it wakes 
> different CPUs up to work at the same time.
>
> Regards,
>
> Ma Le
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* Wednesday, December 11, 2019 3:56 AM
> *To:* Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org; Zhou1, 
> Tao <Tao.Zhou1 at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhang, 
> Hawking <Hawking.Zhang at amd.com>
> *Cc:* Chen, Guchun <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> I switched the workqueue we were using for xgmi_reset_work from 
> system_highpri_wq to system_unbound_wq - the difference is that 
> workers servicing the queue in system_unbound_wq are not bounded to 
> specific CPU and so the reset jobs for each XGMI node are getting 
> scheduled to different CPU while system_highpri_wq is a bounded work 
> queue. I traced it as bellow for 10 consecutive times and didn't see 
> errors any more. Also the time diff between BACO entries or exits was 
> never more then around 2 uS.
>
> Please give this updated patchset a try
>
>    kworker/u16:2-57    [004] ...1   243.276312: trace_code: func: 
> vega20_baco_set_state, line 91 <----- - Before BEACO enter
>            <...>-60    [007] ...1   243.276312: trace_code: func: 
> vega20_baco_set_state, line 91 <----- - Before BEACO enter
>    kworker/u16:2-57    [004] ...1   243.276384: trace_code: func: 
> vega20_baco_set_state, line 105 <----- - After BEACO enter done
>            <...>-60    [007] ...1   243.276392: trace_code: func: 
> vega20_baco_set_state, line 105 <----- - After BEACO enter done
>    kworker/u16:3-60    [007] ...1   243.276397: trace_code: func: 
> vega20_baco_set_state, line 108 <----- - Before BEACO exit
>    kworker/u16:2-57    [004] ...1   243.276399: trace_code: func: 
> vega20_baco_set_state, line 108 <----- - Before BEACO exit
>    kworker/u16:3-60    [007] ...1   243.288067: trace_code: func: 
> vega20_baco_set_state, line 114 <----- - After BEACO exit done
>    kworker/u16:2-57    [004] ...1   243.295624: trace_code: func: 
> vega20_baco_set_state, line 114 <----- - After BEACO exit done
>
> Andrey
>
> On 12/9/19 9:45 PM, Ma, Le wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     I’m fine with your solution if synchronization time interval
>     satisfies BACO requirements and loop test can pass on XGMI system.
>
>     Regards,
>
>     Ma Le
>
>     *From:*Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>     <mailto:Andrey.Grodzovsky at amd.com>
>     *Sent:* Monday, December 9, 2019 11:52 PM
>     *To:* Ma, Le <Le.Ma at amd.com> <mailto:Le.Ma at amd.com>;
>     amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>; Zhou1, Tao
>     <Tao.Zhou1 at amd.com> <mailto:Tao.Zhou1 at amd.com>; Deucher, Alexander
>     <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>;
>     Li, Dennis <Dennis.Li at amd.com> <mailto:Dennis.Li at amd.com>; Zhang,
>     Hawking <Hawking.Zhang at amd.com> <mailto:Hawking.Zhang at amd.com>
>     *Cc:* Chen, Guchun <Guchun.Chen at amd.com> <mailto:Guchun.Chen at amd.com>
>     *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset
>     support for XGMI
>
>     Thanks a lot Ma for trying - I think I have to have my own system
>     to debug this so I will keep trying enabling XGMI - i still think
>     the is the right and the generic solution for multiple nodes reset
>     synchronization and in fact the barrier should also be used for
>     synchronizing PSP mode 1 XGMI reset too.
>
>     Andrey
>
>     On 12/9/19 6:34 AM, Ma, Le wrote:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         Hi Andrey,
>
>         I tried your patches on my 2P XGMI platform. The baco can work
>         at most time, and randomly got following error:
>
>         [ 1701.542298] amdgpu: [powerplay] Failed to send message
>         0x25, response 0x0
>
>         This error usually means some sync issue exist for xgmi baco
>         case. Feel free to debug your patches on my XGMI platform.
>
>         Regards,
>
>         Ma Le
>
>         *From:*Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>         <mailto:Andrey.Grodzovsky at amd.com>
>         *Sent:* Saturday, December 7, 2019 5:51 AM
>         *To:* Ma, Le <Le.Ma at amd.com> <mailto:Le.Ma at amd.com>;
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>; Zhou1, Tao
>         <Tao.Zhou1 at amd.com> <mailto:Tao.Zhou1 at amd.com>; Deucher,
>         Alexander <Alexander.Deucher at amd.com>
>         <mailto:Alexander.Deucher at amd.com>; Li, Dennis
>         <Dennis.Li at amd.com> <mailto:Dennis.Li at amd.com>; Zhang, Hawking
>         <Hawking.Zhang at amd.com> <mailto:Hawking.Zhang at amd.com>
>         *Cc:* Chen, Guchun <Guchun.Chen at amd.com>
>         <mailto:Guchun.Chen at amd.com>
>         *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
>         reset support for XGMI
>
>         Hey Ma, attached a solution - it's just compiled as I still
>         can't make my XGMI setup work (with bridge connected only one
>         device is visible to the system while the other is not).
>         Please try it on your system if you have a chance.
>
>         Andrey
>
>         On 12/4/19 10:14 PM, Ma, Le wrote:
>
>             AFAIK it's enough for even single one node in the hive to
>             to fail the enter the BACO state on time to fail the
>             entire hive reset procedure, no ?
>
>             [Le]: Yeah, agree that. I’ve been thinking that make all
>             nodes entering baco simultaneously can reduce the
>             possibility of node failure to enter/exit BACO risk. For
>             example, in an XGMI hive with 8 nodes, the total time
>             interval of 8 nodes enter/exit BACO on 8 CPUs is less than
>             the interval that 8 nodes enter BACO serially and exit
>             BACO serially depending on one CPU with yield capability.
>             This interval is usually strict for BACO feature itself.
>             Anyway, we need more looping test later on any method we
>             will choose.
>
>             Any way - I see our discussion blocks your entire patch
>             set - I think you can go ahead and commit yours way (I
>             think you got an RB from Hawking) and I will look then and
>             see if I can implement my method and if it works will just
>             revert your patch.
>
>             [Le]: OK, fine.
>
>             Andrey
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191211/d7f8e4a4/attachment-0001.htm>