[PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Mon Dec 9 15:52:14 UTC 2019


Thanks a lot, Ma, for trying. I think I need my own system to debug
this, so I will keep trying to enable XGMI. I still think this is the
right and generic solution for synchronizing the reset of multiple
nodes, and in fact the barrier should also be used for synchronizing
the PSP mode 1 XGMI reset.

Andrey

On 12/9/19 6:34 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> Hi Andrey,
>
> I tried your patches on my 2P XGMI platform. BACO works most of the
> time, but randomly hits the following error:
>
> [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, 
> response 0x0
>
> This error usually means some sync issue exists in the XGMI BACO case.
> Feel free to debug your patches on my XGMI platform.
>
> Regards,
>
> Ma Le
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* Saturday, December 7, 2019 5:51 AM
> *To:* Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org; Zhou1, 
> Tao <Tao.Zhou1 at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhang, 
> Hawking <Hawking.Zhang at amd.com>
> *Cc:* Chen, Guchun <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> Hey Ma, attached is a solution. It is only compile-tested, as I still
> can't get my XGMI setup to work (with the bridge connected, only one
> device is visible to the system while the other is not). Please try it
> on your system if you have a chance.
>
> Andrey
>
> On 12/4/19 10:14 PM, Ma, Le wrote:
>
>     AFAIK it's enough for even a single node in the hive to fail to
>     enter the BACO state on time to fail the entire hive reset
>     procedure, no?
>
>     [Le]: Yeah, agreed. I've been thinking that making all nodes
>     enter BACO simultaneously can reduce the risk of a node failing
>     to enter/exit BACO. For example, in an XGMI hive with 8 nodes,
>     the total time for 8 nodes to enter/exit BACO on 8 CPUs is less
>     than the time for 8 nodes to enter BACO serially and exit BACO
>     serially on one CPU with yield capability. This interval is
>     usually strict for the BACO feature itself. Anyway, we need more
>     loop testing later on whichever method we choose.
>
>     Anyway, I see our discussion blocks your entire patch set. I
>     think you can go ahead and commit it your way (I think you got an
>     RB from Hawking), and I will then look and see if I can implement
>     my method; if it works, I will just revert your patch.
>
>     [Le]: OK, fine.
>
>     Andrey
>