[PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI

Ma, Le Le.Ma at amd.com
Tue Dec 10 02:45:33 UTC 2019


[AMD Official Use Only - Internal Distribution Only]

I'm fine with your solution if synchronization time interval satisfies BACO requirements and loop test can pass on XGMI system.

Regards,
Ma Le

From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
Sent: Monday, December 9, 2019 11:52 PM
To: Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org; Zhou1, Tao <Tao.Zhou1 at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
Cc: Chen, Guchun <Guchun.Chen at amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI


Thanks a lot Ma for trying - I think I have to have my own system to debug this so I will keep trying enabling XGMI - i still think the is the right and the generic solution for multiple nodes reset synchronization and in fact the barrier should also be used for synchronizing PSP mode 1 XGMI reset too.

Andrey
On 12/9/19 6:34 AM, Ma, Le wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

I tried your patches on my 2P XGMI platform. The baco can work at most time, and randomly got following error:
[ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0

This error usually means some sync issue exist for xgmi baco case. Feel free to debug your patches on my XGMI platform.

Regards,
Ma Le

From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com><mailto:Andrey.Grodzovsky at amd.com>
Sent: Saturday, December 7, 2019 5:51 AM
To: Ma, Le <Le.Ma at amd.com><mailto:Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>; Zhou1, Tao <Tao.Zhou1 at amd.com><mailto:Tao.Zhou1 at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com><mailto:Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com><mailto:Dennis.Li at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com><mailto:Hawking.Zhang at amd.com>
Cc: Chen, Guchun <Guchun.Chen at amd.com><mailto:Guchun.Chen at amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI


Hey Ma, attached a solution - it's just compiled as I still can't make my XGMI setup work (with bridge connected only one device is visible to the system while the other is not). Please try it on your system if you have a chance.

Andrey
On 12/4/19 10:14 PM, Ma, Le wrote:

AFAIK it's enough for even single one node in the hive to to fail the enter the BACO state on time to fail the entire hive reset procedure, no ?
[Le]: Yeah, agree that. I've been thinking that make all nodes entering baco simultaneously can reduce the possibility of node failure to enter/exit BACO risk. For example, in an XGMI hive with 8 nodes, the total time interval of 8 nodes enter/exit BACO on 8 CPUs is less than the interval that 8 nodes enter BACO serially and exit BACO serially depending on one CPU with yield capability. This interval is usually strict for BACO feature itself. Anyway, we need more looping test later on any method we will choose.

Any way - I see our discussion blocks your entire patch set - I think you can go ahead and commit yours way (I think you got an RB from Hawking) and I will look then and see if I can implement my method and if it works will just revert your patch.

[Le]: OK, fine.

Andrey
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191210/42214262/attachment-0001.htm>


More information about the amd-gfx mailing list