[PATCH v2] drm/amdgpu/gfx9: Convert `//` to `/* ... */` in cleaner shader code

Fri Sep 27 13:44:53 UTC 2024

Am 25.09.24 um 21:45 schrieb Felix Kuehling:
>
> On 2024-09-04 07:03, Christian König wrote:
>> It could be that the automated testing tools will run amok on that.
>>
>> I need to ask Felix for details but I think the code for the CWSR 
>> trap handlers were converted to C style comments as well because of 
>> that.
>
> I'm not aware of that. I see both comment styles in 
> amdkfd/cwsr_trap_handler_*.asm.

Oh, I was assuming that this is the C header with the binary variant and 
not the asm file.

My fault, in that case it indeed doesn't matter.

Regards,
Christian.

>
>
> Regards,
>   Felix
>
>
>>
>> Christian.
>>
>> Am 04.09.24 um 12:52 schrieb Deucher, Alexander:
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>
>>> I think you can drop this patch.  This code is not compiled, it's 
>>> just for reference, and changing all of the comments will just make 
>>> it harder to keep in sync with the internal version.
>>>
>>> Alex
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>> *From:* SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM at amd.com>
>>> *Sent:* Wednesday, September 4, 2024 5:51 AM
>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Deucher, 
>>> Alexander <Alexander.Deucher at amd.com>
>>> *Cc:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>; 
>>> SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM at amd.com>
>>> *Subject:* [PATCH v2] drm/amdgpu/gfx9: Convert `//` to `/* ... */` 
>>> in cleaner shader code
>>> This commit updates the comment style in the cleaner shader code from
>>> `//` to `/* ... */` to adhere to the Linux kernel coding style.
>>>
>>> The comments describe the operation of the cleaner shader, which is 
>>> used
>>> to clean LDS, SGPRs, and VGPRs. The shader uses two kernels launched
>>> separately to clean VGPRs, LDS, and lower SGPRs, and to clean remaining
>>> SGPRs.
>>>
>>> Fixes: 3b721dfb2c95 ("drm/amdgpu/gfx9: Add cleaner shader for 
>>> GFX9.4.3")
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam at amd.com>
>>> ---
>>> v2:
>>>  - Corrected typo for iteraions
>>>  - Added fixes tag
>>>
>>>  .../amd/amdgpu/gfx_v9_4_3_cleaner_shader.asm  | 139 +++++++++---------
>>>  1 file changed, 72 insertions(+), 67 deletions(-)
>>>
>>> diff --git 
>>> a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3_cleaner_shader.asm 
>>> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3_cleaner_shader.asm
>>> index d5325ef80ab0..8951b5a87ae1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3_cleaner_shader.asm
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3_cleaner_shader.asm
>>> @@ -21,47 +21,52 @@
>>>   * OTHER DEALINGS IN THE SOFTWARE.
>>>   */
>>>
>>> -// This shader is to clean LDS, SGPRs and VGPRs. It is  first 64 
>>> Dwords or 256 bytes of 192 Dwords cleaner shader.
>>> -//To turn this shader program on for complitaion change this to 
>>> main and lower shader main to main_1
>>> -
>>> -// MI300 : Clear SGPRs, VGPRs and LDS
>>> -//   Uses two kernels launched separately:
>>> -//   1. Clean VGPRs, LDS, and lower SGPRs
>>> -//        Launches one workgroup per CU, each workgroup with 4x 
>>> wave64 per SIMD in the CU
>>> -//        Waves are "wave64" and have 128 VGPRs each, which uses 
>>> all 512 VGPRs per SIMD
>>> -//        Waves in the workgroup share the 64KB of LDS
>>> -//        Each wave clears SGPRs 0 - 95. Because there are 4 
>>> waves/SIMD, this is physical SGPRs 0-383
>>> -//        Each wave clears 128 VGPRs, so all 512 in the SIMD
>>> -//        The first wave of the workgroup clears its 64KB of LDS
>>> -//        The shader starts with "S_BARRIER" to ensure SPI has 
>>> launched all waves of the workgroup
>>> -//          before any wave in the workgroup could end. Without 
>>> this, it is possible not all SGPRs get cleared.
>>> -//    2. Clean remaining SGPRs
>>> -//        Launches a workgroup with 24 waves per workgroup, 
>>> yielding 6 waves per SIMD in each CU
>>> -//        Waves are allocating 96 SGPRs
>>> -//          CP sets up SPI_RESOURCE_RESERVE_* registers to prevent 
>>> these waves from allocating SGPRs 0-223.
>>> -//          As such, these 6 waves per SIMD are allocated physical 
>>> SGPRs 224-799
>>> -//        Barriers do not work for >16 waves per workgroup, so we 
>>> cannot start with S_BARRIER
>>> -//          Instead, the shader starts with an S_SETHALT 1. Once 
>>> all waves are launched CP will send unhalt command
>>> -//        The shader then clears all SGPRs allocated to it, 
>>> cleaning out physical SGPRs 224-799
>>> +/*
>>> + * This shader is to clean LDS, SGPRs and VGPRs. It is  first 64 
>>> Dwords or 256 bytes of 192 Dwords cleaner shader.
>>> + * To turn this shader program on for complitaion change this to 
>>> main and lower shader main to main_1
>>> + *
>>> + * MI300 : Clear SGPRs, VGPRs and LDS
>>> + *   Uses two kernels launched separately:
>>> + *   1. Clean VGPRs, LDS, and lower SGPRs
>>> + *        Launches one workgroup per CU, each workgroup with 4x 
>>> wave64 per SIMD in the CU
>>> + *        Waves are "wave64" and have 128 VGPRs each, which uses 
>>> all 512 VGPRs per SIMD
>>> + *        Waves in the workgroup share the 64KB of LDS
>>> + *        Each wave clears SGPRs 0 - 95. Because there are 4 
>>> waves/SIMD, this is physical SGPRs 0-383
>>> + *        Each wave clears 128 VGPRs, so all 512 in the SIMD
>>> + *        The first wave of the workgroup clears its 64KB of LDS
>>> + *        The shader starts with "S_BARRIER" to ensure SPI has 
>>> launched all waves of the workgroup
>>> + *          before any wave in the workgroup could end. Without 
>>> this, it is possible not all SGPRs get cleared.
>>> + *    2. Clean remaining SGPRs
>>> + *        Launches a workgroup with 24 waves per workgroup, 
>>> yielding 6 waves per SIMD in each CU
>>> + *        Waves are allocating 96 SGPRs
>>> + *          CP sets up SPI_RESOURCE_RESERVE_* registers to prevent 
>>> these waves from allocating SGPRs 0-223.
>>> + *          As such, these 6 waves per SIMD are allocated physical 
>>> SGPRs 224-799
>>> + *       Barriers do not work for >16 waves per workgroup, so we 
>>> cannot start with S_BARRIER
>>> + *       Instead, the shader starts with an S_SETHALT 1. Once all 
>>> waves are launched CP will send unhalt command
>>> + *       The shader then clears all SGPRs allocated to it, cleaning 
>>> out physical SGPRs 224-799
>>> + */
>>>
>>>  shader main
>>>    asic(MI300)
>>>    type(CS)
>>>    wave_size(64)
>>> -// Note: original source code from SQ team
>>>
>>> -//   (theorhetical fastest = ~512clks vgpr + 1536 lds + ~128 sgpr  
>>> = 2176 clks)
>>> +/*
>>> + * Note: original source code from SQ team
>>> + *
>>> + * (theorhetical fastest = ~512clks vgpr + 1536 lds + ~128 sgpr  = 
>>> 2176 clks)
>>> + */
>>>
>>> -  s_cmp_eq_u32 s0, 1 // Bit0 is set, sgpr0 is set then clear VGPRS 
>>> and LDS as FW set COMPUTE_USER_DATA_3
>>> -  s_cbranch_scc0  label_0023 // Clean VGPRs and LDS if sgpr0 of 
>>> wave is set, scc = (s3 == 1)
>>> +  s_cmp_eq_u32 s0, 1 /* Bit0 is set, sgpr0 is set then clear VGPRS 
>>> and LDS as FW set COMPUTE_USER_DATA_3 */
>>> +  s_cbranch_scc0  label_0023 /* Clean VGPRs and LDS if sgpr0 of 
>>> wave is set, scc = (s3 == 1) */
>>>    S_BARRIER
>>>
>>>    s_movk_i32    m0, 0x0000
>>> -  s_mov_b32     s2, 0x00000078  // Loop 128/8=16 times  (loop 
>>> unrolled for performance)
>>> -  //
>>> -  // CLEAR VGPRs
>>> -  //
>>> -  s_set_gpr_idx_on  s2, 0x8    // enable Dest VGPR indexing
>>> +  s_mov_b32     s2, 0x00000078  /* Loop 128/8=16 times  (loop 
>>> unrolled for performance) */
>>> +/*
>>> + * CLEAR VGPRs
>>> + */
>>> +  s_set_gpr_idx_on  s2, 0x8    /* enable Dest VGPR indexing */
>>>  label_0005:
>>>    v_mov_b32     v0, 0
>>>    v_mov_b32     v1, 0
>>> @@ -75,24 +80,24 @@ label_0005:
>>>    s_set_gpr_idx_idx  s2
>>>    s_cbranch_scc0  label_0005
>>>    s_set_gpr_idx_off
>>> -
>>> -  //
>>> -  //
>>> -
>>> -  s_mov_b32     s2, 0x80000000 // Bit31 is first_wave
>>> -  s_and_b32     s2, s2, s1 // sgpr0 has tg_size (first_wave) term 
>>> as in ucode only COMPUTE_PGM_RSRC2.tg_size_en is set
>>> -  s_cbranch_scc0  label_clean_sgpr_1 // Clean LDS if its first wave 
>>> of ThreadGroup/WorkGroup
>>> -  // CLEAR LDS
>>> -  //
>>> +
>>> +  s_mov_b32     s2, 0x80000000 /* Bit31 is first_wave */
>>> +  s_and_b32     s2, s2, s1 /* sgpr0 has tg_size (first_wave) term 
>>> as in ucode only COMPUTE_PGM_RSRC2.tg_size_en is set */
>>> +  s_cbranch_scc0  label_clean_sgpr_1 /* Clean LDS if its first wave 
>>> of ThreadGroup/WorkGroup */
>>> +/*
>>> + * CLEAR LDS
>>> + */
>>>    s_mov_b32 exec_lo, 0xffffffff
>>>    s_mov_b32 exec_hi, 0xffffffff
>>> -  v_mbcnt_lo_u32_b32  v1, exec_hi, 0          // Set V1 to 
>>> thread-ID (0..63)
>>> -  v_mbcnt_hi_u32_b32  v1, exec_lo, v1         // Set V1 to 
>>> thread-ID (0..63)
>>> -  v_mul_u32_u24  v1, 0x00000008, v1           // * 8, so each 
>>> thread is a double-dword address (8byte)
>>> -  s_mov_b32     s2, 0x00000003f               // 64 loop iteraions
>>> +  v_mbcnt_lo_u32_b32  v1, exec_hi, 0          /* Set V1 to 
>>> thread-ID (0..63) */
>>> +  v_mbcnt_hi_u32_b32  v1, exec_lo, v1         /* Set V1 to 
>>> thread-ID (0..63) */
>>> +  v_mul_u32_u24  v1, 0x00000008, v1           /* * 8, so each 
>>> thread is a double-dword address (8byte) */
>>> +  s_mov_b32     s2, 0x00000003f               /* 64 loop iterations */
>>>    s_mov_b32     m0, 0xffffffff
>>> -  // Clear all of LDS space
>>> -  // Each FirstWave of WorkGroup clears 64kbyte block
>>> +/*
>>> + * Clear all of LDS space
>>> + * Each FirstWave of WorkGroup clears 64kbyte block
>>> + */
>>>
>>>  label_001F:
>>>    ds_write2_b64  v1, v[2:3], v[2:3] offset1:32
>>> @@ -100,11 +105,11 @@ label_001F:
>>>    v_add_co_u32     v1, vcc, 0x00000400, v1
>>>    s_sub_u32     s2, s2, 1
>>>    s_cbranch_scc0  label_001F
>>> -  //
>>> -  // CLEAR SGPRs
>>> -  //
>>> +/*
>>> + * CLEAR SGPRs
>>> + */
>>>  label_clean_sgpr_1:
>>> -  s_mov_b32     m0, 0x0000005c   // Loop 96/4=24 times  (loop 
>>> unrolled for performance)
>>> +  s_mov_b32     m0, 0x0000005c   /* Loop 96/4=24 times  (loop 
>>> unrolled for performance) */
>>>    s_nop 0
>>>  label_sgpr_loop:
>>>    s_movreld_b32     s0, 0
>>> @@ -114,25 +119,25 @@ label_sgpr_loop:
>>>    s_sub_u32         m0, m0, 4
>>>    s_cbranch_scc0  label_sgpr_loop
>>>
>>> -  //clear vcc, flat scratch
>>> -  s_mov_b32 flat_scratch_lo, 0   //clear  flat scratch lo SGPR
>>> -  s_mov_b32 flat_scratch_hi, 0   //clear  flat scratch hi SGPR
>>> -  s_mov_b64 vcc, 0               //clear vcc
>>> -  s_mov_b64 ttmp0, 0             //Clear ttmp0 and ttmp1
>>> -  s_mov_b64 ttmp2, 0             //Clear ttmp2 and ttmp3
>>> -  s_mov_b64 ttmp4, 0             //Clear ttmp4 and ttmp5
>>> -  s_mov_b64 ttmp6, 0             //Clear ttmp6 and ttmp7
>>> -  s_mov_b64 ttmp8, 0             //Clear ttmp8 and ttmp9
>>> -  s_mov_b64 ttmp10, 0            //Clear ttmp10 and ttmp11
>>> -  s_mov_b64 ttmp12, 0            //Clear ttmp12 and ttmp13
>>> -  s_mov_b64 ttmp14, 0            //Clear ttmp14 and ttmp15
>>> +  /* clear vcc, flat scratch */
>>> +  s_mov_b32 flat_scratch_lo, 0   /* clear flat scratch lo SGPR */
>>> +  s_mov_b32 flat_scratch_hi, 0   /* clear flat scratch hi SGPR */
>>> +  s_mov_b64 vcc, 0               /* clear vcc */
>>> +  s_mov_b64 ttmp0, 0             /* Clear ttmp0 and ttmp1 */
>>> +  s_mov_b64 ttmp2, 0             /* Clear ttmp2 and ttmp3 */
>>> +  s_mov_b64 ttmp4, 0             /* Clear ttmp4 and ttmp5 */
>>> +  s_mov_b64 ttmp6, 0             /* Clear ttmp6 and ttmp7 */
>>> +  s_mov_b64 ttmp8, 0             /* Clear ttmp8 and ttmp9 */
>>> +  s_mov_b64 ttmp10, 0            /* Clear ttmp10 and ttmp11 */
>>> +  s_mov_b64 ttmp12, 0            /* Clear ttmp12 and ttmp13 */
>>> +  s_mov_b64 ttmp14, 0            /* Clear ttmp14 and ttmp15 */
>>>  s_endpgm
>>>
>>>  label_0023:
>>>
>>>    s_sethalt 1
>>>
>>> -  s_mov_b32     m0, 0x0000005c   // Loop 96/4=24 times  (loop 
>>> unrolled for performance)
>>> +  s_mov_b32     m0, 0x0000005c   /* Loop 128/8=16 times (loop 
>>> unrolled for performance) */
>>>    s_nop 0
>>>  label_sgpr_loop1:
>>>
>>> @@ -143,10 +148,10 @@ label_sgpr_loop1:
>>>    s_sub_u32         m0, m0, 4
>>>    s_cbranch_scc0  label_sgpr_loop1
>>>
>>> -  //clear vcc, flat scratch
>>> -  s_mov_b32 flat_scratch_lo, 0   //clear  flat scratch lo SGPR
>>> -  s_mov_b32 flat_scratch_hi, 0   //clear  flat scratch hi SGPR
>>> -  s_mov_b64 vcc, 0xee            //clear vcc
>>> +  /* clear vcc, flat scratch */
>>> +  s_mov_b32 flat_scratch_lo, 0   /* clear  flat scratch lo SGPR */
>>> +  s_mov_b32 flat_scratch_hi, 0   /* clear  flat scratch hi SGPR */
>>> +  s_mov_b64 vcc, 0xee            /* clear vcc */
>>>
>>>  s_endpgm
>>>  end
>>> -- 
>>> 2.34.1
>>>
>>