[PATCH] drm/amdgpu: Fix Manual Execution of Cleaner Shader in Gang Submissions
SRINIVASAN SHANMUGAM
srinivasan.shanmugam at amd.com
Fri Mar 28 15:26:30 UTC 2025
On 3/28/2025 8:28 PM, Alex Deucher wrote:
> On Thu, Mar 27, 2025 at 9:50 AM Christian König
> <christian.koenig at amd.com> wrote:
>> On 27.03.25 at 10:37, SRINIVASAN SHANMUGAM wrote:
>>
>> On 3/27/2025 2:54 PM, Christian König wrote:
>>
>> Overall this change doesn't seem to make much sense to me.
>>
>> Why exactly is isolation->spearhead not pointing to the dummy kernel job we submit?
>>
>> Does the owner check or gang_submit check in
>> amdgpu_device_enforce_isolation() fail to set up the spearhead?
>>
>> I'm currently debugging exactly that.
>>
>> Good news is that I can reproduce the problem.
>>
>>
>> I have to take that back. I've tested the cleaner shader functionality a bit this morning and as far as I can see this works exactly as intended.
>>
>> Srini, what exactly is your use case which doesn't work?
>>
>> Hi Christian, Good Morning!
>>
>> The use case is to trigger the cleaner shader through the sysfs entry "run_cleaner_shader", independently of "enforce_isolation", so that the cleaner shader packet is submitted to the COMP_1.0.0 ring by default, without first enabling enforce_isolation via sysfs.
>>
>>
>> I've tested exactly that and it seems to work perfectly fine:
>> kworker/u96:1-209 [020] ..... 86.655999: amdgpu_isolation: prev=0000000000000000, next=ffffffffffffffff
>> kworker/u96:1-209 [020] ..... 86.656190: amdgpu_cleaner_shader: ring=gfx_0.0.0, seqno=2
>> <...>-11 [022] ..... 150.607688: amdgpu_isolation: prev=ffffffffffffffff, next=0000000000000000
>> kworker/u96:0-11 [022] ..... 150.608228: amdgpu_cleaner_shader: ring=comp_1.0.0, seqno=2
>> kworker/u96:0-11 [022] ..... 150.620597: amdgpu_isolation: prev=0000000000000000, next=ffffffffffffffff
>> kworker/u96:0-11 [022] ..... 150.620624: amdgpu_cleaner_shader: ring=gfx_0.0.0, seqno=1527
>>
>>
>> The only thing which might be confusing is that if you issue the cleaner shader multiple times while the GPU is idle, it only runs once.
>>
>> But that should be easy to change if necessary.
> The problem is that it doesn't take into account KFD jobs. We need to
> be able to run the cleaner shader even if there have been no KGD jobs,
>
> Alex
Thanks a lot for pointing that out, Alex!
Yes, I think the issue is that the "run_cleaner_shader" sysfs entry is
not associated with any owner, since it comes from an empty kernel job.
[Typically I used to run "run_cleaner_shader" via sysfs (with the old
enforce_isolation code) from the terminal and expected the cleaner
shader to be triggered, and I was expecting the same with the new
enforce_isolation code, before running any live app, for example
IGT_COMP.]
With the new code it looks like it works this way: it first checks for
owners, so the "run_cleaner_shader" sysfs entry submits the cleaner
shader packet only if some app (for example IGT_COMP) has previously
submitted jobs. Note that in the run below we do not enable
"enforce_isolation" via sysfs before running the IGT_COMP app. Once
IGT_COMP has run, triggering the "run_cleaner_shader" sysfs entry does
submit the cleaner shader packet:
root at rtg-navi32:/home/rtg# ./install.sh -> Install amdgpu driver
rm ....
cp ....
unloading existing amdgpu driver ...
loading amdgpu driver ...
root at rtg-navi32:/home/rtg#
root at rtg-navi32:/home/rtg# ./run_igt_test_COMPUTE.sh
IGT-Version: 1.28-ga7ef4e2ba (x86_64) (Linux:
6.12.0-amdstaging-drm-next-lol-050225 x86_64)
Using IGT_SRANDOM=1743174485 for randomisation
Opened device: /dev/dri/card0
Initialized amdgpu, driver version 3.63
amdgpu: GFX1101 (family_id, chip_external_rev): (145, 32)
amdgpu: chip_class GFX11
Starting subtest: cs-compute-with-IP-COMPUTE
Starting dynamic subtest: cs-compute
Dynamic subtest cs-compute: SUCCESS (0.131s)
Subtest cs-compute-with-IP-COMPUTE: SUCCESS (0.131s)
root at rtg-navi32:/home/rtg#
root at rtg-navi32:/home/rtg#
root at rtg-navi32:/home/rtg#
root at rtg-navi32:/home/rtg# ./run_cleaner_shader.sh
root at rtg-navi32:/home/rtg#
Dmesg:
~$ sudo dmesg -C && sudo dmesg -w
[46759.723734] Console: switching to colour dummy device 80x25
[46759.858134] amdgpu 0000:0b:00.0: amdgpu: amdgpu: finishing device.
[46760.059772] [drm] amdgpu: ttm finalized
[46763.899566] ACPI: bus type drm_connector unregistered
[46764.223941] ACPI: bus type drm_connector registered
[46766.048868] Setting dangerous option gpu_recovery - tainting kernel
[46766.048876] Setting dangerous option no_queue_eviction_on_vm_fault -
tainting kernel
[46766.048880] Setting dangerous option halt_if_hws_hang - tainting kernel
[46766.132452] [drm] amdgpu kernel modesetting enabled.
[46766.160768] amdgpu: Virtual CRAT table created for CPU
[46766.161561] amdgpu: Topology: Add CPU node
[46766.162282] [drm] initializing kernel modesetting (IP DISCOVERY
0x1002:0x747E 0x1002:0x0E37 0xFF).
[46766.162322] [drm] register mmio base: 0xFCC00000
[46766.162325] [drm] register mmio size: 1048576
[46766.172772] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 0
<soc21_common>
[46766.172778] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 1
<gmc_v11_0>
[46766.172782] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 2
<ih_v6_0>
[46766.172786] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 3 <psp>
[46766.172789] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 4 <smu>
[46766.172793] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 5 <dm>
[46766.172796] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 6
<gfx_v11_0>
[46766.172800] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 7
<sdma_v6_0>
[46766.172803] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 8
<vcn_v4_0>
[46766.172807] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 9
<jpeg_v4_0>
[46766.172810] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 10
<mes_v11_0>
[46766.186911] amdgpu 0000:0b:00.0: No more image in the PCI ROM
[46766.186937] amdgpu 0000:0b:00.0: amdgpu: Fetched VBIOS from ROM BAR
[46766.186944] amdgpu: ATOM BIOS: 113-D7120601-4
[46766.188597] amdgpu 0000:0b:00.0: amdgpu: CP RS64 enable
[46766.190411] amdgpu 0000:0b:00.0: vgaarb: deactivate vga console
[46766.190417] amdgpu 0000:0b:00.0: amdgpu: Trusted Memory Zone (TMZ)
feature not supported
[46766.190433] amdgpu 0000:0b:00.0: amdgpu: MODE1 reset
[46766.190437] amdgpu 0000:0b:00.0: amdgpu: GPU mode1 reset
[46766.190611] amdgpu 0000:0b:00.0: amdgpu: GPU smu mode1 reset
[46766.711756] amdgpu 0000:0b:00.0: amdgpu: MEM ECC is not presented.
[46766.711764] amdgpu 0000:0b:00.0: amdgpu: SRAM ECC is not presented.
[46766.711805] amdgpu 0000:0b:00.0: amdgpu: DF poison setting is
inconsistent(1:0:0:0)!
[46766.711811] amdgpu 0000:0b:00.0: amdgpu: Poison setting is
inconsistent in DF/UMC(0:1)!
[46766.711832] [drm] vm size is 262144 GB, 4 levels, block size is
9-bit, fragment size is 9-bit
[46766.711846] amdgpu 0000:0b:00.0: amdgpu: VRAM: 12272M
0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[46766.711852] amdgpu 0000:0b:00.0: amdgpu: GART: 512M
0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[46766.711868] [drm] Detected VRAM RAM=12272M, BAR=256M
[46766.711872] [drm] RAM width 192bits GDDR6
[46766.713643] [drm] amdgpu: 12272M of VRAM memory ready
[46766.713648] [drm] amdgpu: 7915M of GTT memory ready.
[46766.713770] [drm] GART: num cpu pages 131072, num gpu pages 131072
[46766.714036] [drm] PCIE GART of 512M enabled (table at
0x0000008000000000).
[46766.716878] [drm] Loading DMUB firmware via PSP: version=0x07002D00
[46766.716905] KGD Cleaner shader +++++++++++ Enabled cleaner shader in
gfx_v11_0_3
[46766.716908] KGD Cleaner shader +++++++++++ Initializing cleaner
shader software in gfx_v11_0_3
[46766.716918] KGD Cleaner shader +++++++++++ Cleaner shader size: 240
[46766.717280] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717284] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717376] KGD Cleaner shader +++++++++++ Exiting gfx_v11_0_sw_init
[46766.717447] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717454] amdgpu 0000:0b:00.0: amdgpu: Found VCN firmware Version
ENC: 1.23 DEC: 9 VEP: 0 Revision: 15
[46766.717591] amdgpu 0000:0b:00.0: amdgpu: Found VCN firmware Version
ENC: 1.23 DEC: 9 VEP: 0 Revision: 15
[46766.717723] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.794246] amdgpu 0000:0b:00.0: amdgpu: reserve 0xa700000 from
0x82e0000000 for PSP TMR
[46767.038502] amdgpu 0000:0b:00.0: amdgpu: RAP: optional rap ta ucode
is not available
[46767.038512] amdgpu 0000:0b:00.0: amdgpu: SECUREDISPLAY: securedisplay
ta ucode is not available
[46767.038593] amdgpu 0000:0b:00.0: amdgpu: smu driver if version =
0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw
version = 0x00505300 (80.83.0)
[46767.038598] amdgpu 0000:0b:00.0: amdgpu: SMU driver if version not
matched
[46767.148050] amdgpu 0000:0b:00.0: amdgpu: SMU is initialized successfully!
[46767.148935] [drm] Display Core v3.2.326 initialized on DCN 3.2
[46767.148941] [drm] DP-HDMI FRL PCON supported
[46767.151156] [drm] DMUB hardware initialized: version=0x07002D00
[46767.159473] snd_hda_intel 0000:0b:00.1: bound 0000:0b:00.0 (ops
amdgpu_dm_audio_component_bind_ops [amdgpu])
[46767.193422] amdgpu 0000:0b:00.0: amdgpu: Setup CP MES MSCRATCH
address : 0x80. 0x184000
[46767.195715] KGD Cleaner shader +++++++++++ Entering
gfx11_kiq_set_resources for ring: 0000000079456a04
[46767.195727] KGD Cleaner shader +++++++++++ Cleaner shader MC address
in gfx11_kiq_set_resources: 80001030 queue_mask ffffffffffffffff
[46767.195732] KGD Cleaner shader +++++++++++ Exiting
gfx11_kiq_set_resources for ring: 0000000079456a04
[46767.286986] amdgpu: HMM registered 12272MB device memory
[46767.289999] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[46767.290025] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[46767.290386] amdgpu: Virtual CRAT table created for GPU
[46767.291758] amdgpu: Topology: Add dGPU node [0x747e:0x1002]
[46767.291765] kfd kfd: amdgpu: added device 1002:747e
[46767.291791] amdgpu 0000:0b:00.0: amdgpu: SE 3, SH per SE 2, CU per SH
10, active_cu_number 54
[46767.291824] amdgpu 0000:0b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv
eng 0 on hub 0
[46767.291829] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.0 uses VM inv
eng 1 on hub 0
[46767.291832] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.0 uses VM inv
eng 4 on hub 0
[46767.291836] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.0 uses VM inv
eng 6 on hub 0
[46767.291839] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.0 uses VM inv
eng 7 on hub 0
[46767.291842] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.1 uses VM inv
eng 8 on hub 0
[46767.291845] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.1 uses VM inv
eng 9 on hub 0
[46767.291848] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.1 uses VM inv
eng 10 on hub 0
[46767.291852] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.1 uses VM inv
eng 11 on hub 0
[46767.291855] amdgpu 0000:0b:00.0: amdgpu: ring sdma0 uses VM inv eng
12 on hub 0
[46767.291858] amdgpu 0000:0b:00.0: amdgpu: ring sdma1 uses VM inv eng
13 on hub 0
[46767.291861] amdgpu 0000:0b:00.0: amdgpu: ring vcn_unified_0 uses VM
inv eng 0 on hub 8
[46767.291864] amdgpu 0000:0b:00.0: amdgpu: ring vcn_unified_1 uses VM
inv eng 1 on hub 8
[46767.291868] amdgpu 0000:0b:00.0: amdgpu: ring jpeg_dec uses VM inv
eng 4 on hub 8
[46767.291871] amdgpu 0000:0b:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM
inv eng 14 on hub 0
[46767.293485] amdgpu 0000:0b:00.0: amdgpu: Using BAMACO for runtime pm
[46767.300079] [drm] Initialized amdgpu 3.63.0 for 0000:0b:00.0 on minor 0
[46767.300917] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46767.300925] - Enable cleaner shader: 1
[46767.300927] - Emit cleaner shader: 0000000000000000
[46767.300931] - Job base s_fence is not NULL: 0
[46767.300934] - Job base s_fence is NULL
[46767.300936] - Isolation spearhead: 00000000712ed22d
[46767.300939] - Fence is scheduled == isolation spearhead: 0
[46767.300942] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46767.300945] AMDGPU VM Flush: No operations needed, exiting
[46767.300955] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46767.300958] - Enable cleaner shader: 1
[46767.300961] - Emit cleaner shader: 0000000000000000
[46767.300963] - Job base s_fence is not NULL: 0
[46767.300966] - Job base s_fence is NULL
[46767.300969] - Isolation spearhead: 00000000712ed22d
[46767.300972] - Fence is scheduled == isolation spearhead: 0
...
[46781.441652] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46781.441657] - Enable cleaner shader: 1
[46781.441661] - Emit cleaner shader: 00000000b46f6457
[46781.441665] - Job base s_fence is not NULL: 1
[46781.441669] - Job base s_fence address: 00000000246a799b
[46781.441673] - Job base s_fence scheduled address: 00000000246a799b
[46781.441677] - Isolation spearhead: 00000000a1dbd218
[46781.441681] - Fence is scheduled == isolation spearhead: 0
[46781.441685] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46781.441774] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46781.441779] - Enable cleaner shader: 1
[46781.441783] - Emit cleaner shader: 0000000000000000
[46781.441787] - Job base s_fence is not NULL: 1
[46781.441791] - Job base s_fence address: 0000000096d4591a
[46781.441795] - Job base s_fence scheduled address: 0000000096d4591a
[46781.441799] - Isolation spearhead: 00000000a1dbd218
[46781.441803] - Fence is scheduled == isolation spearhead: 0
[46781.441808] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46781.441812] AMDGPU VM Flush: No operations needed, exiting
[46781.441921] [IGT] amd_basic: finished subtest cs-compute, SUCCESS
[46781.442094] [IGT] amd_basic: finished subtest
cs-compute-with-IP-COMPUTE, SUCCESS
[46781.457577] [IGT] amd_basic: exiting, ret=0
[46781.474178] Console: switching to colour frame buffer device 160x64
**root at rtg-navi32:/home/rtg# ./run_cleaner_shader.sh**
[46791.806206] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46791.806215] - Enable cleaner shader: 1
[46791.806219] - Emit cleaner shader: 00000000b46f6457
[46791.806224] - Job base s_fence is not NULL: 1
[46791.806228] - Job base s_fence address: 00000000d251a46d
[46791.806232] - Job base s_fence scheduled address: 00000000d251a46d
[46791.806236] - Isolation spearhead: 00000000d251a46d
[46791.806240] - Fence is scheduled == isolation spearhead: 1
[46791.806244] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 1
[46791.806249] amdgpu 0000:0b:00.0: amdgpu: KGD Cleaner shader
+++++++++++: Emitting cleaner shader in amdgpu_vm_flush() for ring:
comp_1.0.0
[46791.806255] KGD Cleaner shader +++++++++++ Entering
gfx_v11_0_ring_emit_cleaner_shader for ring: comp_1.0.0
[46791.806260] KGD Cleaner shader +++++++++++ SENDING OUT CLEANER_SHADER
PACKET3_RUN_CLEANER_SHADER onto ring: comp_1.0.0, pipe id: 0, queue id:
0 +++++++++++++++++++++
[46791.806264] KGD Cleaner shader +++++++++++ Cleaner shader completed
on ring: comp_1.0.0 in 0 ms
[46791.806269] KGD Cleaner shader +++++++++++ Exiting
gfx_v11_0_ring_emit_cleaner_shader for ring: comp_1.0.0
[46791.806274] KGD Gangsubmit Enforce Isolation +++++++++++: Executing
cleaner shader for job 0000000051e25f9f on ring comp_1.0.0
I think we somehow need to take care of handling enforce_isolation for
empty kernel jobs going to COMP_1.0.0 before any real application (for
example "IGT_COMP") has run, i.e. something like an
"isolation->owner != owner" check in the path
amdgpu_gfx_run_cleaner_shader -> amdgpu_gfx_run_cleaner_shader_job, so
that these fence addresses end up equal:
"&job->base.s_fence->scheduled == isolation->spearhead;"
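To make the failing condition concrete, here is a minimal userspace sketch of the check as it appears in the debug log above. It is not the kernel code: all "model_*" types and functions are hypothetical stand-ins, and the real amdgpu_vm_flush() may test further conditions that my debug prints do not show. It only models the four logged terms (enable flag, emit callback present, s_fence not NULL, scheduled fence address equal to the spearhead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * named in the debug log are modeled. */
struct model_fence { int unused; };

struct model_sched_fence {
    struct model_fence scheduled;        /* job->base.s_fence->scheduled */
};

struct model_job {
    struct model_sched_fence *s_fence;   /* NULL: "Job base s_fence is NULL" */
};

struct model_isolation {
    const struct model_fence *spearhead; /* isolation->spearhead */
};

struct model_ring {
    bool enable_cleaner_shader;          /* "Enable cleaner shader" */
    bool has_emit_cleaner_shader;        /* "Emit cleaner shader" is non-NULL */
};

/* Conjunction of the logged conditions. The last term is the pointer
 * comparison quoted above; it is only evaluated when s_fence is set. */
bool model_cleaner_shader_needed(const struct model_ring *ring,
                                 const struct model_job *job,
                                 const struct model_isolation *isolation)
{
    return ring->enable_cleaner_shader &&
           ring->has_emit_cleaner_shader &&
           job->s_fence != NULL &&
           &job->s_fence->scheduled == isolation->spearhead;
}

/* Replays two situations from the log: spearhead pointing at another
 * fence ("needed: 0"), then at this job's scheduled fence ("needed: 1"). */
void model_replay_log_cases(void)
{
    struct model_ring ring = { true, true };
    struct model_sched_fence job_fence = { { 0 } };
    struct model_sched_fence other_fence = { { 0 } };
    struct model_job job = { &job_fence };
    struct model_isolation isolation = { &other_fence.scheduled };

    assert(!model_cleaner_shader_needed(&ring, &job, &isolation));

    isolation.spearhead = &job_fence.scheduled;
    assert(model_cleaner_shader_needed(&ring, &job, &isolation));
}
```

In this model the result depends only on whether something has pointed the spearhead at the job's own scheduled fence before the flush, which is the step I believe is missing for the sysfs-triggered kernel job.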
Best regards,
Srini
>> Regards,
>> Christian.
>>
>> AFAIK, the "isolation->spearhead" initialization is not taken care of in the path "amdgpu_gfx_run_cleaner_shader -> amdgpu_gfx_run_cleaner_shader_job" (i.e., when we trigger the cleaner shader through the sysfs entry "run_cleaner_shader"). The problem shows up in the check "&job->base.s_fence->scheduled == isolation->spearhead;" in the "amdgpu_vm_flush()" function: the address "&job->base.s_fence->scheduled" does not match the "isolation->spearhead" address, so the check evaluates to zero and the cleaner shader is not emitted when it is run through the "run_cleaner_shader" sysfs entry.
>>
>> Best regards,
>>
>> Srini
>>
>>
>> Regards,
>> Christian.
>>
>> Regards,
>> Christian.
>>
>>