[PATCH] drm/amdgpu: Fix Manual Execution of Cleaner Shader in Gang Submissions

Christian König christian.koenig at amd.com
Wed Mar 26 13:58:37 UTC 2025


Am 26.03.25 um 14:55 schrieb Alex Deucher:
> On Wed, Mar 26, 2025 at 4:13 AM Christian König
> <christian.koenig at amd.com> wrote:
>> Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam:
>>> This commit addresses the issue where the cleaner shader was not
>>> correctly executed during gang submissions due to improper handling of
>>> the isolation spearhead.
>>>
>>> - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to
>>>   initialize `isolation->spearhead` with the job's scheduled fence for
>>>   cleaner shader calls.
>>> - Updated the `amdgpu_vm_flush` function to properly initialize
>>>   `isolation->spearhead` to the job's scheduled fence when the cleaner
>>>   shader is required.
>>>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 4 +++-
>>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index 72af5e5a894a..807f17093006 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -1436,6 +1436,7 @@ static ssize_t amdgpu_gfx_get_available_compute_partition(struct device *dev,
>>>  static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
>>>  {
>>>       struct amdgpu_device *adev = ring->adev;
>>> +     struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
>>>       struct drm_gpu_scheduler *sched = &ring->sched;
>>>       struct drm_sched_entity entity;
>>>       struct dma_fence *f;
>>> @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
>>>               ib->ptr[i] = ring->funcs->nop;
>>>       ib->length_dw = ring->funcs->align_mask + 1;
>>>
>>> +     if (job->base.s_fence)
>>> +             isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
>>> +
>> Apart from being very risky because of not grabbing locks that will leak the previous isolation->spearhead fence.
>>
>>>       f = amdgpu_job_submit(job);
>>>
>>>       r = dma_fence_wait(f, false);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index b5ddfcbbc9fc..e23400b53489 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
>>>       if (need_pipe_sync)
>>>               amdgpu_ring_emit_pipeline_sync(ring);
>>>
>>> -     if (cleaner_shader_needed)
>>> +     if (cleaner_shader_needed) {
>>> +             isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
>> Same here.
>>
>> Over all this change doesn't seem to make much sense to me.
>>
>> Why exactly is isolation->spearhead not pointing to the dummy kernel job we submit?
> Does the owner check or gang_submit check in
> amdgpu_device_enforce_isolation() fail to set up the spearhead?

I'm currently debugging exactly that.

Good news is that I can reproduce the problem.

Regards,
Christian.

>
>  if (isolation->owner != owner) {
>                 if (!job->gang_submit) {
>                         dep = amdgpu_device_get_gang(adev);
>                         if (!dma_fence_is_signaled(dep))
>                                 goto out_return_dep;
>             dma_fence_put(dep);
>                 }
>
>
> Alex
>
>> Regards,
>> Christian.
>>
>>>               ring->funcs->emit_cleaner_shader(ring);
>>> +     }
>>>
>>>       if (vm_flush_needed) {
>>>               trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr);



More information about the amd-gfx mailing list