[PATCH] drm/amdgpu: Fix Manual Execution of Cleaner Shader in Gang Submissions

Alex Deucher alexdeucher at gmail.com
Wed Mar 26 13:55:59 UTC 2025


On Wed, Mar 26, 2025 at 4:13 AM Christian König
<christian.koenig at amd.com> wrote:
>
> Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam:
> > This commit addresses the issue where the cleaner shader was not
> > correctly executed during gang submissions due to improper handling of
> > the isolation spearhead.
> >
> > - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to
> >   initialize `isolation->spearhead` with the job's scheduled fence for
> >   cleaner shader calls.
> > - Updated the `amdgpu_vm_flush` function to properly initialize
> >   `isolation->spearhead` to the job's scheduled fence when the cleaner
> >   shader is required.
> >
> > Cc: Christian König <christian.koenig at amd.com>
> > Cc: Alex Deucher <alexander.deucher at amd.com>
> > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 4 +++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > index 72af5e5a894a..807f17093006 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > @@ -1436,6 +1436,7 @@ static ssize_t amdgpu_gfx_get_available_compute_partition(struct device *dev,
> >  static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
> >  {
> >       struct amdgpu_device *adev = ring->adev;
> > +     struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
> >       struct drm_gpu_scheduler *sched = &ring->sched;
> >       struct drm_sched_entity entity;
> >       struct dma_fence *f;
> > @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
> >               ib->ptr[i] = ring->funcs->nop;
> >       ib->length_dw = ring->funcs->align_mask + 1;
> >
> > +     if (job->base.s_fence)
> > +             isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
> > +
>
> Apart from being very risky because of not grabbing locks that will leak the previous isolation->spearhead fence.
>
> >       f = amdgpu_job_submit(job);
> >
> >       r = dma_fence_wait(f, false);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index b5ddfcbbc9fc..e23400b53489 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
> >       if (need_pipe_sync)
> >               amdgpu_ring_emit_pipeline_sync(ring);
> >
> > -     if (cleaner_shader_needed)
> > +     if (cleaner_shader_needed) {
> > +             isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
>
> Same here.
>
> Over all this change doesn't seem to make much sense to me.
>
> Why exactly is isolation->spearhead not pointing to the dummy kernel job we submit?

Does the owner check or gang_submit check in
amdgpu_device_enforce_isolation() fail to set up the spearhead?

 if (isolation->owner != owner) {
                if (!job->gang_submit) {
                        dep = amdgpu_device_get_gang(adev);
                        if (!dma_fence_is_signaled(dep))
                                goto out_return_dep;
            dma_fence_put(dep);
                }


Alex

>
> Regards,
> Christian.
>
> >               ring->funcs->emit_cleaner_shader(ring);
> > +     }
> >
> >       if (vm_flush_needed) {
> >               trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr);
>


More information about the amd-gfx mailing list