[PATCH] drm/amdgpu: Fix Manual Execution of Cleaner Shader in Gang Submissions
Alex Deucher
alexdeucher at gmail.com
Wed Mar 26 13:55:59 UTC 2025
On Wed, Mar 26, 2025 at 4:13 AM Christian König
<christian.koenig at amd.com> wrote:
>
> Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam:
> > This commit addresses the issue where the cleaner shader was not
> > correctly executed during gang submissions due to improper handling of
> > the isolation spearhead.
> >
> > - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to
> > initialize `isolation->spearhead` with the job's scheduled fence for
> > cleaner shader calls.
> > - Updated the `amdgpu_vm_flush` function to properly initialize
> > `isolation->spearhead` to the job's scheduled fence when the cleaner
> > shader is required.
> >
> > Cc: Christian König <christian.koenig at amd.com>
> > Cc: Alex Deucher <alexander.deucher at amd.com>
> > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++-
> > 2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > index 72af5e5a894a..807f17093006 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > @@ -1436,6 +1436,7 @@ static ssize_t amdgpu_gfx_get_available_compute_partition(struct device *dev,
> > static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > + struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
> > struct drm_gpu_scheduler *sched = &ring->sched;
> > struct drm_sched_entity entity;
> > struct dma_fence *f;
> > @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
> > ib->ptr[i] = ring->funcs->nop;
> > ib->length_dw = ring->funcs->align_mask + 1;
> >
> > + if (job->base.s_fence)
> > + isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
> > +
>
> Apart from being very risky because of not grabbing locks that will leak the previous isolation->spearhead fence.
>
> > f = amdgpu_job_submit(job);
> >
> > r = dma_fence_wait(f, false);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index b5ddfcbbc9fc..e23400b53489 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
> > if (need_pipe_sync)
> > amdgpu_ring_emit_pipeline_sync(ring);
> >
> > - if (cleaner_shader_needed)
> > + if (cleaner_shader_needed) {
> > + isolation->spearhead = dma_fence_get(&job->base.s_fence->scheduled);
>
> Same here.
>
> Over all this change doesn't seem to make much sense to me.
>
> Why exactly is isolation->spearhead not pointing to the dummy kernel job we submit?
Does the owner check or gang_submit check in
amdgpu_device_enforce_isolation() fail to set up the spearhead?
if (isolation->owner != owner) {
if (!job->gang_submit) {
dep = amdgpu_device_get_gang(adev);
if (!dma_fence_is_signaled(dep))
goto out_return_dep;
dma_fence_put(dep);
}
Alex
>
> Regards,
> Christian.
>
> > ring->funcs->emit_cleaner_shader(ring);
> > + }
> >
> > if (vm_flush_needed) {
> > trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr);
>
More information about the amd-gfx
mailing list