[PATCH 2/2] drm: add tdr support for embeded hw_fence
Jingwen Chen
Jingwen.Chen2 at amd.com
Thu Jul 22 15:08:47 UTC 2021
On Thu Jul 22, 2021 at 10:45:40AM -0400, Andrey Grodzovsky wrote:
>
> On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> > On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> > > On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > > > [Why]
> > > > After embeded hw_fence to amdgpu_job, we need to add tdr support
> > > > for this feature.
> > > >
> > > > [How]
> > > > 1. Add a resubmit_flag for resubmit jobs.
> > > > 2. Clear job fence from RCU and force complete vm flush fences in
> > > > pre_asic_reset
> > > > 3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
> > > > for guilty jobs.
> > > >
> > > > Signed-off-by: Jack Zhang <Jack.Zhang1 at amd.com>
> > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
> > > > ---
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++-
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 +++++++++++-----
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 4 +++-
> > > > drivers/gpu/drm/scheduler/sched_main.c | 1 +
> > > > include/drm/gpu_scheduler.h | 1 +
> > > > 5 files changed, 27 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 40461547701a..fe0237f72a09 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
> > > > int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > struct amdgpu_reset_context *reset_context)
> > > > {
> > > > - int i, r = 0;
> > > > + int i, j, r = 0;
> > > > struct amdgpu_job *job = NULL;
> > > > bool need_full_reset =
> > > > test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
> > > > @@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > if (!ring || !ring->sched.thread)
> > > > continue;
> > > > + /*clear job fence from fence drv to avoid force_completion
> > > > + *leave NULL and vm flush fence in fence drv */
> > > > + for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
> > > > + struct dma_fence *old,**ptr;
> > > > + ptr = &ring->fence_drv.fences[j];
> > > > + old = rcu_dereference_protected(*ptr, 1);
> > > > + if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, &old->flags))) {
> > > > + RCU_INIT_POINTER(*ptr, NULL);
> > > > + }
> > >
> > > Is this to avoid premature job free because of dma_fence_put inside
> > > amdgpu_fence_process ?
> > > I can't currently remember why but we probably want all the HW fences
> > > currently in the ring to
> > > be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
> > > inside amdgpu_fence_process
> > > and still do the signaling but not the dma_fence_put part
> > >
> > > Andrey
> > Hi Andrey,
> >
> > This is to avoid signaling the same fence twice. If we still do the
> > signaling, then the job in the pending list will be signaled first in
> > force_completion, and later be signaled in resubmit. This will go to
> > BUG() in amdgpu_fence_process.
>
>
> Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting it
> before calling
> amdgpu_fence_driver_force_completion and resetting it after, then inside
> amdgpu_fence_driver_force_completion
> you can just skip the signaling part with this flag for fences with
> DMA_FENCE_FLAG_USER_BITS set
> Less lines of code at least.
>
> Andrey
Hi Andrey,
In this way, this issue still exists. If we just skip it in the
force_completion, these fences are still in the RCU fence array. So when
the FLR finishes, the eop interrupt function of ib_test will call
amdgpu_fence_process, the skipped fences will be signaled along with
ib_test fences and signaled again during resubmission.
Best Regards,
JingWen Chen
>
> >
> > > > + }
> > > > /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > > > amdgpu_fence_driver_force_completion(ring);
> > > > }
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > index eecf21d8ec33..815776c9a013 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > @@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
> > > > job->ring = ring;
> > > > }
> > > > - seq = ++ring->fence_drv.sync_seq;
> > > > - dma_fence_init(fence, &amdgpu_fence_ops,
> > > > - &ring->fence_drv.lock,
> > > > - adev->fence_context + ring->idx,
> > > > - seq);
> > > > + if (job != NULL && job->base.resubmit_flag == 1) {
> > > > + /* reinit seq for resubmitted jobs */
> > > > + seq = ++ring->fence_drv.sync_seq;
> > > > + fence->seqno = seq;
> > > > + } else {
> > > > + seq = ++ring->fence_drv.sync_seq;
> > >
> > > Seems like you could do the above line only once above if-else as it was
> > > before
> > Sure, I will modify this.
> >
> >
> > Best Regards,
> > JingWen Chen
> > > > + dma_fence_init(fence, &amdgpu_fence_ops,
> > > > + &ring->fence_drv.lock,
> > > > + adev->fence_context + ring->idx,
> > > > + seq);
> > > > + }
> > > > if (job != NULL) {
> > > > /* mark this fence has a parent job */
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > index 7c426e225b24..d6f848adc3f4 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > @@ -241,6 +241,7 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> > > > dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> > > > if (finished->error < 0) {
> > > > + dma_fence_put(&job->hw_fence);
> > > > DRM_INFO("Skip scheduling IBs!\n");
> > > > } else {
> > > > r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
> > > > @@ -249,7 +250,8 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> > > > DRM_ERROR("Error scheduling IBs (%d)\n", r);
> > > > }
> > > > - dma_fence_get(fence);
> > > > + if (!job->base.resubmit_flag)
> > > > + dma_fence_get(fence);
> > > > amdgpu_job_free_resources(job);
> > > > fence = r ? ERR_PTR(r) : fence;
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index f4f474944169..5a36ab5aea2d 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -544,6 +544,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
> > > > dma_fence_set_error(&s_fence->finished, -ECANCELED);
> > > > dma_fence_put(s_job->s_fence->parent);
> > > > + s_job->resubmit_flag = 1;
> > > > fence = sched->ops->run_job(s_job);
> > > > i++;
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index 4ea8606d91fe..06c101af1f71 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -198,6 +198,7 @@ struct drm_sched_job {
> > > > enum drm_sched_priority s_priority;
> > > > struct drm_sched_entity *entity;
> > > > struct dma_fence_cb cb;
> > > > + int resubmit_flag;
> > > > };
> > > > static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
More information about the amd-gfx
mailing list