[PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
zhoucm1
david1.zhou at amd.com
Wed May 3 07:23:44 UTC 2017
On 2017-05-03 14:02, Liu, Monk wrote:
> You can add ctx as a field of job without taking a reference to it;
> when you try to use ctx, just check whether ctx == NULL.
>
> That doesn't work at all... job->ctx will always be non-NULL after it
> is initialized; you would just be following a dangling pointer after
> the ctx is released.
job->ctx would be a ctx ** (a pointer to the owner's ctx pointer), which could resolve this concern.
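(For illustration, the double-pointer idea could look roughly like the sketch below. This is a minimal userspace analog with invented stand-in types, not the actual amdgpu structures: the job stores the address of the owner's ctx slot, so it sees NULL once the owner clears that slot on release.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the amdgpu types, for illustration only. */
struct ctx { int dummy; };

struct job {
	/* Double indirection: the job points at the owner's ctx slot,
	 * not at the ctx itself, so it observes NULL after the owner
	 * clears the slot, instead of holding a dangling ctx pointer. */
	struct ctx **ctx_slot;
};

/* Returns NULL when the context has already been released. */
static struct ctx *job_get_ctx(const struct job *job)
{
	return job->ctx_slot ? *job->ctx_slot : NULL;
}
```

(Of course this only helps if the owner actually clears the slot under a lock the job side also takes; the sketch leaves locking out.)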
>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare
> job->fence->fence_context with ctx->ring[].entity->fence_context. If
> no match is found, the ctx has been freed; otherwise you can do your
> work on this ctx.
>
> 1) fence_context has a chance to incorrectly represent the context
> behind it, because the number space can be exhausted and will wrap
> around to the beginning
No, fence_context is globally unique within the kernel.
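(Assuming global uniqueness of fence_context values, the lookup described above could be sketched roughly as follows. This is a hedged userspace analog: a plain array stands in for the ctx_mgr idr, and the struct names are invented for illustration, not the real amdgpu types.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 2

/* Invented stand-in types, for illustration only. */
struct sched_entity { uint64_t fence_context; };
struct ctx { struct sched_entity rings[NUM_RINGS]; };

/*
 * Walk the registered contexts and compare the job's fence_context
 * against each entity's, mirroring the idr_for_each_entry() approach.
 * A NULL result means no live ctx owns that fence_context, i.e. the
 * ctx was already freed.
 */
static struct ctx *find_ctx_by_fence_context(struct ctx *const *mgr,
					     size_t count,
					     uint64_t fence_context)
{
	for (size_t i = 0; i < count; i++) {
		if (!mgr[i])
			continue;
		for (int r = 0; r < NUM_RINGS; r++)
			if (mgr[i]->rings[r].fence_context == fence_context)
				return mgr[i];
	}
	return NULL;
}
```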
Regards,
David Zhou
> 2) why not just keep the ctx alive, which is much simpler than this method
>
> BR
>
> -----Original Message-----
> From: Zhou, David(ChunMing)
> Sent: Wednesday, May 03, 2017 12:54 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Liu, Monk <Monk.Liu at amd.com>;
> Christian König <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> You can add ctx as a field of job without taking a reference to it;
> when you try to use ctx, just check whether ctx == NULL.
>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare
> job->fence->fence_context with ctx->ring[].entity->fence_context. If
> no match is found, the ctx has been freed; otherwise you can do your
> work on this ctx.
>
> Regards,
> David Zhou
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
> Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:57 AM
> To: Liu, Monk <Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> @Christian,
>
> The thing is I need to mark all entities behind this timed-out job as
> guilty, so I use a member in the entity named "ptr_guilty" that points
> to the member in amdgpu_ctx named "guilty". That way, reading
> *entity->ptr_guilty tells you whether this entity is "invalid", and
> setting *entity->ptr_guilty lets you mark all entities belonging to
> the context as "invalid".
>
> The above logic guarantees that we can kick out all guilty entities in
> the entity kfifo, and also block every IOCTL whose ctx handle points to
> this guilty context; we only recover other jobs/entities/contexts after
> the scheduler is unparked.
>
> If you reject the patch that keeps the ctx alive until all its jobs
> have signaled, please give me a solution that satisfies the above logic.
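(The ptr_guilty wiring described above can be sketched in isolation like so. This is a simplified userspace model with invented stand-in types; in the patch itself the wiring happens in amdgpu_ctx_init(), and kernel entities simply leave ptr_guilty NULL.)

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RINGS 2

/* Invented stand-ins for amd_sched_entity / amdgpu_ctx, for illustration. */
struct sched_entity {
	bool *ptr_guilty;	/* points into the owning ctx's guilty flag */
};

struct ctx {
	bool guilty;
	struct sched_entity rings[NUM_RINGS];
};

/* Point every entity at the shared per-context flag. */
static void ctx_init(struct ctx *c)
{
	c->guilty = false;
	for (int i = 0; i < NUM_RINGS; i++)
		c->rings[i].ptr_guilty = &c->guilty;
}

/* Setting the flag through any one entity marks the whole context. */
static void entity_mark_guilty(struct sched_entity *e)
{
	if (e->ptr_guilty)
		*e->ptr_guilty = true;
}

static bool entity_is_guilty(const struct sched_entity *e)
{
	return e->ptr_guilty && *e->ptr_guilty;
}
```

(Note this is exactly why the entity, and hence the ctx, must outlive the timed-out job: ptr_guilty aims into ctx memory.)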
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
> Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:31 AM
> To: Christian König <deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> 1. This is necessary; otherwise how can I access the entity pointer
> after a job has timed out? And why is it dangerous?
> 2. What is the status field in the fences you were referring to? I
> need to judge whether it can satisfy my requirement.
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 01, 2017 10:48 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> On 01.05.2017 at 09:22, Monk Liu wrote:
> > For the TDR guilty-context feature, we need to access ctx/s_entity
> > field members through the sched job pointer, so the ctx must be kept
> > alive until all jobs from it have signaled.
>
> NAK, that is unnecessary and quite dangerous.
>
> Instead, we have the status field designed into the fences, which
> should be checked for that.
>
> Regards,
> Christian.
>
> >
> > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9 +++++++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++++--
> > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
> > drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
> > 6 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index e330009..8e031d6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -760,10 +760,12 @@ struct amdgpu_ib {
> > uint32_t flags;
> > };
> >
> > +struct amdgpu_ctx;
> > +
> > extern const struct amd_sched_backend_ops amdgpu_sched_ops;
> >
> > int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > - struct amdgpu_job **job, struct amdgpu_vm *vm);
> > + struct amdgpu_job **job, struct amdgpu_vm *vm,
> > + struct amdgpu_ctx *ctx);
> > int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> > 	struct amdgpu_job **job);
> >
> > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
> >
> > struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
> > int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
> >
> > uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
> > 	struct fence *fence);
> > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
> > struct amdgpu_sync sync;
> > struct amdgpu_ib *ibs;
> > struct fence *fence; /* the hw fence */
> > + struct amdgpu_ctx *ctx;
> > uint32_t preamble_status;
> > uint32_t num_ibs;
> > void *owner;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 699f5fe..267fb65 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
> > }
> > }
> >
> > - ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> > + ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
> > if (ret)
> > goto free_all_kdata;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index b4bbbb3..81438af 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -25,6 +25,13 @@
> > #include <drm/drmP.h>
> > #include "amdgpu.h"
> >
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
> > +{
> > + if (ctx)
> > + kref_get(&ctx->refcount);
> > + return ctx;
> > +}
> > +
> > static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > unsigned i, j;
> > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > rq, amdgpu_sched_jobs);
> > if (r)
> > goto failed;
> > +
> > + ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
> > }
> >
> > return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 690ef3d..208da11 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
> > }
> >
> > int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > - struct amdgpu_job **job, struct amdgpu_vm *vm)
> > + struct amdgpu_job **job, struct amdgpu_vm *vm,
> > + struct amdgpu_ctx *ctx)
> > {
> > size_t size = sizeof(struct amdgpu_job);
> >
> > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > (*job)->vm = vm;
> > (*job)->ibs = (void *)&(*job)[1];
> > (*job)->num_ibs = num_ibs;
> > + (*job)->ctx = amdgpu_ctx_kref_get(ctx);
> >
> > amdgpu_sync_create(&(*job)->sync);
> >
> > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> > {
> > int r;
> >
> > - r = amdgpu_job_alloc(adev, 1, job, NULL);
> > + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
> > if (r)
> > return r;
> >
> > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
> > static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
> > {
> > struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
> > + struct amdgpu_ctx *ctx = job->ctx;
> > +
> > + if (ctx)
> > + amdgpu_ctx_put(ctx);
> >
> > fence_put(job->fence);
> > amdgpu_sync_free(&job->sync);
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > index 6f4e31f..9100ca8 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
> > if (!amd_sched_entity_is_initialized(sched, entity))
> > return;
> >
> > - /**
> > - * The client will not queue more IBs during this fini, consume existing
> > - * queued IBs
> > - */
> > - wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> > -
> > amd_sched_rq_remove_entity(rq, entity);
> > kfifo_free(&entity->job_queue);
> > }
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > index 8cb41d3..ccbbcb0 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > @@ -49,6 +49,7 @@ struct amd_sched_entity {
> >
> > struct fence *dependency;
> > struct fence_cb cb;
> > + bool *ptr_guilty;
> > };
> >
> > /**
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx