[PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
Christian König
deathsimple at vodafone.de
Wed May 3 09:23:05 UTC 2017
> Please give me some detail on the dma_fence approach, thanks!
>
I think the idea of using the fence_context is even better than using
the fence_status. It has already been present for a while, and filtering the
jobs/entities by it needs something like 10 lines of code.
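A completely untested sketch of the job-side filtering I have in mind (ring_mirror_list, node and s_fence are the fields we already have in gpu_scheduler.c/.h; the guilty_context parameter and the function name are made up for the example, and whether we free or merely skip the jobs is up for discussion):

static void amd_sched_drop_guilty_jobs(struct amd_gpu_scheduler *sched,
				       uint64_t guilty_context)
{
	struct amd_sched_job *s_job, *tmp;

	/* Drop every job on the recovery list which comes from the
	 * same fence context as the job that caused the timeout.
	 */
	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		if (s_job->s_fence->finished.context == guilty_context) {
			list_del_init(&s_job->node);
			sched->ops->free_job(s_job);
		}
	}
}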
The fence status is new and only available on newer kernels (we would
need to rebase to amd-staging-4.11).
I will try to dig up later today how the Intel guys solved that problem.
Christian.
On 03.05.2017 at 11:14, Liu, Monk wrote:
>
> I thought of that in the first place as well: keep an ID and
> check the ID of every job not yet signaled to see if the job
> belongs to something guilty. But keeping the ctx alive is simpler, so I
> chose this way.
>
> But I admit it is doable as well, and I want to compare this method
> and the dma_fence status field you mentioned,
>
> Please give me some detail on the dma_fence approach, thanks!
>
> BR Monk
>
> *From:* Christian König [mailto:deathsimple at vodafone.de]
> *Sent:* Wednesday, May 03, 2017 5:12 PM
> *To:* Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk
> <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> > 1) fence_context has a chance to incorrectly represent the context
> behind it, because the number can be used up and will wrap
> around from the beginning
>
> No, fence_context is globally unique in the kernel.
>
> Yeah, I would go with that as well.
>
> You note the fence_context of the job which caused the trouble.
>
> Then take a look at all jobs on the recovery list and remove the ones
> with the same fence_context number.
>
> Then take a look at all entities on the runlist and remove the one
> with the same fence_context number. If it is already freed you won't
> find any, but that shouldn't be a problem.
>
> This way you can effectively prevent other jobs from the same context
> from running and the context query can simply check if the entity is
> still on the runlist to figure out if it was guilty or not.
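>
> Rough sketch of the entity side (amd_sched_rq_remove_entity() is the
> existing helper; the run queue locking is glossed over here and the
> function name is made up):
>
> static void amd_sched_rq_kick_guilty(struct amd_sched_rq *rq,
> 				     uint64_t guilty_context)
> {
> 	struct amd_sched_entity *entity, *tmp;
>
> 	/* Kick the entity with the guilty fence context off the run
> 	 * queue so none of its remaining jobs gets scheduled again.
> 	 */
> 	list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> 		if (entity->fence_context == guilty_context)
> 			amd_sched_rq_remove_entity(rq, entity);
> 	}
> }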
>
> Regards,
> Christian.
>
> On 03.05.2017 at 09:23, zhoucm1 wrote:
>
> On 2017-05-03 14:02, Liu, Monk wrote:
>
> You can add ctx as a field of the job without taking a reference to it;
> when you try to use the ctx, just check if ctx == NULL.
> > that doesn't work at all... job->ctx will always be non-NULL
> after it is initialized, you would just refer to a wild pointer
> after the ctx is released
>
> job->ctx is a **ctx, which could resolve this concern of yours.
>
>
>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare
> job->fence->fence_context with
> ctx->ring[].entity->fence_context. If no match is found, the ctx has
> been freed; otherwise you can do your work on this ctx.
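>
> Something like this sketch (how to get from the job back to the right
> ctx_mgr is hand-waved here, the function name is made up, and taking
> mgr->lock is left out for brevity):
>
> static struct amdgpu_ctx *
> amdgpu_ctx_from_fence_context(struct amdgpu_ctx_mgr *mgr,
> 			      uint64_t fence_context)
> {
> 	struct amdgpu_ctx *ctx;
> 	unsigned i;
> 	int id;
>
> 	/* If no entity matches, the ctx behind the job was already
> 	 * freed; otherwise we get the still-alive ctx back.
> 	 */
> 	idr_for_each_entry(&mgr->ctx_handles, ctx, id) {
> 		for (i = 0; i < AMDGPU_MAX_RINGS; i++)
> 			if (ctx->rings[i].entity.fence_context == fence_context)
> 				return ctx;
> 	}
> 	return NULL;
> }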
> > 1) fence_context has a chance to incorrectly represent the
> context behind it, because the number can be used up and will
> wrap around from the beginning
>
> No, fence_context is globally unique in the kernel.
>
> Regards,
> David Zhou
>
> 2) why not just keep the CTX alive, which is much simpler than
> this method?
>
> BR
>
> -----Original Message-----
> From: Zhou, David(ChunMing)
> Sent: Wednesday, May 03, 2017 12:54 PM
> To: Liu, Monk <Monk.Liu at amd.com>;
> Christian König <deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
> job finished
>
> You can add ctx as a field of the job without taking a reference to it;
> when you try to use the ctx, just check if ctx == NULL.
>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare
> job->fence->fence_context with
> ctx->ring[].entity->fence_context. If no match is found, the ctx has
> been freed; otherwise you can do your work on this ctx.
>
> Regards,
> David Zhou
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
> On Behalf Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:57 AM
> To: Liu, Monk <Monk.Liu at amd.com>;
> Christian König <deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
> job finished
>
> @Christian,
>
> The thing is I need to mark all entities behind this timed out job
> as guilty, so I use a member in the entity named "ptr_guilty" that
> points to the member in amdgpu_ctx named "guilty". That way,
> reading *entity->ptr_guilty tells you whether this entity is
> "invalid", and setting *entity->ptr_guilty lets you mark all
> entities belonging to the context as "invalid".
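>
> In code the idea is roughly this (a sketch only; ptr_guilty and the
> ctx "guilty" member are what this series adds, the two helper names
> are made up):
>
> /* on timeout: mark every entity of the context behind the job */
> static void amd_sched_entity_set_guilty(struct amd_sched_entity *entity)
> {
> 	if (entity->ptr_guilty)
> 		*entity->ptr_guilty = true;
> }
>
> /* on submission: reject anything coming from a guilty context */
> static bool amd_sched_entity_is_guilty(struct amd_sched_entity *entity)
> {
> 	return entity->ptr_guilty && *entity->ptr_guilty;
> }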
>
> The above logic guarantees we can kick out all guilty
> entities in the entity kfifo, and also block every IOCTL carrying a ctx
> handle that points to this guilty context, and we only recover the
> other jobs/entities/contexts after the scheduler is unparked.
>
> If you reject the patch that keeps the ctx alive till all jobs are
> signaled, please give me a solution that satisfies the above logic.
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
> On Behalf Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:31 AM
> To: Christian König <deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
> job finished
>
> 1. This is necessary; otherwise how can I access the entity pointer
> after a job timed out, and why is it dangerous?
> 2. What's the status field in the fences you were referring to?
> I need to judge whether it could satisfy my requirement.
>
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 01, 2017 10:48 PM
> To: Liu, Monk <Monk.Liu at amd.com>;
> amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
> job finished
>
> On 01.05.2017 at 09:22, Monk Liu wrote:
> > for the TDR guilty context feature, we need to access the ctx/s_entity
> > field member through the sched job pointer, so the ctx must be kept
> > alive till all jobs from it have signaled.
>
> NAK, that is unnecessary and quite dangerous.
>
> Instead we have the designed status field in the fences which
> should be checked for that.
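>
> On amd-staging-4.11 that would look roughly like this (assuming the
> rebase where struct fence became struct dma_fence; the two helper
> names are made up for the example):
>
> /* in the timeout handler: flag the hw fence of the hanging job */
> static void amdgpu_job_mark_guilty(struct amdgpu_job *job)
> {
> 	dma_fence_set_error(job->fence, -ETIME);
> }
>
> /* in the ctx query: a negative status means the job was flagged,
>  * no ctx reference is needed for that
>  */
> static bool amdgpu_job_is_guilty(struct amdgpu_job *job)
> {
> 	return dma_fence_get_status(job->fence) < 0;
> }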
>
> Regards,
> Christian.
>
> >
> > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9 +++++++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++++--
> > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
> > drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
> > 6 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index e330009..8e031d6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -760,10 +760,12 @@ struct amdgpu_ib {
> > uint32_t flags;
> > };
> >
> > +struct amdgpu_ctx;
> > +
> > extern const struct amd_sched_backend_ops amdgpu_sched_ops;
> >
> > int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > - struct amdgpu_job **job, struct amdgpu_vm *vm);
> > + struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
> > int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> > struct amdgpu_job **job);
> >
> > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
> >
> > struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
> > int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
> >
> > uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
> > struct fence *fence);
> > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
> > struct amdgpu_sync sync;
> > struct amdgpu_ib *ibs;
> > struct fence *fence; /* the hw fence */
> > + struct amdgpu_ctx *ctx;
> > uint32_t preamble_status;
> > uint32_t num_ibs;
> > void *owner;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 699f5fe..267fb65 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
> > }
> > }
> >
> > - ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> > + ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
> > if (ret)
> > goto free_all_kdata;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index b4bbbb3..81438af 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -25,6 +25,13 @@
> > #include <drm/drmP.h>
> > #include "amdgpu.h"
> >
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
> > + if (ctx)
> > + kref_get(&ctx->refcount);
> > + return ctx;
> > +}
> > +
> > static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > {
> > unsigned i, j;
> > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > rq, amdgpu_sched_jobs);
> > if (r)
> > goto failed;
> > +
> > + ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
> > }
> >
> > return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 690ef3d..208da11 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
> > }
> >
> > int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > - struct amdgpu_job **job, struct amdgpu_vm *vm)
> > + struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
> > {
> > size_t size = sizeof(struct amdgpu_job);
> >
> > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > (*job)->vm = vm;
> > (*job)->ibs = (void *)&(*job)[1];
> > (*job)->num_ibs = num_ibs;
> > + (*job)->ctx = amdgpu_ctx_kref_get(ctx);
> >
> > amdgpu_sync_create(&(*job)->sync);
> >
> > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> > {
> > int r;
> >
> > - r = amdgpu_job_alloc(adev, 1, job, NULL);
> > + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
> > if (r)
> > return r;
> >
> > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
> > static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
> > {
> > struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
> > + struct amdgpu_ctx *ctx = job->ctx;
> > +
> > + if (ctx)
> > + amdgpu_ctx_put(ctx);
> >
> > fence_put(job->fence);
> > amdgpu_sync_free(&job->sync);
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > index 6f4e31f..9100ca8 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
> > if (!amd_sched_entity_is_initialized(sched, entity))
> > return;
> >
> > - /**
> > - * The client will not queue more IBs during this fini, consume existing
> > - * queued IBs
> > - */
> > - wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> > -
> > amd_sched_rq_remove_entity(rq, entity);
> > kfifo_free(&entity->job_queue);
> > }
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > index 8cb41d3..ccbbcb0 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > @@ -49,6 +49,7 @@ struct amd_sched_entity {
> >
> > struct fence *dependency;
> > struct fence_cb cb;
> > + bool *ptr_guilty;
> > };
> >
> > /**
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>