<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">
<blockquote type="cite"><font size="2"><span
style="font-size:10pt;">
<div class="PlainText"> > 1) fence_context have chance to
incorrectly represent the context behind it, because the
number can be used up and will wrapped around from
beginning <br>
</div>
</span></font> <font size="2">No, fence_context is unique in
global in kernel.<br>
</font></blockquote>
Yeah, I would go with that as well.<br>
<br>
You note the fence_context of the job which caused the trouble.<br>
<br>
Then take a look at all jobs on the recovery list and remove the
ones with the same fence_context number.<br>
<br>
Then take a look at all entities on the runlist and remove the one
with the same fence_context number. If it is already freed you
won't find any, but that shouldn't be a problem.<br>
<br>
This way you effectively can prevent other jobs from the same
context from running and the context query can simply check if the
entity is still on the runlist to figure out if it was guilty or
not.<br>
<br>
Regards,<br>
Christian.<br>
<br>
Am 03.05.2017 um 09:23 schrieb zhoucm1:<br>
</div>
<blockquote cite="mid:59098580.6090204@amd.com" type="cite">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<br>
<br>
<div class="moz-cite-prefix">On 2017年05月03日 14:02, Liu, Monk
wrote:<br>
</div>
<blockquote
cite="mid:DM5PR12MB161082763FA0163FF22E1C1F84160@DM5PR12MB1610.namprd12.prod.outlook.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text -->
<style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">You can add ctx as filed of job, but
not get reference of it, when you try to use ctx, just
check if ctx == NULL.<br>
> that doesn't work at all... job->ctx will always
be non-NULL after it is initialized, you just refer to a
wild pointer after CTX released <br>
</div>
</span></font></blockquote>
<font size="2">job->ctx is a **ctx, which could resolve your
this concern.<br>
<br>
</font>
<blockquote
cite="mid:DM5PR12MB161082763FA0163FF22E1C1F84160@DM5PR12MB1610.namprd12.prod.outlook.com"
type="cite"><font size="2"><span style="font-size:10pt;">
<div class="PlainText"> <br>
Another stupid method:<br>
Use idr_for_each_entry(..job->vm->ctx_mgr...) and
compare the job->fence->fence_context with
ctx->ring[].entity->fence_context. if not found,
then ctx is freed, otherwise you can do your things for
this ctx.<br>
> 1) fence_context have chance to incorrectly represent
the context behind it, because the number can be used up
and will wrapped around from beginning <br>
</div>
</span></font></blockquote>
<font size="2">No, fence_context is unique in global in kernel.<br>
<br>
Regards,<br>
David Zhou<br>
</font>
<blockquote
cite="mid:DM5PR12MB161082763FA0163FF22E1C1F84160@DM5PR12MB1610.namprd12.prod.outlook.com"
type="cite"><font size="2"><span style="font-size:10pt;">
<div class="PlainText"> 2) why not just keep CTX alive,
which is much simple than this method <br>
<br>
BR<br>
<br>
-----Original Message-----<br>
From: Zhou, David(ChunMing) <br>
Sent: Wednesday, May 03, 2017 12:54 PM<br>
To: Liu, Monk <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>;
Liu, Monk <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>;
Christian König <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:deathsimple@vodafone.de"><deathsimple@vodafone.de></a>;
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till
all job finished<br>
<br>
You can add ctx as filed of job, but not get reference of
it, when you try to use ctx, just check if ctx == NULL. <br>
<br>
Another stupid method:<br>
Use idr_for_each_entry(..job->vm->ctx_mgr...) and
compare the job->fence->fence_context with
ctx->ring[].entity->fence_context. if not found,
then ctx is freed, otherwise you can do your things for
this ctx.<br>
<br>
Regards,<br>
David Zhou<br>
<br>
-----Original Message-----<br>
From: amd-gfx [<a moz-do-not-send="true"
href="mailto:amd-gfx-bounces@lists.freedesktop.org">mailto:amd-gfx-bounces@lists.freedesktop.org</a>]
On Behalf Of Liu, Monk<br>
Sent: Wednesday, May 03, 2017 11:57 AM<br>
To: Liu, Monk <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>;
Christian König <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:deathsimple@vodafone.de"><deathsimple@vodafone.de></a>;
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till
all job finished<br>
<br>
@Christian,<br>
<br>
The thing is I need mark all entities behind this timeout
job as guilty, so I use a member in entity named
"ptr_guilty", to point to The member in amdgpu_ctx named
"guilty", that way read *entity->ptr_guilty can get you
if this entity is "invalid", and set
*entity->ptr_guilty Can let you set all entities
belongs to the context as "invalid".<br>
<br>
Above logic is to guarantee we can kick out all guilty
entities in entity kfifo, and also block all IOCTL with a
ctx handle point to this guilty context, And we only
recovery other jobs/entities/context after scheduler
unparked <br>
<br>
If you reject the patch to keep ctx alive till all jobs
signaled, please give me a solution to satisfy above logic
<br>
<br>
BR Monk<br>
<br>
-----Original Message-----<br>
From: amd-gfx [<a moz-do-not-send="true"
href="mailto:amd-gfx-bounces@lists.freedesktop.org">mailto:amd-gfx-bounces@lists.freedesktop.org</a>]
On Behalf Of Liu, Monk<br>
Sent: Wednesday, May 03, 2017 11:31 AM<br>
To: Christian König <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:deathsimple@vodafone.de"><deathsimple@vodafone.de></a>;
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till
all job finished<br>
<br>
1, This is necessary otherwise how can I access entity
pointer after a job timedout , and why it is dangerous ?<br>
2, what's the status field in the fences you were
referring to ? I need to judge if it could satisfy my
requirement <br>
<br>
<br>
<br>
-----Original Message-----<br>
From: Christian König [<a moz-do-not-send="true"
href="mailto:deathsimple@vodafone.de">mailto:deathsimple@vodafone.de</a>]<br>
Sent: Monday, May 01, 2017 10:48 PM<br>
To: Liu, Monk <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a>;
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till
all job finished<br>
<br>
Am 01.05.2017 um 09:22 schrieb Monk Liu:<br>
> for TDR guilty context feature, we need access
ctx/s_entity field <br>
> member through sched job pointer,so ctx must keep
alive till all job <br>
> from it signaled.<br>
<br>
NAK, that is unnecessary and quite dangerous.<br>
<br>
Instead we have the designed status field in the fences
which should be checked for that.<br>
<br>
Regards,<br>
Christian.<br>
<br>
><br>
> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43<br>
> Signed-off-by: Monk Liu <a moz-do-not-send="true"
class="moz-txt-link-rfc2396E"
href="mailto:Monk.Liu@amd.com"><Monk.Liu@amd.com></a><br>
> ---<br>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6
+++++-<br>
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2
+-<br>
> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9
+++++++++<br>
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9
+++++++--<br>
> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6
------<br>
> drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +<br>
> 6 files changed, 23 insertions(+), 10 deletions(-)<br>
><br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
> index e330009..8e031d6 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
> @@ -760,10 +760,12 @@ struct amdgpu_ib {<br>
> uint32_t flags;<br>
> };<br>
> <br>
> +struct amdgpu_ctx;<br>
> +<br>
> extern const struct amd_sched_backend_ops
amdgpu_sched_ops;<br>
> <br>
> int amdgpu_job_alloc(struct amdgpu_device *adev,
unsigned num_ibs,<br>
> - struct amdgpu_job **job, struct
amdgpu_vm *vm);<br>
> + struct amdgpu_job **job, struct
amdgpu_vm *vm, struct <br>
> +amdgpu_ctx *ctx);<br>
> int amdgpu_job_alloc_with_ib(struct amdgpu_device
*adev, unsigned size,<br>
> struct amdgpu_job **job);<br>
> <br>
> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {<br>
> <br>
> struct amdgpu_ctx *amdgpu_ctx_get(struct
amdgpu_fpriv *fpriv, uint32_t id);<br>
> int amdgpu_ctx_put(struct amdgpu_ctx *ctx);<br>
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct
amdgpu_ctx *ctx);<br>
> <br>
> uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx
*ctx, struct amdgpu_ring *ring,<br>
> struct fence *fence);<br>
> @@ -1129,6 +1132,7 @@ struct amdgpu_job {<br>
> struct amdgpu_sync sync;<br>
> struct amdgpu_ib *ibs;<br>
> struct fence *fence; /* the hw
fence */<br>
> + struct amdgpu_ctx *ctx;<br>
> uint32_t preamble_status;<br>
> uint32_t num_ibs;<br>
> void *owner;<br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
> index 699f5fe..267fb65 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct
amdgpu_cs_parser *p, void *data)<br>
> }<br>
> }<br>
> <br>
> - ret = amdgpu_job_alloc(p->adev, num_ibs,
&p->job, vm);<br>
> + ret = amdgpu_job_alloc(p->adev, num_ibs,
&p->job, vm, p->ctx);<br>
> if (ret)<br>
> goto free_all_kdata;<br>
> <br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c<br>
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c<br>
> index b4bbbb3..81438af 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c<br>
> @@ -25,6 +25,13 @@<br>
> #include <drm/drmP.h><br>
> #include "amdgpu.h"<br>
> <br>
> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct
amdgpu_ctx *ctx) {<br>
> + if (ctx)<br>
> + kref_get(&ctx->refcount);<br>
> + return ctx;<br>
> +}<br>
> +<br>
> static int amdgpu_ctx_init(struct amdgpu_device
*adev, struct amdgpu_ctx *ctx)<br>
> {<br>
> unsigned i, j;<br>
> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct
amdgpu_device *adev, struct amdgpu_ctx *ctx)<br>
> rq,
amdgpu_sched_jobs);<br>
> if (r)<br>
> goto failed;<br>
> +<br>
> + ctx->rings[i].entity.ptr_guilty =
&ctx->guilty; /* kernel entity <br>
> +doesn't have ptr_guilty */<br>
> }<br>
> <br>
> return 0;<br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c<br>
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c<br>
> index 690ef3d..208da11 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c<br>
> @@ -40,7 +40,7 @@ static void
amdgpu_job_timedout(struct amd_sched_job *s_job)<br>
> }<br>
> <br>
> int amdgpu_job_alloc(struct amdgpu_device *adev,
unsigned num_ibs,<br>
> - struct amdgpu_job **job, struct
amdgpu_vm *vm)<br>
> + struct amdgpu_job **job, struct
amdgpu_vm *vm, struct <br>
> +amdgpu_ctx *ctx)<br>
> {<br>
> size_t size = sizeof(struct amdgpu_job);<br>
> <br>
> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct
amdgpu_device *adev, unsigned num_ibs,<br>
> (*job)->vm = vm;<br>
> (*job)->ibs = (void *)&(*job)[1];<br>
> (*job)->num_ibs = num_ibs;<br>
> + (*job)->ctx = amdgpu_ctx_kref_get(ctx);<br>
> <br>
> amdgpu_sync_create(&(*job)->sync);<br>
> <br>
> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct
amdgpu_device *adev, unsigned size,<br>
> {<br>
> int r;<br>
> <br>
> - r = amdgpu_job_alloc(adev, 1, job, NULL);<br>
> + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);<br>
> if (r)<br>
> return r;<br>
> <br>
> @@ -94,6 +95,10 @@ void
amdgpu_job_free_resources(struct amdgpu_job *job)<br>
> static void amdgpu_job_free_cb(struct amd_sched_job
*s_job)<br>
> {<br>
> struct amdgpu_job *job = container_of(s_job,
struct amdgpu_job, <br>
> base);<br>
> + struct amdgpu_ctx *ctx = job->ctx;<br>
> +<br>
> + if (ctx)<br>
> + amdgpu_ctx_put(ctx);<br>
> <br>
> fence_put(job->fence);<br>
> amdgpu_sync_free(&job->sync);<br>
> diff --git
a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c<br>
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c<br>
> index 6f4e31f..9100ca8 100644<br>
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c<br>
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c<br>
> @@ -208,12 +208,6 @@ void
amd_sched_entity_fini(struct amd_gpu_scheduler *sched,<br>
> if (!amd_sched_entity_is_initialized(sched,
entity))<br>
> return;<br>
> <br>
> - /**<br>
> - * The client will not queue more IBs during
this fini, consume existing<br>
> - * queued IBs<br>
> - */<br>
> - wait_event(sched->job_scheduled,
amd_sched_entity_is_idle(entity));<br>
> -<br>
> amd_sched_rq_remove_entity(rq, entity);<br>
> kfifo_free(&entity->job_queue);<br>
> }<br>
> diff --git
a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h<br>
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h<br>
> index 8cb41d3..ccbbcb0 100644<br>
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h<br>
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h<br>
> @@ -49,6 +49,7 @@ struct amd_sched_entity {<br>
> <br>
> struct fence *dependency;<br>
> struct fence_cb cb;<br>
> + bool *ptr_guilty;<br>
> };<br>
> <br>
> /**<br>
<br>
<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><br>
</div>
</span></font> </blockquote>
<br>
</blockquote>
<p><br>
</p>
</body>
</html>