[PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
zhoucm1
david1.zhou at amd.com
Wed May 3 09:29:41 UTC 2017
Reading this thread, it occurs to me that job->vm isn't safe either: the VM
could be freed while amdgpu_ib_schedule() is still using it. Do you have any
thoughts on how to solve that?
Regards,
David Zhou
On 2017-05-03 17:18, Christian König wrote:
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after
>>> the app closes our FD. I don't see that amdgpu_vm_fini() depends on
>>> whether the context is alive or not ...
>
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>> amdgpu_uvd_free_handles(adev, file_priv);
>> amdgpu_vce_free_handles(adev, file_priv);
>>
>> amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>> if (amdgpu_sriov_vf(adev)) {
>> /* TODO: how to handle reserve failure */
>> BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>> amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>> fpriv->vm.csa_bo_va = NULL;
>> amdgpu_bo_unreserve(adev->virt.csa_obj);
>> }
>>
>> amdgpu_vm_fini(adev, &fpriv->vm);
>
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all
> contexts of the current fd.
>
> If we don't release the context here because some jobs are still
> executing, we need to keep the UVD and VCE handles, the PRT VAs, the CSA,
> and even the whole VM structure alive.
>
>> I'll see if dma_fence can solve my issue, but I wish you could give
>> me your detailed idea
> Please take a look at David's idea of using the fence_context to find
> which jobs and entities to skip; that is even better than my idea about
> the fence status, and it should be trivial to implement because all the
> data is already present, we just need to use it.
>
> Regards,
> Christian.
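David's fence_context idea can be sketched as a plain userspace C model (hypothetical; `struct model_job`, `recover_rings`, and all the names below are made up for illustration and are not the actual scheduler code): each entity's fences already share one unique fence context number, so after a reset the recovery path can walk the pending jobs and skip every job whose finished fence carries the guilty entity's context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler job: all fences emitted by one
 * entity share a single unique fence context number. */
struct model_job {
	unsigned long fence_context; /* context of the job's finished fence */
	bool resubmitted;
};

/* After a GPU reset, resubmit everything except jobs from the guilty
 * entity, identified purely by its fence context number. */
static size_t recover_rings(struct model_job *jobs, size_t n,
			    unsigned long guilty_context)
{
	size_t resubmitted = 0;

	for (size_t i = 0; i < n; i++) {
		if (jobs[i].fence_context == guilty_context)
			continue; /* skip: job came from the guilty entity */
		jobs[i].resubmitted = true;
		resubmitted++;
	}
	return resubmitted;
}
```

The appeal of this approach in the thread is that the context number is already stored in every fence, so no extra back-pointer from the job to the ctx is needed.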
>
> On 03.05.2017 at 11:08, Liu, Monk wrote:
>> 1, My idea is that userspace should rather gather the feedback during
>> the next command submission. This has the advantage that you don't
>> need to keep userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw rings (cherry-pick out the guilty entities'
>>> jobs in all rings) after a gpu reset, we need to fake-signal all
>>> sched_fences in the guilty entity as well, and we need to mark the
>>> context as guilty so the next IOCTL on it will return -ENODEV.
>>> I don't understand how your idea can solve my request ...
>> 2, You need to keep quite a bunch of stuff alive (VM, CSA) when you
>> don't tear down the ctx immediately.
>>
>>> I'm afraid not: the CSA is gone with the VM, and the VM is gone after
>>> the app closes our FD. I don't see that amdgpu_vm_fini() depends on
>>> whether the context is alive or not ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and
>> a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into the exact same problem.
>>
>>> I'll see if dma_fence can solve my issue, but I wish you could give
>>> me your detailed idea
>>
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple at vodafone.de]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>>> 1, This is necessary, otherwise how can I access the entity pointer
>>> after a job has timed out
>> No, that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the
>> feedback/status from the job execution back to userspace when an error
>> (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during
>> the next command submission. This has the advantage that you don't
>> need to keep userspace alive till all jobs are done.
>>
>>> , and why is it dangerous?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you
>> don't tear down the ctx immediately.
>>
>> We could split ctx teardown into freeing the resources and freeing
>> the structure, but I think just gathering the needed information
>> during CS is easier to do.
>>
>>> 2, what's the status field in the fences you were referring to? I
>>> need to judge whether it satisfies my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a
>> status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into the exact same problem.
>>
>> Regards,
>> Christian.
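The dma_fence status mechanism Christian points to can be modeled in plain userspace C (a hypothetical sketch; `struct model_fence` and both helpers are invented names, loosely in the spirit of struct dma_fence's error/status handling on newer kernels): the driver records an error on the fence when it fake-signals a guilty job, and userspace discovers the failure the next time it submits, rather than being actively notified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a fence carrying a status, in the spirit of
 * struct dma_fence with its error field on newer kernels. */
struct model_fence {
	bool signaled;
	int error; /* 0 on success, negative errno if the job was killed */
};

/* Driver side: fake-signal a guilty job's fence with an error code,
 * so dependent waiters make progress but the failure is recorded. */
static void fence_signal_with_error(struct model_fence *f, int error)
{
	f->error = error;
	f->signaled = true;
}

/* Userspace side, on the next command submission: pull the status of
 * an earlier fence instead of being pushed a notification. */
static int check_previous_submission(const struct model_fence *f)
{
	if (!f->signaled)
		return -EBUSY; /* previous job still running */
	return f->error;       /* 0, or e.g. -ENODEV for a guilty ctx */
}
```

This is the "gather feedback on the next CS" design Christian argues for: the ctx does not have to outlive the fd, because the error travels with the fence.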
>>
>> On 03.05.2017 at 05:30, Liu, Monk wrote:
>>> 1, This is necessary, otherwise how can I access the entity pointer
>>> after a job has timed out, and why is it dangerous?
>>> 2, what's the status field in the fences you were referring to? I
>>> need to judge whether it satisfies my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple at vodafone.de]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job
>>> finished
>>>
>>> On 01.05.2017 at 09:22, Monk Liu wrote:
>>>> For the TDR guilty-context feature, we need to access ctx/s_entity
>>>> fields through the sched job pointer, so the ctx must be kept alive
>>>> until all jobs from it have signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead, we have the status field designed into the fences, which
>>> should be checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++++-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9 +++++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++++--
>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>> 6 files changed, 23 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index e330009..8e031d6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>> uint32_t flags;
>>>> };
>>>> +struct amdgpu_ctx;
>>>> +
>>>> extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>>>> num_ibs,
>>>> - struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>> +amdgpu_ctx *ctx);
>>>> int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>>> unsigned size,
>>>> struct amdgpu_job **job);
>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>> struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv
>>>> *fpriv, uint32_t id);
>>>> int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>> uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx,
>>>> struct amdgpu_ring *ring,
>>>> struct fence *fence);
>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>> struct amdgpu_sync sync;
>>>> struct amdgpu_ib *ibs;
>>>> struct fence *fence; /* the hw fence */
>>>> + struct amdgpu_ctx *ctx;
>>>> uint32_t preamble_status;
>>>> uint32_t num_ibs;
>>>> void *owner;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 699f5fe..267fb65 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct
>>>> amdgpu_cs_parser *p, void *data)
>>>> }
>>>> }
>>>> - ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>> + ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>> if (ret)
>>>> goto free_all_kdata;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index b4bbbb3..81438af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -25,6 +25,13 @@
>>>> #include <drm/drmP.h>
>>>> #include "amdgpu.h"
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>> + if (ctx)
>>>> + kref_get(&ctx->refcount);
>>>> + return ctx;
>>>> +}
>>>> +
>>>> static int amdgpu_ctx_init(struct amdgpu_device *adev, struct
>>>> amdgpu_ctx *ctx)
>>>> {
>>>> unsigned i, j;
>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device
>>>> *adev, struct amdgpu_ctx *ctx)
>>>> rq, amdgpu_sched_jobs);
>>>> if (r)
>>>> goto failed;
>>>> +
>>>> + ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel
>>>> entity
>>>> +doesn't have ptr_guilty */
>>>> }
>>>> return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..208da11 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct
>>>> amd_sched_job *s_job)
>>>> }
>>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>>>> num_ibs,
>>>> - struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct
>>>> +amdgpu_ctx *ctx)
>>>> {
>>>> size_t size = sizeof(struct amdgpu_job);
>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device
>>>> *adev, unsigned num_ibs,
>>>> (*job)->vm = vm;
>>>> (*job)->ibs = (void *)&(*job)[1];
>>>> (*job)->num_ibs = num_ibs;
>>>> + (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>> amdgpu_sync_create(&(*job)->sync);
>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct
>>>> amdgpu_device *adev, unsigned size,
>>>> {
>>>> int r;
>>>> - r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>> + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>> if (r)
>>>> return r;
>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct
>>>> amdgpu_job *job)
>>>> static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>> {
>>>> struct amdgpu_job *job = container_of(s_job, struct
>>>> amdgpu_job,
>>>> base);
>>>> + struct amdgpu_ctx *ctx = job->ctx;
>>>> +
>>>> + if (ctx)
>>>> + amdgpu_ctx_put(ctx);
>>>> fence_put(job->fence);
>>>> amdgpu_sync_free(&job->sync);
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 6f4e31f..9100ca8 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct
>>>> amd_gpu_scheduler *sched,
>>>> if (!amd_sched_entity_is_initialized(sched, entity))
>>>> return;
>>>> - /**
>>>> - * The client will not queue more IBs during this fini,
>>>> consume existing
>>>> - * queued IBs
>>>> - */
>>>> - wait_event(sched->job_scheduled,
>>>> amd_sched_entity_is_idle(entity));
>>>> -
>>>> amd_sched_rq_remove_entity(rq, entity);
>>>> kfifo_free(&entity->job_queue);
>>>> }
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> index 8cb41d3..ccbbcb0 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>> struct fence *dependency;
>>>> struct fence_cb cb;
>>>> + bool *ptr_guilty;
>>>> };
>>>> /**
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
>