[PATCH] drm/amdgpu: resolve deadlock between reset and cs_ioctl
Liu, Monk
Monk.Liu at amd.com
Tue Sep 12 03:47:27 UTC 2017
> The order of job->base.s_fence and preventing concurrent CS was guaranteed by the VM PD being reserved; that is no longer the case, and we need a new lock for protection.
I thought of this as well, but the UMD/app submits its jobs from a single thread per ctx (if it submits jobs from multiple threads on one ctx, that means the ordering is not important to the UMD/app anyway). So even after we unreserve the PD's resv lock,
we still only have one thread doing the cs_ioctl on the *current ctx*,
so why does that no longer work? Thanks!
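Also, just to check that I'm reading your suggestion right, is the idea roughly like the sketch below? This is only my rough understanding: the "submit_lock" name is a placeholder and not in the current code, and the existing function bodies are paraphrased.

/* Rough sketch of my understanding of the proposal; "submit_lock" is
 * just a placeholder name and the existing bodies are paraphrased.
 */
struct amdgpu_ctx {
	struct kref		refcount;
	/* ... existing members ... */
	struct mutex		submit_lock;	/* init with mutex_init() at ctx creation */
};

struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id)
{
	struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
	struct amdgpu_ctx *ctx;

	mutex_lock(&mgr->lock);
	ctx = idr_find(&mgr->ctx_handles, id);
	if (ctx)
		kref_get(&ctx->refcount);
	mutex_unlock(&mgr->lock);

	if (ctx)
		/* new: hold the per-ctx lock until amdgpu_ctx_put() */
		mutex_lock(&ctx->submit_lock);

	return ctx;
}

int amdgpu_ctx_put(struct amdgpu_ctx *ctx)
{
	if (ctx == NULL)
		return -EINVAL;

	/* new: drop the per-ctx lock before the refcount can hit zero */
	mutex_unlock(&ctx->submit_lock);
	kref_put(&ctx->refcount, amdgpu_ctx_do_release);
	return 0;
}

If so, wouldn't that just serialize every CS on the same ctx, which is what we already get from the single submitting thread per ctx?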
BR Monk
-----Original Message-----
From: Christian König [mailto:deathsimple at vodafone.de]
Sent: September 11, 2017 17:13
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: resolve deadlock between reset and cs_ioctl
On 11.09.2017 at 11:02, Monk Liu wrote:
> We need to unreserve the ttm bo before "cs_add_fence" and "entity_push_job";
> otherwise there will be a deadlock between "recover_vram_from_shadow" and
> the previous two routines on the ttm bo's resv lock.
>
> Change-Id: I9c18b677e474ce5b45a9cc24da3fa255d77b3d44
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
Yeah, I was thinking about this as well. Problem is this won't work without further changes.
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 35 ++++++++++++++++++----------------
> 1 file changed, 19 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index c1f1ae9..823ac12 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -736,26 +736,12 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> /**
> * cs_parser_fini() - clean parser states
> * @parser: parser structure holding parsing context.
> - * @error: error number
> - *
> - * If error is set than unvalidate buffer, otherwise just free memory
> - * used by parsing context.
> **/
> -static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error,
> -				  bool backoff)
> +static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser)
> {
> struct amdgpu_fpriv *fpriv = parser->filp->driver_priv;
> unsigned i;
>
> - if (!error) {
> - amdgpu_vm_move_pt_bos_in_lru(parser->adev, &fpriv->vm);
> -
> - ttm_eu_fence_buffer_objects(&parser->ticket,
> - &parser->validated,
> - parser->fence);
> - } else if (backoff) {
> - ttm_eu_backoff_reservation(&parser->ticket,
> - &parser->validated);
> - }
> fence_put(parser->fence);
>
> if (parser->ctx)
> @@ -1075,6 +1061,16 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> job->owner = p->filp;
> job->fence_ctx = entity->fence_context;
> p->fence = fence_get(&job->base.s_fence->finished);
> +
> + /* hook the sched fence to all BOs' reservations in the validated
> + * list and unreserve them.
> + *
> + * we unreserve here because otherwise there would be a deadlock
> + * between ctx_add_fence/sched_entity_push_job and the gpu_reset
> + * routine's recover_bo_from_shadow on the PD/PTEs' ttm bo lock.
> + */
> + ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
The order of job->base.s_fence and preventing concurrent CS was guaranteed by the VM PD being reserved; that is no longer the case, and we need a new lock for protection.
I suggest adding another lock to the context, which we grab in
amdgpu_ctx_get() and release in amdgpu_ctx_put().
Regards,
Christian.
> +
> r = amdgpu_ctx_add_fence(p->ctx, ring, p->fence, &seq);
> if (r) {
> fence_put(p->fence);
> @@ -1095,6 +1091,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> {
> struct amdgpu_device *adev = dev->dev_private;
> union drm_amdgpu_cs *cs = data;
> + struct amdgpu_fpriv *fpriv;
> struct amdgpu_cs_parser parser = {};
> bool reserved_buffers = false;
> int i, r;
> @@ -1104,6 +1101,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>
> parser.adev = adev;
> parser.filp = filp;
> + fpriv = filp->driver_priv;
>
> r = amdgpu_cs_parser_init(&parser, data);
> if (r) {
> @@ -1141,7 +1139,12 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> r = amdgpu_cs_submit(&parser, cs);
>
> out:
> - amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
> + if (!r)
> + amdgpu_vm_move_pt_bos_in_lru(parser.adev, &fpriv->vm);
> + else if (reserved_buffers)
> + ttm_eu_backoff_reservation(&parser.ticket, &parser.validated);
> +
> + amdgpu_cs_parser_fini(&parser);
> return r;
> }
>