[PATCH 2/3] drm/amdgpu: drop the sched_sync

Liu, Monk Monk.Liu at amd.com
Sun Nov 4 00:48:13 UTC 2018


> NAK, that would result in a severe performance drop.
> We need the fence here to determine if we actually need to do the pipeline sync or not.
> E.g. the explicitly requested fence could already be signaled.

For the performance issue: only inserting a WAIT_REG_MEM on the GFX/compute ring *doesn't* give a "severe" drop (it is minimal in fact) ... At least I didn't observe any performance drop with the 3DMark benchmark (also tested Vulkan CTS). Can you tell me which game/benchmark will show a performance drop with this fix, by your understanding? Let me check it.
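To put a number on the cost: on gfx9 the pipeline sync boils down to a single WAIT_REG_MEM packet on the ring. A simplified sketch along the lines of gfx_v9_0_ring_emit_pipeline_sync() (from memory, so treat the details as an approximation):

    /* Pipeline sync = one WAIT_REG_MEM packet: stall the engine until the
     * ring's newest fence value has been written to memory. The cost is a
     * single wait packet, not a full ring drain.
     */
    static void ring_emit_pipeline_sync_sketch(struct amdgpu_ring *ring)
    {
            int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
            uint32_t seq = ring->fence_drv.sync_seq;   /* newest emitted fence seq */
            uint64_t addr = ring->fence_drv.gpu_addr;  /* fence memory address */

            /* eng_sel=usepfp, mem_space=1 (memory), opt=0 (wait), poll interval 4 */
            gfx_v9_0_wait_reg_mem(ring, usepfp, 1, 0,
                                  lower_32_bits(addr), upper_32_bits(addr),
                                  seq, 0xffffffff, 4);
    }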

The problem I hit is during the massive stress test against multi-process + quark: if the quark process hangs the engine while there are another two jobs following the bad job,
then after the TDR those two jobs lose the explicit fence and the pipeline sync is also lost (see the sketch below).
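To make the failure concrete, a minimal sketch of the consume-on-get behaviour (amdgpu_sync_get_fence() removes the fence from the container, exactly as the current amdgpu_ib_schedule() uses it):

    /* First submission: the fence is taken OUT of the sched_sync container,
     * the pipeline sync is emitted, then the reference is dropped.
     */
    tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL);  /* -> the explicit fence */
    /* ... WAIT_REG_MEM emitted ... */
    dma_fence_put(tmp);

    /* Re-submission of the same job after TDR/GPU recover: the container
     * is now empty, so the pipeline sync is silently skipped.
     */
    tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL);  /* -> NULL */

A plain bool on the job survives the re-submission, which is why need_pipe_sync avoids this.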


BTW: with the original logic, the pipeline sync has another corner case.
Assume JobC depends on JobA with the explicit flag, and JobB gets inserted in the ring between them:

JobA -> JobB -> (pipe sync) JobC

If JobA takes a long time to finish, then at the amdgpu_ib_schedule() stage you insert a pipeline sync for JobC against its explicit dependency, which is JobA,
but since JobB sits between A and C on the ring, the pipeline sync before JobC will wrongly wait on JobB as well ...

While that is not a big issue, it is obviously unnecessary: C has no relation to B (see the sketch below).
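(This follows from the WAIT_REG_MEM sketch above: the pipeline sync waits on the ring's newest fence sequence number, not on the specific dependency fence.)

    /* When JobC's pipeline sync is emitted, JobB is already on the ring,
     * so sync_seq is JobB's sequence number; the sync therefore waits for
     * JobB to drain as well, although JobC only depends on JobA.
     */
    uint32_t seq = ring->fence_drv.sync_seq;  /* JobB's seq here, not JobA's */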

/Monk



-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com> 
Sent: Sunday, November 4, 2018 3:50 AM
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync

On 03.11.18 at 06:33, Monk Liu wrote:
> Reasons to drop it:
>
> 1) simplify the code: introducing the field member "need_pipe_sync"
> on the job is good enough to tell if the explicit dependency fence
> needs to be followed by a pipeline sync.
>
> 2) after GPU recover the explicit fence from sched_sync will not come
> back, so the required pipeline sync following it is missed; consider
> the scenario below:
>> now on ring buffer:
> Job-A -> pipe_sync -> Job-B
>> TDR occurred on Job-A, and after GPU recover:
>> now on ring buffer:
> Job-A -> Job-B
>
> because the fence from sched_sync is used and freed after the first
> ib_schedule, it will never come back; with this patch this issue
> is avoided.

NAK, that would result in a severe performance drop.

We need the fence here to determine if we actually need to do the pipeline sync or not.

E.g. the explicitly requested fence could already be signaled.

Christian.

>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
>   3 files changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index c48207b3..ac7d2da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>   {
>   	struct amdgpu_device *adev = ring->adev;
>   	struct amdgpu_ib *ib = &ibs[0];
> -	struct dma_fence *tmp = NULL;
>   	bool skip_preamble, need_ctx_switch;
>   	unsigned patch_offset = ~0;
>   	struct amdgpu_vm *vm;
> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>   	}
>   
>   	need_ctx_switch = ring->current_ctx != fence_ctx;
> -	if (ring->funcs->emit_pipeline_sync && job &&
> -	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
> -	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
> -	     amdgpu_vm_need_pipeline_sync(ring, job))) {
> -		need_pipe_sync = true;
>   
> -		if (tmp)
> -			trace_amdgpu_ib_pipe_sync(job, tmp);
> -
> -		dma_fence_put(tmp);
> +	if (ring->funcs->emit_pipeline_sync && job) {
> +		if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
> +			amdgpu_vm_need_pipeline_sync(ring, job))
> +			need_pipe_sync = true;
> +		else if (job->need_pipe_sync)
> +			need_pipe_sync = true;
>   	}
>   
>   	if (ring->funcs->insert_start)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1d71f8c..dae997d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>   	(*job)->num_ibs = num_ibs;
>   
>   	amdgpu_sync_create(&(*job)->sync);
> -	amdgpu_sync_create(&(*job)->sched_sync);
>   	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>   	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>   
> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>   	amdgpu_ring_priority_put(ring, s_job->s_priority);
>   	dma_fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> -	amdgpu_sync_free(&job->sched_sync);
>   	kfree(job);
>   }
>   
> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>   
>   	dma_fence_put(job->fence);
>   	amdgpu_sync_free(&job->sync);
> -	amdgpu_sync_free(&job->sched_sync);
>   	kfree(job);
>   }
>   
> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>   	bool need_pipe_sync = false;
>   	int r;
>   
> -	fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
> -	if (fence && need_pipe_sync) {
> -		if (drm_sched_dependency_optimized(fence, s_entity)) {
> -			r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
> -					      fence, false);
> -			if (r)
> -				DRM_ERROR("Error adding fence (%d)\n", r);
> -		}
> +	if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
> +		trace_amdgpu_ib_pipe_sync(job, fence);
> +		job->need_pipe_sync = true;
>   	}
>   
>   	while (fence == NULL && vm && !job->vmid) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index e1b46a6..c1d00f0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -41,7 +41,6 @@ struct amdgpu_job {
>   	struct drm_sched_job    base;
>   	struct amdgpu_vm	*vm;
>   	struct amdgpu_sync	sync;
> -	struct amdgpu_sync	sched_sync;
>   	struct amdgpu_ib	*ibs;
>   	struct dma_fence	*fence; /* the hw fence */
>   	uint32_t		preamble_status;
> @@ -59,7 +58,7 @@ struct amdgpu_job {
>   	/* user fence handling */
>   	uint64_t		uf_addr;
>   	uint64_t		uf_sequence;
> -
> +	bool            need_pipe_sync; /* require a pipeline sync for this job */
>   };
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,


