[PATCH 2/3] drm/amdgpu: drop the sched_sync

Mon Nov 5 07:50:50 UTC 2018

> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 05, 2018 3:48 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Zhou,
> David(ChunMing) <David1.Zhou at amd.com>
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> 
> Am 05.11.18 um 08:24 schrieb Liu, Monk:
> >> David Zhou had a use case which saw a >10% performance drop the last
> >> time he tried it.
> > I really don't believe that, because if you insert a WAIT_MEM on an already
> > signaled fence, it only costs the GPU a couple of clocks to move on, right? No
> > reason to slow down by up to 10% ... with the 3dmark Vulkan test, the
> > performance is barely different with my patch applied ...
> 
> Why do you think that we insert a WAIT_MEM on an already signaled fence?
> The pipeline sync always waits for the last fence value (because we can't
> handle wraparounds in PM4).
> 
> So you have a pipeline sync when you don't need one and that is really really
> bad for things shared between processes, e.g. X/Wayland and its clients.
> 
> I also expect that this doesn't affect 3dmark at all, but everything which runs
> in a window which is composed by X could be slowed down massively.
> 
> David do you remember which use case was affected when you tried to drop
> this optimization?
That was a long time ago. I remember Andrey also tried to remove sched_sync before, but he eventually kept it, right?
From Monk's patch, it seems he doesn't change the main logic; he just moved the sched_sync logic to job->need_pipe_sync.
But I can at least see a small effect: e.g. the job dependency step adds the fence to sched_sync, but that fence could already be signaled by the time amdgpu_ib_schedule() runs, and then no pipeline sync needs to be inserted.
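
To make that timing difference concrete, here is a minimal toy model in plain C. It is not the driver code: struct mock_fence, struct mock_job and the mock_* helpers are hypothetical stand-ins invented for illustration, and the model assumes the fence-based variant only requests a sync while the stored dependency is still unsignaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence: it only tracks the signaled state. */
struct mock_fence {
	bool signaled;
};

/* Hypothetical job that carries both variants side by side. */
struct mock_job {
	struct mock_fence *sched_fence; /* fence variant: dependency kept until submission */
	bool need_pipe_sync;            /* flag variant: decision frozen at dependency time */
};

/* Dependency stage: remember the explicit dependency in both forms. */
static void mock_job_dependency(struct mock_job *job, struct mock_fence *dep)
{
	assert(job != NULL);
	job->sched_fence = dep;
	job->need_pipe_sync = (dep != NULL);
}

/* Submission stage, fence variant: sync only if the dependency is still pending. */
static bool mock_need_sync_fence(const struct mock_job *job)
{
	assert(job != NULL);
	return job->sched_fence != NULL && !job->sched_fence->signaled;
}

/* Submission stage, flag variant: the state of the dependency is no longer visible. */
static bool mock_need_sync_flag(const struct mock_job *job)
{
	assert(job != NULL);
	return job->need_pipe_sync;
}
```

In this model, if the dependency signals between the dependency stage and the submission stage, mock_need_sync_fence() returns false while mock_need_sync_flag() still returns true, which is the extra pipeline sync being discussed.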

Anyway, this is a sensitive path; we should change it carefully and test it widely.

Regards,
David Zhou
> 
> >> When a reset happens we flush the VMIDs when re-submitting the jobs
> >> to the rings and while doing so we also always do a pipeline sync.
> > I will check that point in my branch. I didn't use drm-next, so maybe
> > there is a gap in this part.
> 
> We have had that logic for a very long time now, but we recently simplified it.
> Could be that there was a bug introduced doing so.
> 
> Maybe we should add a specific flag to run_job() to note that we are re-
> running a job and then always add VM flushes/pipeline syncs?
> 
> But my main question is why do you see any issues with quark? That is a
> workaround for an issue for Vulkan sync handling and should only surface
> when a specific test is run many many times.
> 
> Regards,
> Christian.
> 
> >
> > /Monk
> > -----Original Message-----
> > From: Koenig, Christian
> > Sent: Monday, November 5, 2018 3:02 AM
> > To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> >
> >> Can you tell me which game/benchmark will have a performance drop with
> >> this fix, by your understanding?
> > When you sync between submissions, things like composing X windows are
> > slowed down massively.
> >
> > David Zhou had a use case which saw a >10% performance drop the last
> > time he tried it.
> >
> >> The problem I hit occurs during a massive stress test with
> >> multi-process + quark: if the quark process hangs the engine while two more
> >> jobs follow the bad job, then after the TDR these two jobs lose the explicit
> >> dependency and the pipeline sync is lost as well.
> > Well that is really strange. This workaround is only for a very specific Vulkan
> > CTS test which we are still not 100% sure is actually valid.
> >
> > When a reset happens we flush the VMIDs when re-submitting the jobs to
> > the rings and while doing so we also always do a pipeline sync.
> >
> > So you should never ever run into any issues in quark with that, even when
> > we completely disable this workaround.
> >
> > Regards,
> > Christian.
> >
> > Am 04.11.18 um 01:48 schrieb Liu, Monk:
> >>> NAK, that would result in a severe performance drop.
> >>> We need the fence here to determine if we actually need to do the
> >>> pipeline sync or not.
> >>> E.g. the explicit requested fence could already be signaled.
> >> As for the performance issue, only inserting a WAIT_REG_MEM on the
> >> GFX/compute ring *doesn't* give a "severe" drop (it's minimal in fact) ... At
> >> least I didn't observe any performance drop with the 3dmark benchmark (also
> >> tested Vulkan CTS). Can you tell me which game/benchmark will have a
> >> performance drop with this fix, by your understanding? Let me check it.
> >>
> >> The problem I hit occurs during a massive stress test with
> >> multi-process + quark: if the quark process hangs the engine while two more
> >> jobs follow the bad job, then after the TDR these two jobs lose the explicit
> >> dependency and the pipeline sync is lost as well.
> >>
> >>
> >> BTW: in the original logic, the pipeline sync has another corner case.
> >> Assume JobC depends on JobA with the explicit flag, and JobB is inserted
> >> in the ring between them:
> >>
> >> JobA -> JobB -> (pipe sync)JobC
> >>
> >> If JobA takes a long time to finish, then in the amdgpu_ib_schedule()
> >> stage you will insert a pipeline sync for JobC against its explicit
> >> dependency, which is JobA. But JobB sits between A and C, so the pipeline
> >> sync before JobC will wrongly wait on JobB as well ...
> >>
> >> This is not a big issue, but it is obviously unnecessary: C has no
> >> relation to B.
> >>
> >> /Monk
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> >> Sent: Sunday, November 4, 2018 3:50 AM
> >> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> >> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> >>
> >> Am 03.11.18 um 06:33 schrieb Monk Liu:
> >>> Reasons to drop it:
> >>>
> >>> 1) simplify the code: just introduce field member "need_pipe_sync"
> >>> for job is good enough to tell if the explicit dependency fence need
> >>> followed by a pipeline sync.
> >>>
> >>> 2) after GPU_recover the explicit fence from sched_sync will not come
> >>> back so the required pipeline_sync following it is missed, consider
> >>> scenario below:
> >>>> now on ring buffer:
> >>> Job-A -> pipe_sync -> Job-B
> >>>> TDR occurred on Job-A, and after GPU recover:
> >>>> now on ring buffer:
> >>> Job-A -> Job-B
> >>>
> >>> because the fence from sched_sync is used and freed after
> >>> ib_schedule in first time, it will never come back, with this patch
> >>> this issue could be avoided.
> >> NAK, that would result in a severe performance drop.
> >>
> >> We need the fence here to determine if we actually need to do the
> >> pipeline sync or not.
> >>
> >> E.g. the explicit requested fence could already be signaled.
> >>
> >> Christian.
> >>
> >>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> >>> ---
> >>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
> >>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
> >>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
> >>>     3 files changed, 10 insertions(+), 23 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> index c48207b3..ac7d2da 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
> >>>     {
> >>>     	struct amdgpu_device *adev = ring->adev;
> >>>     	struct amdgpu_ib *ib = &ibs[0];
> >>> -	struct dma_fence *tmp = NULL;
> >>>     	bool skip_preamble, need_ctx_switch;
> >>>     	unsigned patch_offset = ~0;
> >>>     	struct amdgpu_vm *vm;
> >>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
> >>>     	}
> >>>
> >>>     	need_ctx_switch = ring->current_ctx != fence_ctx;
> >>> -	if (ring->funcs->emit_pipeline_sync && job &&
> >>> -	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
> >>> -	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
> >>> -	     amdgpu_vm_need_pipeline_sync(ring, job))) {
> >>> -		need_pipe_sync = true;
> >>>
> >>> -		if (tmp)
> >>> -			trace_amdgpu_ib_pipe_sync(job, tmp);
> >>> -
> >>> -		dma_fence_put(tmp);
> >>> +	if (ring->funcs->emit_pipeline_sync && job) {
> >>> +		if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
> >>> +			amdgpu_vm_need_pipeline_sync(ring, job))
> >>> +			need_pipe_sync = true;
> >>> +		else if (job->need_pipe_sync)
> >>> +			need_pipe_sync = true;
> >>>     	}
> >>>
> >>>     	if (ring->funcs->insert_start)
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 1d71f8c..dae997d 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> >>>     	(*job)->num_ibs = num_ibs;
> >>>
> >>>     	amdgpu_sync_create(&(*job)->sync);
> >>> -	amdgpu_sync_create(&(*job)->sched_sync);
> >>>     	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> >>>     	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> >>>
> >>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> >>>     	amdgpu_ring_priority_put(ring, s_job->s_priority);
> >>>     	dma_fence_put(job->fence);
> >>>     	amdgpu_sync_free(&job->sync);
> >>> -	amdgpu_sync_free(&job->sched_sync);
> >>>     	kfree(job);
> >>>     }
> >>>
> >>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
> >>>
> >>>     	dma_fence_put(job->fence);
> >>>     	amdgpu_sync_free(&job->sync);
> >>> -	amdgpu_sync_free(&job->sched_sync);
> >>>     	kfree(job);
> >>>     }
> >>>
> >>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> >>>     	bool need_pipe_sync = false;
> >>>     	int r;
> >>>
> >>> -	fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
> >>> -	if (fence && need_pipe_sync) {
> >>> -		if (drm_sched_dependency_optimized(fence, s_entity)) {
> >>> -			r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
> >>> -					      fence, false);
> >>> -			if (r)
> >>> -				DRM_ERROR("Error adding fence (%d)\n", r);
> >>> -		}
> >>> +	if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
> >>> +		trace_amdgpu_ib_pipe_sync(job, fence);
> >>> +		job->need_pipe_sync = true;
> >>>     	}
> >>>
> >>>     	while (fence == NULL && vm && !job->vmid) {
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> index e1b46a6..c1d00f0 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> @@ -41,7 +41,6 @@ struct amdgpu_job {
> >>>     	struct drm_sched_job    base;
> >>>     	struct amdgpu_vm	*vm;
> >>>     	struct amdgpu_sync	sync;
> >>> -	struct amdgpu_sync	sched_sync;
> >>>     	struct amdgpu_ib	*ibs;
> >>>     	struct dma_fence	*fence; /* the hw fence */
> >>>     	uint32_t		preamble_status;
> >>> @@ -59,7 +58,7 @@ struct amdgpu_job {
> >>>     	/* user fence handling */
> >>>     	uint64_t		uf_addr;
> >>>     	uint64_t		uf_sequence;
> >>> -
> >>> +	bool            need_pipe_sync; /* require a pipeline sync for this job */
> >>>     };
> >>>
> >>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
> >>> num_ibs,