[PATCH 2/3] drm/amdgpu: drop the sched_sync

Mon Nov 5 14:21:55 UTC 2018

> Anyway I think the cleanest approach to always handle that correctly would be to always insert a vm flush before all jobs on resubmission. 
That is most likely better for VM flush handling as well.

Yeah, that's true, and simpler.

/Monk
-----Original Message-----
From: Koenig, Christian 
Sent: Monday, November 5, 2018 9:59 PM
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Zhou, David(ChunMing) <David1.Zhou at amd.com>
Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync

> and later its VMID's "current_gpu_reset_count" is updated to "adev->gpu_reset_count"
The question is how much later that is done. My recollection is that we don't reset that for resubmission, but that could be wrong.

Anyway I think the cleanest approach to always handle that correctly would be to always insert a vm flush before all jobs on resubmission. 
That is most likely better for VM flush handling as well.
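
A minimal sketch of that idea, in plain C with hypothetical names (`sketch_job`, `sketch_emit`, `sketch_plan_emit` and the `resubmit` flag are assumptions for illustration, not the actual amdgpu code). It only models the decision: on resubmission the VM flush is unconditional, and a VM flush carries a pipeline sync with it.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the real amdgpu structures. */
struct sketch_job {
	bool vm_needs_flush;  /* page tables changed since the last flush */
	bool need_pipe_sync;  /* an explicit dependency asked for a sync */
};

struct sketch_emit {
	bool vm_flush;
	bool pipe_sync;
};

/*
 * Decide what to emit in front of a job. "resubmit" is an assumed flag
 * that the caller sets while replaying jobs after a GPU reset.
 */
struct sketch_emit sketch_plan_emit(const struct sketch_job *job, bool resubmit)
{
	struct sketch_emit e;

	/* On resubmission the flush is unconditional. */
	e.vm_flush = resubmit || job->vm_needs_flush;
	/* In this sketch a VM flush always brings a pipeline sync along. */
	e.pipe_sync = e.vm_flush || job->need_pipe_sync;
	return e;
}
```

With `resubmit` set, every replayed job gets both the flush and the sync, so no per-job state has to survive the reset.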

Christian.

Am 05.11.18 um 14:41 schrieb Liu, Monk:
> Hi Christian
>
> For scenario: Bad Job (hang, vmid1) -->Job A (context 10, explicit dep 
> for Job B, vmid2) --> Job B(context 10, vmid2) --> Job C (context 11, 
> vmid3)
>
> Assume "job_hang_limit" is 0 and "sched_hw_submission" is 4. I gave the logic after GPU reset a second thought:
>
> 1) the bad Job would be set guilty and skipped by scheduler,
> 2) the first re-submitted job (Job A) would be forced with a 
> pipeline-sync,
> 3) the first re-submitted job (Job A) would be forced with a vm-flush, and later its VMID's "current_gpu_reset_count" is updated to "adev->gpu_reset_count"
> 4) the second re-submitted job (Job B, assumed to come from the same context as Job A, sharing the same page table/process, with no vm_update needed) would be forced with neither a pipeline-sync nor a vm-flush ...
>   
> Thus for Job B if it has an explicit dep on Job A, this explicit dep would get lost and there will be no pipeline sync inserted prior to Job B ...
>
> Do you think that's a possible corner case ?
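
The four steps above can be modelled roughly as follows. This is a toy model in plain C, assuming the per-VMID counter behaves exactly as described in step 3; `sketch_dev` and `sketch_submit` are hypothetical names, not driver code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_VMIDS 4

/* Hypothetical model: one reset counter per VMID plus a device counter. */
struct sketch_dev {
	uint32_t gpu_reset_count;
	uint32_t vmid_reset_count[SKETCH_NUM_VMIDS];
};

/*
 * Returns true when a job using "vmid" gets a forced VM flush (and with
 * it a pipeline sync), then records that this VMID is up to date again.
 */
bool sketch_submit(struct sketch_dev *dev, unsigned int vmid)
{
	bool forced = dev->vmid_reset_count[vmid] != dev->gpu_reset_count;

	dev->vmid_reset_count[vmid] = dev->gpu_reset_count;
	return forced;
}
```

After one reset (incrementing `gpu_reset_count`), submitting Job A on vmid2 returns true, Job B on the same vmid2 returns false, and Job C on vmid3 returns true. In this model Job B is the only job that goes out without a forced sync.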
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 5, 2018 3:48 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Zhou, 
> David(ChunMing) <David1.Zhou at amd.com>
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>
> Am 05.11.18 um 08:24 schrieb Liu, Monk:
>>> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
>> I really don't believe that, because if you insert a WAIT_MEM on an already signaled fence, it only costs the GPU a couple of clocks to move on, right? There is no reason for a slowdown of up to 10% ... With the 3dmark Vulkan test, the performance is barely different with my patch applied ...
> Why do you think that we insert a WAIT_MEM on an already signaled fence?
> The pipeline sync always waits for the last fence value (because we can't handle wraparounds in PM4).
>
> So you have a pipeline sync when you don't need one, and that is really, really bad for things shared between processes, e.g. X/Wayland and its clients.
>
> I also expect that this doesn't affect 3dmark at all, but everything which runs in a window which is composed by X could be slowed down massively.
>
> David do you remember which use case was affected when you tried to drop this optimization?
>
>>> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync.
>> I will check that point in my branch, I didn't use drm-next, maybe 
>> there is gap in this part
> We have had that logic for a very long time now, but we recently simplified it. It could be that a bug was introduced while doing so.
>
> Maybe we should add a specific flag to run_job() to note that we are re-running a job and then always add VM flushes/pipeline syncs?
>
> But my main question is why do you see any issues with quark? That is a workaround for an issue for Vulkan sync handling and should only surface when a specific test is run many many times.
>
> Regards,
> Christian.
>
>> /Monk
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Monday, November 5, 2018 3:02 AM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>>
>>> Can you tell me which game/benchmark will have a performance drop with this fix, by your understanding?
>> When you sync between submissions, things like composing X windows are slowed down massively.
>>
>> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
>>
>>> The problem I hit occurs during a massive stress test with 
>>> multi-process + quark: if the quark process hangs the engine while another two jobs follow the bad job, then after the TDR these two jobs lose the explicit dependency, and the pipeline sync is lost as well.
>> Well that is really strange. This workaround is only for a very specific Vulkan CTS test which we are still not 100% sure is actually valid.
>>
>> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync.
>>
>> So you should never ever run into any issues in quark with that, even when we completely disable this workaround.
>>
>> Regards,
>> Christian.
>>
>> Am 04.11.18 um 01:48 schrieb Liu, Monk:
>>>> NAK, that would result in a severe performance drop.
>>>> We need the fence here to determine if we actually need to do the pipeline sync or not.
>>>> E.g. the explicit requested fence could already be signaled.
>>> As for the performance issue, only inserting a WAIT_REG_MEM on the GFX/compute ring *doesn't* give a "severe" drop (it's minimal in fact) ... At least I didn't observe any performance drop with the 3dmark benchmark (also tested Vulkan CTS). Can you tell me which game/benchmark will have a performance drop with this fix, by your understanding? Let me check it.
>>>
>>> The problem I hit occurs during a massive stress test with 
>>> multi-process + quark: if the quark process hangs the engine while another two jobs follow the bad job, then after the TDR these two jobs lose the explicit dependency, and the pipeline sync is lost as well.
>>>
>>>
>>> BTW: the original logic for the pipeline sync has another corner case.
>>> Assume JobC depends on JobA with the explicit flag, and JobB is inserted in the ring between them:
>>>
>>> jobA -> jobB -> (pipe sync)JobC
>>>
>>> If JobA really takes a long time to finish, then in the
>>> amdgpu_ib_schedule() stage you will insert a pipeline sync for JobC against its explicit dependency, which is JobA. But there is a JobB between A and C, and the pipeline sync before JobC will wrongly wait on JobB ...
>>>
>>> It is not a big issue, but it is obviously unnecessary: C has no 
>>> relation to B.
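
A small model of this corner case, in plain C. It assumes, as stated elsewhere in this thread, that a pipeline sync always waits for the last fence value emitted on the ring; `sketch_ring`, `sketch_emit_job` and `sketch_pipe_sync_target` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical ring model: each emitted job gets the next sequence number. */
struct sketch_ring {
	uint32_t last_emitted_seq;
};

/* Emit a job and return the sequence number of its fence. */
uint32_t sketch_emit_job(struct sketch_ring *ring)
{
	return ++ring->last_emitted_seq;
}

/*
 * What a pipeline sync waits for in this model: always the most recently
 * emitted sequence number, no matter which job the dependency named.
 */
uint32_t sketch_pipe_sync_target(const struct sketch_ring *ring,
				 uint32_t dep_seq)
{
	(void)dep_seq; /* the specific dependency cannot be targeted */
	return ring->last_emitted_seq;
}
```

Emitting JobA and then JobB, and asking for the sync target of JobC's dependency on JobA, yields JobB's sequence number: C ends up waiting on B as well as A.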
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>> Sent: Sunday, November 4, 2018 3:50 AM
>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>>>
>>> Am 03.11.18 um 06:33 schrieb Monk Liu:
>>>> Reasons to drop it:
>>>>
>>>> 1) simplify the code: just introducing the field member "need_pipe_sync"
>>>> for the job is good enough to tell if the explicit dependency fence 
>>>> needs to be followed by a pipeline sync.
>>>>
>>>> 2) after GPU_recover the explicit fence from sched_sync will not 
>>>> come back, so the required pipeline_sync following it is missed; 
>>>> consider the scenario below:
>>>>> now on ring buffer:
>>>> Job-A -> pipe_sync -> Job-B
>>>>> TDR occurred on Job-A, and after GPU recover:
>>>>> now on ring buffer:
>>>> Job-A -> Job-B
>>>>
>>>> because the fence from sched_sync is used and freed after 
>>>> ib_schedule the first time, it will never come back; with this patch 
>>>> this issue can be avoided.
>>> NAK, that would result in a severe performance drop.
>>>
>>> We need the fence here to determine if we actually need to do the pipeline sync or not.
>>>
>>> E.g. the explicit requested fence could already be signaled.
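
The trade-off between the two approaches can be sketched like this. It is a simplified model in plain C; `sketch_fence`, `sketch_job`, `sketch_sync_from_fence` and `sketch_sync_from_flag` are hypothetical stand-ins, and the real sync container holds more than one fence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence: only tracks whether it has signaled. */
struct sketch_fence {
	bool signaled;
};

/* Hypothetical job carrying both variants side by side for comparison. */
struct sketch_job {
	struct sketch_fence *sched_sync; /* one-shot: consumed at schedule time */
	bool need_pipe_sync;             /* sticky flag, as the patch proposes */
};

/* Fence-based check: take the fence out, sync only if it is still pending. */
bool sketch_sync_from_fence(struct sketch_job *job)
{
	struct sketch_fence *f = job->sched_sync;

	job->sched_sync = NULL; /* gone after the first schedule */
	return f != NULL && !f->signaled;
}

/* Flag-based check: survives re-scheduling, but ignores the fence state. */
bool sketch_sync_from_flag(const struct sketch_job *job)
{
	return job->need_pipe_sync;
}
```

Calling `sketch_sync_from_fence()` twice on the same job shows the problem from the commit message: the second call, standing in for the resubmission after a reset, returns false. The flag variant keeps returning true, but it also does so when the fence has already signaled, which is the unnecessary sync the NAK is about.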
>>>
>>> Christian.
>>>
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
>>>>      3 files changed, 10 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> index c48207b3..ac7d2da 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>>>>      {
>>>>      	struct amdgpu_device *adev = ring->adev;
>>>>      	struct amdgpu_ib *ib = &ibs[0];
>>>> -	struct dma_fence *tmp = NULL;
>>>>      	bool skip_preamble, need_ctx_switch;
>>>>      	unsigned patch_offset = ~0;
>>>>      	struct amdgpu_vm *vm;
>>>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>>>>      	}
>>>>      
>>>>      	need_ctx_switch = ring->current_ctx != fence_ctx;
>>>> -	if (ring->funcs->emit_pipeline_sync && job &&
>>>> -	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
>>>> -	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
>>>> -	     amdgpu_vm_need_pipeline_sync(ring, job))) {
>>>> -		need_pipe_sync = true;
>>>>      
>>>> -		if (tmp)
>>>> -			trace_amdgpu_ib_pipe_sync(job, tmp);
>>>> -
>>>> -		dma_fence_put(tmp);
>>>> +	if (ring->funcs->emit_pipeline_sync && job) {
>>>> +		if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
>>>> +			amdgpu_vm_need_pipeline_sync(ring, job))
>>>> +			need_pipe_sync = true;
>>>> +		else if (job->need_pipe_sync)
>>>> +			need_pipe_sync = true;
>>>>      	}
>>>>      
>>>>      	if (ring->funcs->insert_start)
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 1d71f8c..dae997d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>      	(*job)->num_ibs = num_ibs;
>>>>      
>>>>      	amdgpu_sync_create(&(*job)->sync);
>>>> -	amdgpu_sync_create(&(*job)->sched_sync);
>>>>      	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>>>      	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>>>>      
>>>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>>>      	amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>      	dma_fence_put(job->fence);
>>>>      	amdgpu_sync_free(&job->sync);
>>>> -	amdgpu_sync_free(&job->sched_sync);
>>>>      	kfree(job);
>>>>      }
>>>>      
>>>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>      
>>>>      	dma_fence_put(job->fence);
>>>>      	amdgpu_sync_free(&job->sync);
>>>> -	amdgpu_sync_free(&job->sched_sync);
>>>>      	kfree(job);
>>>>      }
>>>>      
>>>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>>>>      	bool need_pipe_sync = false;
>>>>      	int r;
>>>>      
>>>> -	fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
>>>> -	if (fence && need_pipe_sync) {
>>>> -		if (drm_sched_dependency_optimized(fence, s_entity)) {
>>>> -			r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
>>>> -					      fence, false);
>>>> -			if (r)
>>>> -				DRM_ERROR("Error adding fence (%d)\n", r);
>>>> -		}
>>>> +	if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
>>>> +		trace_amdgpu_ib_pipe_sync(job, fence);
>>>> +		job->need_pipe_sync = true;
>>>>      	}
>>>>      
>>>>      	while (fence == NULL && vm && !job->vmid) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> index e1b46a6..c1d00f0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> @@ -41,7 +41,6 @@ struct amdgpu_job {
>>>>      	struct drm_sched_job    base;
>>>>      	struct amdgpu_vm	*vm;
>>>>      	struct amdgpu_sync	sync;
>>>> -	struct amdgpu_sync	sched_sync;
>>>>      	struct amdgpu_ib	*ibs;
>>>>      	struct dma_fence	*fence; /* the hw fence */
>>>>      	uint32_t		preamble_status;
>>>> @@ -59,7 +58,7 @@ struct amdgpu_job {
>>>>      	/* user fence handling */
>>>>      	uint64_t		uf_addr;
>>>>      	uint64_t		uf_sequence;
>>>> -
>>>> +	bool            need_pipe_sync; /* require a pipeline sync for this job */
>>>>      };
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,