[PATCH 2/3] drm/amdgpu: drop the sched_sync
Liu, Monk
Monk.Liu at amd.com
Mon Nov 5 07:24:07 UTC 2018
> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
I really doubt that: if you insert a WAIT_REG_MEM on an already signaled fence, it costs the GPU only a couple of clocks to move on, right? There is no reason for a slowdown of up to 10%. With the Vulkan version of the 3DMark test, performance is barely different with my patch applied.
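To make the two positions concrete, here is a minimal sketch of the decision being argued about. It is a simplified model, not the kernel code: `struct mock_fence`, `struct mock_job` and the two `need_sync_*` helpers are hypothetical names, and the fence-based variant only encodes Christian's statement that an already signaled explicit fence does not need a pipeline sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dma_fence: only the signaled state. */
struct mock_fence {
	bool signaled;
};

/* Hypothetical stand-in for struct amdgpu_job, reduced to what matters here. */
struct mock_job {
	struct mock_fence *sched_sync; /* old scheme: saved explicit fence, may be NULL */
	bool need_pipe_sync;           /* new scheme: flag set when the dependency is seen */
};

/* Fence-based decision: sync only while the explicit fence is still pending. */
bool need_sync_fence_based(const struct mock_job *job)
{
	return job->sched_sync != NULL && !job->sched_sync->signaled;
}

/* Flag-based decision: sync whenever the flag was set, signaled or not. */
bool need_sync_flag_based(const struct mock_job *job)
{
	return job->need_pipe_sync;
}

/* Walks through the one case where the two schemes disagree. */
void pipe_sync_decision_demo(void)
{
	struct mock_fence dep = { .signaled = true };
	struct mock_job job = { .sched_sync = &dep, .need_pipe_sync = true };

	/* Dependency already signaled: only the flag-based scheme still syncs. */
	assert(!need_sync_fence_based(&job));
	assert(need_sync_flag_based(&job));

	/* Dependency still pending: both schemes ask for the sync. */
	dep.signaled = false;
	assert(need_sync_fence_based(&job));
	assert(need_sync_flag_based(&job));
}
```

The schemes differ only when the dependency has already signaled. Whether the extra sync in that case costs a couple of clocks or a measurable percentage is exactly the open question in this thread, and the model says nothing about it.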
> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync.
I will check that point in my branch. I am not using drm-next, so there may be a gap in this part.
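For reference, the resubmission scenario from the commit message can be modeled as below. This is a sketch under stated assumptions: `struct tdr_job` and the `schedule_*` helpers are hypothetical, and the model deliberately ignores the pipeline sync that, according to Christian, the reset path emits anyway when the VMIDs are flushed. It only shows the dependency-driven sync.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical job model: one field per scheme under discussion. */
struct tdr_job {
	bool sched_sync_pending; /* old scheme: an unconsumed fence sits in sched_sync */
	bool need_pipe_sync;     /* new scheme: flag that stays set on the job */
};

/* Old scheme: taking the fence out of sched_sync consumes it. */
bool schedule_fence_based(struct tdr_job *job)
{
	bool emit = job->sched_sync_pending;

	job->sched_sync_pending = false; /* fence is used and freed */
	return emit;
}

/* New scheme: the flag is only read, so it survives a resubmission. */
bool schedule_flag_based(const struct tdr_job *job)
{
	return job->need_pipe_sync;
}

/* Schedules the same job twice, as a resubmit after a reset would. */
void tdr_resubmit_demo(void)
{
	struct tdr_job job = { .sched_sync_pending = true, .need_pipe_sync = true };
	bool first_old, second_old;

	/* First submission: both schemes emit the pipeline sync. */
	first_old = schedule_fence_based(&job);
	assert(first_old);
	assert(schedule_flag_based(&job));

	/* Resubmission after the reset: the fence is gone, the flag is not. */
	second_old = schedule_fence_based(&job);
	assert(!second_old);
	assert(schedule_flag_based(&job));
}
```

If the reset path on drm-next always emits its own pipeline sync, the second case is harmless there, which is the gap I need to check against my branch.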
/Monk
-----Original Message-----
From: Koenig, Christian
Sent: Monday, November 5, 2018 3:02 AM
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> Can you tell me which game/benchmark you expect to show a performance drop with this fix?
When you sync between submissions, things like compositing X windows are slowed down massively.
David Zhou had a use case which saw a >10% performance drop the last time he tried it.
> The problem I hit occurs during a massive stress test with
> multi-process + quark: if the quark process hangs the engine while two more jobs follow the bad job, then after the TDR these two jobs lose their explicit dependency fence, and the pipeline sync is lost as well.
Well that is really strange. This workaround is only for a very specific Vulkan CTS test which we are still not 100% sure is actually valid.
When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync.
So you should never ever run into any issues in quark with that, even when we completely disable this workaround.
Regards,
Christian.
On 04.11.18 at 01:48, Liu, Monk wrote:
>> NAK, that would result in a severe performance drop.
>> We need the fence here to determine if we actually need to do the pipeline sync or not.
>> E.g. the explicitly requested fence could already be signaled.
> As for the performance issue, only inserting a WAIT_REG_MEM on the GFX/compute ring *doesn't* cause a "severe" drop (it's minimal in fact). At least I didn't observe any performance drop with the 3DMark benchmark (I also tested the Vulkan CTS). Can you tell me which game/benchmark you expect to show a performance drop with this fix? Let me check it.
>
> The problem I hit occurs during a massive stress test with
> multi-process + quark: if the quark process hangs the engine while two more jobs follow the bad job, then after the TDR these two jobs lose their explicit dependency fence, and the pipeline sync is lost as well.
>
>
> BTW: with the original logic, the pipeline sync has another corner case.
> Assume JobC depends on JobA with the explicit flag, and JobB is inserted in the ring between them:
>
> jobA -> jobB -> (pipe sync)JobC
>
> If JobA takes a long time to finish, then in the
> amdgpu_ib_schedule() stage you insert a pipeline sync for JobC because of its explicit dependency on JobA. But JobB sits between A and C, so the pipeline sync before JobC will also wrongly wait on JobB.
>
> This is not a big issue, but it is obviously unnecessary: C has no
> relation to B.
>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Sunday, November 4, 2018 3:50 AM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>
> On 03.11.18 at 06:33, Monk Liu wrote:
>> Reasons to drop it:
>>
>> 1) Simplify the code: a new "need_pipe_sync" field in the job is
>> enough to tell whether the explicit dependency fence needs to be
>> followed by a pipeline sync.
>>
>> 2) After GPU recovery the explicit fence from sched_sync does not come
>> back, so the required pipeline sync following it is missed. Consider
>> the scenario below:
>>> now on ring buffer:
>> Job-A -> pipe_sync -> Job-B
>>> TDR occurred on Job-A, and after GPU recovery:
>>> now on ring buffer:
>> Job-A -> Job-B
>>
>> Because the fence from sched_sync is used and freed after the first
>> ib_schedule, it never comes back. With this patch this issue
>> can be avoided.
> NAK, that would result in a severe performance drop.
>
> We need the fence here to determine if we actually need to do the pipeline sync or not.
>
> E.g. the explicitly requested fence could already be signaled.
>
> Christian.
>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 16 ++++++----------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 3 +--
>> 3 files changed, 10 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> index c48207b3..ac7d2da 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>> {
>> struct amdgpu_device *adev = ring->adev;
>> struct amdgpu_ib *ib = &ibs[0];
>> - struct dma_fence *tmp = NULL;
>> bool skip_preamble, need_ctx_switch;
>> unsigned patch_offset = ~0;
>> struct amdgpu_vm *vm;
>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>> }
>>
>> need_ctx_switch = ring->current_ctx != fence_ctx;
>> - if (ring->funcs->emit_pipeline_sync && job &&
>> - ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
>> - (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
>> - amdgpu_vm_need_pipeline_sync(ring, job))) {
>> - need_pipe_sync = true;
>>
>> - if (tmp)
>> - trace_amdgpu_ib_pipe_sync(job, tmp);
>> -
>> - dma_fence_put(tmp);
>> + if (ring->funcs->emit_pipeline_sync && job) {
>> + if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
>> + amdgpu_vm_need_pipeline_sync(ring, job))
>> + need_pipe_sync = true;
>> + else if (job->need_pipe_sync)
>> + need_pipe_sync = true;
>> }
>>
>> if (ring->funcs->insert_start)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 1d71f8c..dae997d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> (*job)->num_ibs = num_ibs;
>>
>> amdgpu_sync_create(&(*job)->sync);
>> - amdgpu_sync_create(&(*job)->sched_sync);
>> (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>> (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>>
>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>> dma_fence_put(job->fence);
>> amdgpu_sync_free(&job->sync);
>> - amdgpu_sync_free(&job->sched_sync);
>> kfree(job);
>> }
>>
>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>
>> dma_fence_put(job->fence);
>> amdgpu_sync_free(&job->sync);
>> - amdgpu_sync_free(&job->sched_sync);
>> kfree(job);
>> }
>>
>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>> bool need_pipe_sync = false;
>> int r;
>>
>> - fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
>> - if (fence && need_pipe_sync) {
>> - if (drm_sched_dependency_optimized(fence, s_entity)) {
>> - r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
>> - fence, false);
>> - if (r)
>> - DRM_ERROR("Error adding fence (%d)\n", r);
>> - }
>> + if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
>> + trace_amdgpu_ib_pipe_sync(job, fence);
>> + job->need_pipe_sync = true;
>> }
>>
>> while (fence == NULL && vm && !job->vmid) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> index e1b46a6..c1d00f0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> @@ -41,7 +41,6 @@ struct amdgpu_job {
>> struct drm_sched_job base;
>> struct amdgpu_vm *vm;
>> struct amdgpu_sync sync;
>> - struct amdgpu_sync sched_sync;
>> struct amdgpu_ib *ibs;
>> struct dma_fence *fence; /* the hw fence */
>> uint32_t preamble_status;
>> @@ -59,7 +58,7 @@ struct amdgpu_job {
>> /* user fence handling */
>> uint64_t uf_addr;
>> uint64_t uf_sequence;
>> -
>> + bool need_pipe_sync; /* require a pipeline sync for this job */
>> };
>>
>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,