[PATCH v2] drm/amd/amdgpu implement tdr advanced mode
Christian König
christian.koenig at amd.com
Tue Mar 9 17:41:30 UTC 2021
Yeah, sorry. I only saw your reply after I wrote this mail.
I think we can go ahead if you don't have any other major concerns about this.
Christian.
On 09.03.21 at 17:48, Andrey Grodzovsky wrote:
> If we are talking about '[PATCH v3] drm/amd/amdgpu implement tdr
> advanced mode', which was sent yesterday, then I already went over it
> and only had 2 cosmetic comments.
>
> Andrey
>
> On 2021-03-09 6:16 a.m., Christian König wrote:
>> Yeah, those are some really good points. I completely agree that we
>> shouldn't do any larger cleanup right now.
>>
>> But I think we still need some more review on this. I most likely
>> won't have enough time to look into this before the weekend.
>>
>> Andrey can you take a look as well?
>>
>> Thanks,
>> Christian.
>>
>>> On 09.03.21 at 08:29, Liu, Monk wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Christian
>>>
>>> What is feasible and practical now is:
>>> 1) we implement the advanced TDR mode upstream first (so we can
>>> copy the same scheme into our LTS kernel) -- if you want we can avoid
>>> changing the drm/scheduler code, but that approach was already
>>> rejected by you as too complicated
>>> 2) then we retire the mirror list concept and rework the
>>> drm/scheduler with a KFIFO
>>> 3) remove the guilty/karma handling from the scheduler
>>>
>>> So I basically agree with you on the spirit of the above changes: hide
>>> those AMD-internal concepts and tricks in the vendor's driver code and
>>> keep the scheduler simple and scalable.
>>> But that definitely needs a longer design discussion, so why
>>> don't we focus on our current problems now?
>>> As long as the new change doesn't regress anything, it is still a good
>>> change on top of the current TDR implementation.
>>>
>>> I would propose we only change the AMD side code this time. Jack's
>>> first version of the patch didn't touch the scheduler part, but you
>>> stated it was too complicated and rejected it.
>>>
>>> So the remaining option is what Jack did in v2, which needs
>>> to introduce a new scheduler API, drm_sched_resubmit_jobs2().
>>>
>>> Hah: --( what do you think ?
>>>
>>> Thanks
>>>
>>> ------------------------------------------
>>> Monk Liu | Cloud-GPU Core team
>>> ------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Monday, March 8, 2021 3:53 PM
>>> To: Liu, Monk <Monk.Liu at amd.com>; Zhang, Jack (Jian)
>>> <Jack.Zhang1 at amd.com>; amd-gfx at lists.freedesktop.org; Grodzovsky,
>>> Andrey <Andrey.Grodzovsky at amd.com>; Deng, Emily <Emily.Deng at amd.com>
>>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>>
>>>
>>>
>>> On 08.03.21 at 05:06, Liu, Monk wrote:
>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>
>>>>>> well first of all please completely drop the affinity group stuff
>>>>>> from this patch. We should concentrate on one feature at a time.
>>>> We need it to expedite the process; we can introduce this change in
>>>> another patch.
>>>>
>>>>
>>>>>> Then the implementation is way too complicated. All you need to do
>>>>>> is insert a dma_fence_wait after re-scheduling each job after a
>>>>>> reset.
>>>> No, that's not true. During "drm_sched_resubmit_jobs" all jobs on the
>>>> mirror list are pushed to the hw ring, but we can only allow the
>>>> first job into the ring in order to catch the real guilty one
>>>> (otherwise it is possible that a later job in the ring also has a bug
>>>> and affects our judgement). So we need to implement a new
>>>> "drm_sched_resubmit_jobs2()", along these lines:
>>> Something like this. But since waiting for the guilty job is AMD
>>> specific, we should rather rework this from the beginning.
>>>
>>> What I have in mind is the following:
>>> 1. Add a reference from the scheduler fence back to the job which is
>>> cleared only when the scheduler fence finishes.
>>> 2. Completely drop the ring_mirror_list and replace it with a kfifo
>>> of pointers to the active scheduler fences.
>>> 3. Replace drm_sched_resubmit_jobs with a
>>> drm_sched_for_each_active() macro which allows drivers to iterate
>>> over all the active jobs and resubmit/wait/mark them as guilty etc.
>>> (see the rough sketch below).
>>> 4. Remove the guilty/karma handling from the scheduler. This is
>>> something AMD specific and shouldn't leak into common code.
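>>>
>>> A very rough sketch of 2. and 3. (names and details here are just
>>> placeholders, not a finished API):
>>>
>>>     struct drm_gpu_scheduler {
>>>         /* ... existing members ... */
>>>
>>>         /* 2. kfifo of the scheduler fences currently pushed to the
>>>          * HW, replacing ring_mirror_list
>>>          */
>>>         DECLARE_KFIFO_PTR(active_fences, struct drm_sched_fence *);
>>>     };
>>>
>>>     /* 3. let drivers walk the active fences themselves; the fifo is
>>>      * rotated back into place over one full iteration, so don't
>>>      * break out of the loop early in this sketch
>>>      */
>>>     #define drm_sched_for_each_active(sched, s_fence, i, n)               \
>>>         for ((i) = 0, (n) = kfifo_len(&(sched)->active_fences);           \
>>>              (i) < (n) && kfifo_get(&(sched)->active_fences, &(s_fence)); \
>>>              kfifo_put(&(sched)->active_fences, (s_fence)), (i)++)
>>>
>>> amdgpu could then do the resubmit/wait/guilty handling on top of that
>>> without touching the common code.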
>>>
>>> Regards,
>>> Christian.
>>>
>>>> drm_sched_resubmit_jobs2() (lines added on top of
>>>> drm_sched_resubmit_jobs are marked with "new"):
>>>>
>>>> void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>>>> {
>>>>     struct drm_sched_job *s_job, *tmp;
>>>>     uint64_t guilty_context;
>>>>     bool found_guilty = false;
>>>>     struct dma_fence *fence;
>>>>     int i = 0;                                              /* new */
>>>>
>>>>     list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>         struct drm_sched_fence *s_fence = s_job->s_fence;
>>>>
>>>>         if (i >= max)                                       /* new */
>>>>             break;                                          /* new */
>>>>
>>>>         if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>>>>             found_guilty = true;
>>>>             guilty_context = s_job->s_fence->scheduled.context;
>>>>         }
>>>>
>>>>         if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>>>>             dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>>>
>>>>         dma_fence_put(s_job->s_fence->parent);
>>>>         fence = sched->ops->run_job(s_job);
>>>>         i++;                                                /* new */
>>>>
>>>>         if (IS_ERR_OR_NULL(fence)) {
>>>>             if (IS_ERR(fence))
>>>>                 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>>>>
>>>>             s_job->s_fence->parent = NULL;
>>>>         } else {
>>>>             s_job->s_fence->parent = fence;
>>>>         }
>>>>     }
>>>> }
>>>> EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
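>>>>
>>>> On the amdgpu side we would then only push the first pending job of
>>>> each ring and wait for it, roughly like this (just a sketch, reusing
>>>> the amdgpu_wait_resubmitted_jobs_completion() helper from Jack's v2
>>>> patch; error handling trimmed):
>>>>
>>>>     /* push only the first job of this ring back to the HW */
>>>>     drm_sched_resubmit_jobs2(&ring->sched, 1);
>>>>
>>>>     /* wait for its HW fence; a timeout means we found the real
>>>>      * guilty job
>>>>      */
>>>>     ret = amdgpu_wait_resubmitted_jobs_completion(&ring->sched,
>>>>                     ring->sched.timeout, &s_bad_job);
>>>>     if (ret == -1)
>>>>         drm_sched_increase_karma(s_bad_job);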
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> ------------------------------------------
>>>> Monk Liu | Cloud-GPU Core team
>>>> ------------------------------------------
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Sunday, March 7, 2021 3:03 AM
>>>> To: Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>;
>>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Deng, Emily
>>>> <Emily.Deng at amd.com>
>>>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>>>
>>>> Hi Jack,
>>>>
>>>> well first of all please completely drop the affinity group stuff
>>>> from this patch. We should concentrate on one feature at a time.
>>>>
>>>> Then the implementation is way too complicated. All you need to do is
>>>> insert a dma_fence_wait after re-scheduling each job after a reset.
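>>>>
>>>> Roughly something like this in the resubmit loop (just a sketch;
>>>> reference counting and error handling omitted):
>>>>
>>>>     list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
>>>>         struct dma_fence *fence = sched->ops->run_job(s_job);
>>>>
>>>>         if (IS_ERR_OR_NULL(fence))
>>>>             continue;
>>>>
>>>>         /* wait before pushing the next job, so the first timeout
>>>>          * points at the real offender
>>>>          */
>>>>         if (dma_fence_wait_timeout(fence, false, sched->timeout) == 0)
>>>>             drm_sched_increase_karma(s_job);
>>>>     }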
>>>>
>>>> In addition, this feature is completely AMD specific and
>>>> shouldn't affect the common scheduler in any way.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 06.03.21 at 18:25, Jack Zhang wrote:
>>>>> [Why]
>>>>> The previous tdr design treats the first job in job_timeout as the
>>>>> bad job.
>>>>> But sometimes a later bad compute job can block a good gfx job and
>>>>> cause an unexpected gfx job timeout because the gfx and compute rings
>>>>> share internal GC HW.
>>>>>
>>>>> [How]
>>>>> This patch implements an advanced tdr mode. It involves an additional
>>>>> synchronous pre-resubmit step (Step0 Resubmit) before the normal
>>>>> resubmit step in order to find the real bad job.
>>>>>
>>>>> 1. For a bailing TDR job, re-insert it into the mirror_list, don't
>>>>> set it as guilty, and leave it to be handled by the main reset thread.
>>>>>
>>>>> 2. Don't set the job as guilty in pre_asic_reset; leave it to be
>>>>> handled by the Step0 Resubmit stage.
>>>>>
>>>>> 3. At the Step0 Resubmit stage, first resubmit the jobs
>>>>> asynchronously, then iterate each ring's mirror_list and synchronously
>>>>> wait for each hw fence to be signaled. If a job's hw fence times out,
>>>>> we identify it as guilty and do a hw reset to recover the hw. After
>>>>> that, we do the normal resubmit step to resubmit the remaining jobs.
>>>>>
>>>>> 4. For a whole gpu reset (vram lost), skip Step0 Resubmit as every
>>>>> job after vram loss is considered a bad job.
>>>>>
>>>>> 5. Introduce the concept of an "Affinity Group".
>>>>> Doing two hw resets is not necessary when only one ring among some
>>>>> hw-related rings has jobs. Thus, we introduce the "affinity group".
>>>>> Hw-related rings can be added to a common affinity group, such as the
>>>>> gfx and compute rings. When a tdr happens, we iterate over all rings
>>>>> in the affinity group and skip the Step0 Resubmit stage if only one
>>>>> ring's mirror_list has valid sched jobs.
>>>>>
>>>>> V2:
>>>>> -fix a cherry-pick mistake for bailing TDR handling.
>>>>>
>>>>> -do the affinity_group check according to the bad job's sched rather
>>>>> than the default "1" so that multiple affinity groups can be
>>>>> pre-defined in the future.
>>>>>
>>>>> Signed-off-by: Jack Zhang <Jack.Zhang1 at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102
>>>>> +++++++++++++++++++--
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 47 ++++++++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 27 ++++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
>>>>> include/drm/gpu_scheduler.h | 1 +
>>>>> 7 files changed, 173 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index e247c3a2ec08..8632d7071292 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4188,6 +4188,37 @@ bool amdgpu_device_has_job_running(struct
>>>>> amdgpu_device *adev)
>>>>> return false;
>>>>> }
>>>>> +bool amdgpu_affinity_group_has_only_or_null_working_ring(struct
>>>>> +amdgpu_device *adev, struct drm_sched_job *s_job) {
>>>>> + int i;
>>>>> + int working_ring_num = 0;
>>>>> +
>>>>> + /*
>>>>> + * The job is considered as the real bad one
>>>>> + * if job's sched is not in affinity group
>>>>> + */
>>>>> + if (s_job->sched.affinity_group == 0)
>>>>> + return true;
>>>>> +
>>>>> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>> + struct amdgpu_ring *ring = adev->rings[i];
>>>>> +
>>>>> + if (!ring || !ring->sched.thread)
>>>>> + continue;
>>>>> +
>>>>> + /* for non-empty affinity ring, increase
>>>>> working_ring_num */
>>>>> + if (ring->sched.affinity_group ==
>>>>> s_job->sched.affinity_group) {
>>>>> + if
>>>>> (!list_empty(&ring->sched.ring_mirror_list))
>>>>> + working_ring_num++;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + if (working_ring_num > 1) {
>>>>> + return false;
>>>>> + }
>>>>> + return true;
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * amdgpu_device_should_recover_gpu - check if we should try
>>>>> GPU recovery
>>>>> *
>>>>> @@ -4310,8 +4341,10 @@ static int
>>>>> amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>> amdgpu_fence_driver_force_completion(ring);
>>>>> }
>>>>> - if(job)
>>>>> - drm_sched_increase_karma(&job->base);
>>>>> + if (amdgpu_gpu_recovery != 2) {
>>>>> + if (job)
>>>>> + drm_sched_increase_karma(&job->base);
>>>>> + }
>>>>> /* Don't suspend on bare metal if we are not going to HW
>>>>> reset the ASIC */
>>>>> if (!amdgpu_sriov_vf(adev)) {
>>>>> @@ -4639,7 +4672,7 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>> int i, r = 0;
>>>>> bool need_emergency_restart = false;
>>>>> bool audio_suspended = false;
>>>>> -
>>>>> + int tmp_vram_lost_counter;
>>>>> /*
>>>>> * Special case: RAS triggered and full reset isn't supported
>>>>> */
>>>>> @@ -4690,8 +4723,16 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>> job ? job->base.id : -1);
>>>>> /* even we skipped this reset, still need to set the
>>>>> job to guilty */
>>>>> - if (job)
>>>>> - drm_sched_increase_karma(&job->base);
>>>>> + if (job) {
>>>>> + if (amdgpu_gpu_recovery == 2) {
>>>>> + if (&job->base) {
>>>>> + spin_lock(&job->base.sched->job_list_lock);
>>>>> + list_add(&job->base.node,
>>>>> &job->base.sched->ring_mirror_list);
>>>>> + spin_unlock(&job->base.sched->job_list_lock);
>>>>> + }
>>>>> + } else
>>>>> + drm_sched_increase_karma(&job->base);
>>>>> + }
>>>>> goto skip_recovery;
>>>>> }
>>>>> @@ -4788,6 +4829,7 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>> }
>>>>> }
>>>>> + tmp_vram_lost_counter =
>>>>> atomic_read(&((adev)->vram_lost_counter));
>>>>> /* Actual ASIC resets if needed.*/
>>>>> /* TODO Implement XGMI hive reset logic for SRIOV */
>>>>> if (amdgpu_sriov_vf(adev)) {
>>>>> @@ -4804,18 +4846,64 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>> /* Post ASIC reset for all devs .*/
>>>>> list_for_each_entry(tmp_adev, device_list_handle,
>>>>> gmc.xgmi.head)
>>>>> {
>>>>> + int step = 1;
>>>>> + if (amdgpu_gpu_recovery == 2) {
>>>>> + if
>>>>> (amdgpu_affinity_group_has_only_or_null_working_ring(adev,&job->base)
>>>>> + || tmp_vram_lost_counter <
>>>>> atomic_read(&adev->vram_lost_counter)) {
>>>>> + DRM_INFO("Skip Stage0 Resubmit Stage\n");
>>>>> + /* set guilty */
>>>>> + drm_sched_increase_karma(&job->base);
>>>>> + step = 1;
>>>>> + } else {
>>>>> + DRM_INFO("Do Stage0 Resubmit Stage\n");
>>>>> + step = 0;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> +retry_resubmit:
>>>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>> struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>> + int ret = 0;
>>>>> + struct drm_sched_job *s_bad_job = NULL;
>>>>> if (!ring || !ring->sched.thread)
>>>>> continue;
>>>>> /* No point to resubmit jobs if we didn't HW reset*/
>>>>> - if (!tmp_adev->asic_reset_res && !job_signaled)
>>>>> + if (!tmp_adev->asic_reset_res && !job_signaled) {
>>>>> +
>>>>> drm_sched_resubmit_jobs(&ring->sched);
>>>>> - drm_sched_start(&ring->sched,
>>>>> !tmp_adev->asic_reset_res);
>>>>> + if (amdgpu_gpu_recovery == 2 && step == 0) {
>>>>> + ret =
>>>>> amdgpu_wait_resubmitted_jobs_completion(&ring->sched,
>>>>> ring->sched.timeout, &s_bad_job);
>>>>> + if (ret == -1) {
>>>>> + DRM_ERROR("Found the real bad job!
>>>>> ring:%s, job_id:%llx\n", ring->sched.name, s_bad_job->id);
>>>>> + /* set guilty */
>>>>> + drm_sched_increase_karma(s_bad_job);
>>>>> +
>>>>> + /* do hw reset */
>>>>> + if (amdgpu_sriov_vf(adev)) {
>>>>> + amdgpu_virt_fini_data_exchange(adev);
>>>>> + r = amdgpu_device_reset_sriov(adev,
>>>>> false);
>>>>> + if (r)
>>>>> + adev->asic_reset_res = r;
>>>>> + } else {
>>>>> + r = amdgpu_do_asic_reset(hive,
>>>>> device_list_handle, &need_full_reset, false);
>>>>> + if (r && r == -EAGAIN)
>>>>> + goto retry;
>>>>> + }
>>>>> +
>>>>> + /* add reset counter so that the
>>>>> following resubmitted job could flush vmid */
>>>>> + atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>> + step = 1;
>>>>> + goto retry_resubmit;
>>>>> + }
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + if (step == 1)
>>>>> + drm_sched_start(&ring->sched,
>>>>> !tmp_adev->asic_reset_res);
>>>>> }
>>>>> if (!amdgpu_device_has_dc_support(tmp_adev) &&
>>>>> !job_signaled) {
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 865f924772b0..9c3f4edb7532 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -509,7 +509,7 @@ module_param_named(compute_multipipe,
>>>>> amdgpu_compute_multipipe, int, 0444);
>>>>> * DOC: gpu_recovery (int)
>>>>> * Set to enable GPU recovery mechanism (1 = enable, 0 =
>>>>> disable). The default is -1 (auto, disabled except SRIOV).
>>>>> */
>>>>> -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 =
>>>>> enable, 0 = disable, -1 = auto)");
>>>>> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 =
>>>>> +advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)");
>>>>> module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>>> /**
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 759b34799221..28cda321157a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -281,6 +281,53 @@ void amdgpu_job_stop_all_jobs_on_sched(struct
>>>>> drm_gpu_scheduler *sched)
>>>>> }
>>>>> }
>>>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler
>>>>> +*sched, long timeout, struct drm_sched_job **s_bad_job) {
>>>>> + struct drm_sched_job *s_job, *tmp;
>>>>> + int ret = 0;
>>>>> +
>>>>> + list_for_each_entry_safe(s_job, tmp,
>>>>> &sched->ring_mirror_list, node) {
>>>>> + struct drm_sched_fence *s_fence = s_job->s_fence;
>>>>> +
>>>>> + if (s_fence->parent == NULL) { /* fail to get a hw
>>>>> fence */
>>>>> + /* process a job */
>>>>> + atomic_dec(&sched->num_jobs);
>>>>> + dma_fence_get(&s_fence->finished);
>>>>> + dma_fence_signal(&s_fence->finished);
>>>>> + dma_fence_put(&s_fence->finished);
>>>>> +
>>>>> + /* remove node from mirror_list and free the job */
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + list_del_init(&s_job->node);
>>>>> + spin_unlock(&sched->job_list_lock);
>>>>> + sched->ops->free_job(s_job);
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + ret = dma_fence_wait_timeout(s_fence->parent, false,
>>>>> timeout);
>>>>> +
>>>>> + if (ret > 0) { /* succeed */
>>>>> + /* process a job */
>>>>> + atomic_dec(&sched->num_jobs);
>>>>> + dma_fence_get(&s_fence->finished);
>>>>> + dma_fence_signal(&s_fence->finished);
>>>>> + dma_fence_put(&s_fence->finished);
>>>>> +
>>>>> + /* remove node from mirror_list and free the job */
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + list_del_init(&s_job->node);
>>>>> + spin_unlock(&sched->job_list_lock);
>>>>> + sched->ops->free_job(s_job);
>>>>> + continue;
>>>>> + } else if (ret == 0) {
>>>>> + *s_bad_job = s_job;
>>>>> + return -1; /* timeout */
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> const struct drm_sched_backend_ops amdgpu_sched_ops = {
>>>>> .dependency = amdgpu_job_dependency,
>>>>> .run_job = amdgpu_job_run,
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> index 81caac9b958a..25292f4699fb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> @@ -76,5 +76,5 @@ int amdgpu_job_submit_direct(struct amdgpu_job
>>>>> *job, struct amdgpu_ring *ring,
>>>>> struct dma_fence **fence);
>>>>> void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler
>>>>> *sched);
>>>>> -
>>>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler
>>>>> +*sched, long timeout, struct drm_sched_job **s_bad_job);
>>>>> #endif
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index b644c78475fd..cb50bfc80bc9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -35,6 +35,11 @@
>>>>> #include "amdgpu.h"
>>>>> #include "atom.h"
>>>>> +static char *amdgpu_affinity_group[] = { "gfx", "comp"
>>>>> +};
>>>>> +
>>>>> /*
>>>>> * Rings
>>>>> * Most engines on the GPU are fed via ring buffers. Ring @@
>>>>> -189,6
>>>>> +194,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct
>>>>> +amdgpu_ring *ring,
>>>>> ring->adev = adev;
>>>>> ring->idx = adev->num_rings++;
>>>>> adev->rings[ring->idx] = ring;
>>>>> + amdgpu_ring_set_affinity_group(ring);
>>>>> r = amdgpu_fence_driver_init_ring(ring,
>>>>> sched_hw_submission);
>>>>> if (r)
>>>>> return r;
>>>>> @@ -459,3 +465,24 @@ int amdgpu_ring_test_helper(struct
>>>>> amdgpu_ring *ring)
>>>>> ring->sched.ready = !r;
>>>>> return r;
>>>>> }
>>>>> +
>>>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring) {
>>>>> + struct amdgpu_device *adev = ring->adev;
>>>>> + int i;
>>>>> +
>>>>> + for (i = 0; i < ARRAY_SIZE(amdgpu_affinity_group); i++) {
>>>>> + char *temp_name = amdgpu_affinity_group[i];
>>>>> +
>>>>> + /* set ring's affinity_group bit if find it in
>>>>> affinity_group list */
>>>>> + if (strncmp(ring->name, temp_name,
>>>>> strlen(temp_name)) == 0) {
>>>>> + DRM_DEV_INFO(adev->dev, "set ring:%s in
>>>>> affinity_group\n",
>>>>> + ring->name);
>>>>> + ring->sched.affinity_group = 1;
>>>>> + return 0;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + ring->sched.affinity_group = 0;
>>>>> + return 0;
>>>>> +}
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 56acec1075ac..6b0d217e6f5a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -350,4 +350,5 @@ int amdgpu_debugfs_ring_init(struct
>>>>> amdgpu_device *adev,
>>>>> struct amdgpu_ring *ring);
>>>>> void amdgpu_debugfs_ring_fini(struct amdgpu_ring *ring);
>>>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring);
>>>>> #endif
>>>>> diff --git a/include/drm/gpu_scheduler.h
>>>>> b/include/drm/gpu_scheduler.h index 1c815e0a14ed..589cbaea35dc 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
>>>>> atomic_t _score;
>>>>> bool ready;
>>>>> bool free_guilty;
>>>>> + int affinity_group;
>>>>> };
>>>>> int drm_sched_init(struct drm_gpu_scheduler *sched,
>>