[PATCH v2] drm/amd/amdgpu implement tdr advanced mode
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Mar 9 16:48:01 UTC 2021
If we are talking about '[PATCH v3] drm/amd/amdgpu implement tdr advanced
mode', which was sent yesterday, then I already went over it and only had
two cosmetic comments.
Andrey
On 2021-03-09 6:16 a.m., Christian König wrote:
> Yeah, those are some really good points. I completely agree that we
> shouldn't do any larger cleanup right now.
>
> But I think we still need some more review on this. I most likely won't
> have enough time to look into this before the weekend.
>
> Andrey can you take a look as well?
>
> Thanks,
> Christian.
>
> Am 09.03.21 um 08:29 schrieb Liu, Monk:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Christian
>>
>> What is feasible and practical now is:
>> 1) We implement the advanced TDR mode upstream first (so we can copy
>> the same scheme into our LTS kernel). If you want, we can avoid
>> changing the drm/scheduler code, but that variant was already
>> rejected by you as too complicated.
>> 2) Then we retire the mirror-list concept and rework drm/scheduler
>> around a KFIFO.
>> 3) Finally we remove the guilty/karma handling from the scheduler.
>>
>> So I basically agree with you on the spirit of the above changes: hide
>> the AMD-internal concepts and tricks in the vendor driver and keep the
>> scheduler simple and scalable.
>> But that definitely needs a longer design discussion, so why don't we
>> focus on our current problem now? As long as the new change doesn't
>> regress anything, it is still a good change on top of the current TDR
>> implementation.
>>
>> I would propose that we only change AMD code this time. Jack's first
>> version of the patch didn't touch the scheduler at all, but you stated
>> it was too complicated and rejected it.
>>
>> So the acceptable option that remains is what Jack did in v2, which
>> needs a new scheduler API, drm_sched_resubmit_jobs2().
>>
>> What do you think?
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Monday, March 8, 2021 3:53 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; Zhang, Jack (Jian)
>> <Jack.Zhang1 at amd.com>; amd-gfx at lists.freedesktop.org; Grodzovsky,
>> Andrey <Andrey.Grodzovsky at amd.com>; Deng, Emily <Emily.Deng at amd.com>
>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>
>>
>>
>> Am 08.03.21 um 05:06 schrieb Liu, Monk:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>>>> Well, first of all please completely drop the affinity group stuff
>>>>> from this patch. We should concentrate on one feature at a time.
>>> We need it to expedite the process; we can introduce this change in
>>> another patch.
>>>
>>>
>>>>> Then the implementation is way too complicated. All you need to do is
>>>>> insert a dma_fence_wait after re-scheduling each job after a reset.
>>> No, that's not true. During "drm_sched_resubmit_jobs" all jobs on the
>>> mirror list are pushed into the hw ring, but we can only allow the
>>> first job into the ring in order to catch the real guilty one
>>> (otherwise a later job in the ring could also be buggy and affect our
>>> judgement). So we need to implement a new "drm_sched_resubmit_jobs2()",
>>> along these lines:
>> Something like this. But since waiting for the guilty job is AMD
>> specific we should rather rework the stuff from the beginning.
>>
>> What I have in mind is the following:
>> 1. Add a reference from the scheduler fence back to the job which is
>> cleared only when the scheduler fence finishes.
>> 2. Completely drop the ring_mirror_list and replace it with a kfifo of
>> pointers to the active scheduler fences.
>> 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active()
>> macro which allows drivers to iterate over all the active jobs and
>> resubmit/wait/mark them as guilty etc etc..
>> 4. Remove the guilty/karma handling from the scheduler. This is
>> something AMD specific and shouldn't leak into common code.
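>>
>> Very roughly something like this (untested sketch; the active_fences
>> fifo, the macro and the amdgpu helper below are made-up names, not
>> existing API):
>>
>> #include <linux/kfifo.h>
>>
>> /*
>>  * Assume drm_gpu_scheduler grows
>>  *     DECLARE_KFIFO_PTR(active_fences, struct drm_sched_fence *);
>>  * as the replacement for ring_mirror_list.
>>  *
>>  * Rotate through the fifo exactly once: pop each fence from the head
>>  * and push it back to the tail after the loop body ran.  Breaking out
>>  * of the body would drop the current element, so don't.
>>  */
>> #define drm_sched_for_each_active(sched, s_fence, n)                      \
>>     for ((n) = kfifo_len(&(sched)->active_fences);                        \
>>          (n)-- > 0 && kfifo_get(&(sched)->active_fences, &(s_fence));     \
>>          kfifo_put(&(sched)->active_fences, (s_fence)))
>>
>> /* amdgpu could then do its guilty handling without touching common code */
>> static void amdgpu_mark_guilty_fences(struct amdgpu_ring *ring)
>> {
>>     struct drm_sched_fence *s_fence;
>>     unsigned int n;
>>
>>     drm_sched_for_each_active(&ring->sched, s_fence, n) {
>>         if (s_fence->parent &&
>>             dma_fence_wait_timeout(s_fence->parent, false,
>>                                    ring->sched.timeout) == 0)
>>             dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>     }
>> }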
>>
>> Regards,
>> Christian.
>>
>>> drm_sched_resubmit_jobs2():
>>>
>>> ~ void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>>>   {
>>>       struct drm_sched_job *s_job, *tmp;
>>>       uint64_t guilty_context;
>>>       bool found_guilty = false;
>>>       struct dma_fence *fence;
>>> +     int i = 0;
>>>
>>>       list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>           struct drm_sched_fence *s_fence = s_job->s_fence;
>>>
>>> +         if (i >= max)
>>> +             break;
>>> +
>>>           if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>>>               found_guilty = true;
>>>               guilty_context = s_job->s_fence->scheduled.context;
>>>           }
>>>
>>>           if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>>>               dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>>
>>>           dma_fence_put(s_job->s_fence->parent);
>>>           fence = sched->ops->run_job(s_job);
>>> +         i++;
>>>
>>>           if (IS_ERR_OR_NULL(fence)) {
>>>               if (IS_ERR(fence))
>>>                   dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>>>
>>>               s_job->s_fence->parent = NULL;
>>>           } else {
>>>               s_job->s_fence->parent = fence;
>>>           }
>>>       }
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>>>
>>>
>>>
>>>
>>> Thanks
>>>
>>> ------------------------------------------
>>> Monk Liu | Cloud-GPU Core team
>>> ------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Sunday, March 7, 2021 3:03 AM
>>> To: Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>;
>>> amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Deng, Emily
>>> <Emily.Deng at amd.com>
>>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>>
>>> Hi Jack,
>>>
>>> Well, first of all please completely drop the affinity group stuff
>>> from this patch. We should concentrate on one feature at a time.
>>>
>>> Then the implementation is way too complicated. All you need to do is
>>> insert a dma_fence_wait after re-scheduling each job after a reset.
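>>>
>>> Roughly like this on the amdgpu side (untested sketch; the helper name
>>> is made up, and using a bounded dma_fence_wait_timeout() instead of an
>>> unbounded wait is my assumption):
>>>
>>> static void amdgpu_resubmit_and_wait(struct amdgpu_ring *ring)
>>> {
>>>     struct drm_sched_job *s_job;
>>>
>>>     /* push everything back to the hw ring first */
>>>     drm_sched_resubmit_jobs(&ring->sched);
>>>
>>>     /* then block on each recreated hw fence in submission order */
>>>     list_for_each_entry(s_job, &ring->sched.ring_mirror_list, node) {
>>>         if (!s_job->s_fence->parent)
>>>             continue; /* run_job() failed, nothing to wait for */
>>>
>>>         /* the first fence that never signals points at the bad job */
>>>         if (dma_fence_wait_timeout(s_job->s_fence->parent, false,
>>>                                    ring->sched.timeout) == 0)
>>>             drm_sched_increase_karma(s_job);
>>>     }
>>> }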
>>>
>>> Additional to that this feature is completely AMD specific and
>>> shouldn't affect the common scheduler in any way.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 06.03.21 um 18:25 schrieb Jack Zhang:
>>>> [Why]
>>>> Previous tdr design treats the first job in job_timeout as the bad job.
>>>> But sometimes a later bad compute job can block a good gfx job and
>>>> cause an unexpected gfx job timeout because the gfx and compute rings
>>>> share internal GC hardware.
>>>>
>>>> [How]
>>>> This patch implements an advanced tdr mode. It involves an additional
>>>> synchronous pre-resubmit step (Step0 Resubmit) before the normal
>>>> resubmit step in order to find the real bad job.
>>>>
>>>> 1. For a bailing TDR job, re-insert it into the mirror_list, don't
>>>> mark it guilty, and leave it to be handled by the main reset thread.
>>>>
>>>> 2. Don't mark the job guilty in pre_asic_reset; leave it to be
>>>> handled by the Step0 Resubmit stage.
>>>>
>>>> 3. At the Step0 Resubmit stage, first resubmit the jobs asynchronously,
>>>> then iterate each ring's mirror_list and synchronously wait for each hw
>>>> fence to be signaled. If a job's hw fence times out, we identify it as
>>>> guilty and do a hw reset to recover the hw. After that, we do the
>>>> normal resubmit step to resubmit the remaining jobs.
>>>>
>>>> 4. For a whole-gpu reset (vram lost), skip Step0 Resubmit since every
>>>> job after vram loss is considered a bad job.
>>>>
>>>> 5. Introduce the concept of an "affinity group".
>>>> Doing two hw resets is not necessary when only one ring among a set of
>>>> hw-related rings has jobs, so we introduce the "affinity group":
>>>> hw-related rings, such as the gfx and compute rings, can be added to a
>>>> common affinity group. When a tdr happens, we iterate over all rings in
>>>> the affinity group and skip the Step0 Resubmit stage if only one ring's
>>>> mirror_list has valid sched jobs.
>>>>
>>>> V2:
>>>> -fix a cherry-pick mistake for bailing TDR handling.
>>>>
>>>> -do the affinity_group check based on the bad job's sched rather
>>>> than the default "1", so that multiple affinity groups can be
>>>> pre-defined in the future.
>>>>
>>>> Signed-off-by: Jack Zhang <Jack.Zhang1 at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++++++++++++++++++--
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  47 ++++++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |   2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  27 ++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
>>>> include/drm/gpu_scheduler.h                |   1 +
>>>> 7 files changed, 173 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index e247c3a2ec08..8632d7071292 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4188,6 +4188,37 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>>>>      return false;
>>>>  }
>>>> +bool amdgpu_affinity_group_has_only_or_null_working_ring(struct amdgpu_device *adev, struct drm_sched_job *s_job) {
>>>> +    int i;
>>>> +    int working_ring_num = 0;
>>>> +
>>>> +    /*
>>>> +     * The job is considered as the real bad one
>>>> +     * if job's sched is not in affinity group
>>>> +     */
>>>> +    if (s_job->sched.affinity_group == 0)
>>>> +        return true;
>>>> +
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +        if (!ring || !ring->sched.thread)
>>>> +            continue;
>>>> +
>>>> +        /* for non-empty affinity ring, increase working_ring_num */
>>>> +        if (ring->sched.affinity_group == s_job->sched.affinity_group) {
>>>> +            if (!list_empty(&ring->sched.ring_mirror_list))
>>>> +                working_ring_num++;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (working_ring_num > 1) {
>>>> +        return false;
>>>> +    }
>>>> +    return true;
>>>> +}
>>>> +
>>>>  /**
>>>>   * amdgpu_device_should_recover_gpu - check if we should try GPU recovery
>>>>   *
>>>> @@ -4310,8 +4341,10 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>          amdgpu_fence_driver_force_completion(ring);
>>>>      }
>>>> -    if(job)
>>>> -        drm_sched_increase_karma(&job->base);
>>>> +    if (amdgpu_gpu_recovery != 2) {
>>>> +        if (job)
>>>> +            drm_sched_increase_karma(&job->base);
>>>> +    }
>>>>      /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
>>>>      if (!amdgpu_sriov_vf(adev)) {
>>>> @@ -4639,7 +4672,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>      int i, r = 0;
>>>>      bool need_emergency_restart = false;
>>>>      bool audio_suspended = false;
>>>> -
>>>> +    int tmp_vram_lost_counter;
>>>>      /*
>>>>       * Special case: RAS triggered and full reset isn't supported
>>>>       */
>>>> @@ -4690,8 +4723,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>              job ? job->base.id : -1);
>>>>      /* even we skipped this reset, still need to set the job to guilty */
>>>> -    if (job)
>>>> -        drm_sched_increase_karma(&job->base);
>>>> +    if (job) {
>>>> +        if (amdgpu_gpu_recovery == 2) {
>>>> +            if (&job->base) {
>>>> +                spin_lock(&job->base.sched->job_list_lock);
>>>> +                list_add(&job->base.node, &job->base.sched->ring_mirror_list);
>>>> +                spin_unlock(&job->base.sched->job_list_lock);
>>>> +            }
>>>> +        } else
>>>> +            drm_sched_increase_karma(&job->base);
>>>> +    }
>>>>      goto skip_recovery;
>>>>  }
>>>> @@ -4788,6 +4829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>          }
>>>>      }
>>>> +    tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>>>>      /* Actual ASIC resets if needed.*/
>>>>      /* TODO Implement XGMI hive reset logic for SRIOV */
>>>>      if (amdgpu_sriov_vf(adev)) {
>>>> @@ -4804,18 +4846,64 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>      /* Post ASIC reset for all devs .*/
>>>>      list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
>>>>      {
>>>> +        int step = 1;
>>>> +        if (amdgpu_gpu_recovery == 2) {
>>>> +            if (amdgpu_affinity_group_has_only_or_null_working_ring(adev, &job->base)
>>>> +                || tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter)) {
>>>> +                DRM_INFO("Skip Stage0 Resubmit Stage\n");
>>>> +                /* set guilty */
>>>> +                drm_sched_increase_karma(&job->base);
>>>> +                step = 1;
>>>> +            } else {
>>>> +                DRM_INFO("Do Stage0 Resubmit Stage\n");
>>>> +                step = 0;
>>>> +            }
>>>> +        }
>>>> +
>>>> +retry_resubmit:
>>>>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>              struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>> +            int ret = 0;
>>>> +            struct drm_sched_job *s_bad_job = NULL;
>>>>              if (!ring || !ring->sched.thread)
>>>>                  continue;
>>>>              /* No point to resubmit jobs if we didn't HW reset*/
>>>> -            if (!tmp_adev->asic_reset_res && !job_signaled)
>>>> +            if (!tmp_adev->asic_reset_res && !job_signaled) {
>>>> +
>>>>                  drm_sched_resubmit_jobs(&ring->sched);
>>>> -            drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>>> +                if (amdgpu_gpu_recovery == 2 && step == 0) {
>>>> +                    ret = amdgpu_wait_resubmitted_jobs_completion(&ring->sched, ring->sched.timeout, &s_bad_job);
>>>> +                    if (ret == -1) {
>>>> +                        DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n", ring->sched.name, s_bad_job->id);
>>>> +                        /* set guilty */
>>>> +                        drm_sched_increase_karma(s_bad_job);
>>>> +
>>>> +                        /* do hw reset */
>>>> +                        if (amdgpu_sriov_vf(adev)) {
>>>> +                            amdgpu_virt_fini_data_exchange(adev);
>>>> +                            r = amdgpu_device_reset_sriov(adev, false);
>>>> +                            if (r)
>>>> +                                adev->asic_reset_res = r;
>>>> +                        } else {
>>>> +                            r = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
>>>> +                            if (r && r == -EAGAIN)
>>>> +                                goto retry;
>>>> +                        }
>>>> +
>>>> +                        /* add reset counter so that the following resubmitted job could flush vmid */
>>>> +                        atomic_inc(&tmp_adev->gpu_reset_counter);
>>>> +                        step = 1;
>>>> +                        goto retry_resubmit;
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +
>>>> +            if (step == 1)
>>>> +                drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>>>          }
>>>>          if (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 865f924772b0..9c3f4edb7532 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -509,7 +509,7 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>>>>   * DOC: gpu_recovery (int)
>>>>   * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>>>>   */
>>>> -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
>>>> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 = advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)");
>>>>  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>>  /**
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 759b34799221..28cda321157a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -281,6 +281,53 @@ void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>>>>      }
>>>>  }
>>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job) {
>>>> +    struct drm_sched_job *s_job, *tmp;
>>>> +    int ret = 0;
>>>> +
>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> +        struct drm_sched_fence *s_fence = s_job->s_fence;
>>>> +
>>>> +        if (s_fence->parent == NULL) { /* fail to get a hw fence */
>>>> +            /* process a job */
>>>> +            atomic_dec(&sched->num_jobs);
>>>> +            dma_fence_get(&s_fence->finished);
>>>> +            dma_fence_signal(&s_fence->finished);
>>>> +            dma_fence_put(&s_fence->finished);
>>>> +
>>>> +            /* remove node from mirror_list and free the job */
>>>> +            spin_lock(&sched->job_list_lock);
>>>> +            list_del_init(&s_job->node);
>>>> +            spin_unlock(&sched->job_list_lock);
>>>> +            sched->ops->free_job(s_job);
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        ret = dma_fence_wait_timeout(s_fence->parent, false, timeout);
>>>> +
>>>> +        if (ret > 0) { /* succeed */
>>>> +            /* process a job */
>>>> +            atomic_dec(&sched->num_jobs);
>>>> +            dma_fence_get(&s_fence->finished);
>>>> +            dma_fence_signal(&s_fence->finished);
>>>> +            dma_fence_put(&s_fence->finished);
>>>> +
>>>> +            /* remove node from mirror_list and free the job */
>>>> +            spin_lock(&sched->job_list_lock);
>>>> +            list_del_init(&s_job->node);
>>>> +            spin_unlock(&sched->job_list_lock);
>>>> +            sched->ops->free_job(s_job);
>>>> +            continue;
>>>> +        } else if (ret == 0) {
>>>> +            *s_bad_job = s_job;
>>>> +            return -1; /* timeout */
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>  const struct drm_sched_backend_ops amdgpu_sched_ops = {
>>>>      .dependency = amdgpu_job_dependency,
>>>>      .run_job = amdgpu_job_run,
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> index 81caac9b958a..25292f4699fb 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> @@ -76,5 +76,5 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>>>>                   struct dma_fence **fence);
>>>>  void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
>>>> -
>>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job);
>>>>  #endif
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> index b644c78475fd..cb50bfc80bc9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> @@ -35,6 +35,11 @@
>>>>  #include "amdgpu.h"
>>>>  #include "atom.h"
>>>> +static char *amdgpu_affinity_group[] = {
>>>> +    "gfx", "comp"
>>>> +};
>>>> +
>>>>  /*
>>>>   * Rings
>>>>   * Most engines on the GPU are fed via ring buffers.  Ring
>>>> @@ -189,6 +194,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>>>      ring->adev = adev;
>>>>      ring->idx = adev->num_rings++;
>>>>      adev->rings[ring->idx] = ring;
>>>> +    amdgpu_ring_set_affinity_group(ring);
>>>>      r = amdgpu_fence_driver_init_ring(ring, sched_hw_submission);
>>>>      if (r)
>>>>          return r;
>>>> @@ -459,3 +465,24 @@ int amdgpu_ring_test_helper(struct amdgpu_ring *ring)
>>>>      ring->sched.ready = !r;
>>>>      return r;
>>>>  }
>>>> +
>>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring) {
>>>> +    struct amdgpu_device *adev = ring->adev;
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < ARRAY_SIZE(amdgpu_affinity_group); i++) {
>>>> +        char *temp_name = amdgpu_affinity_group[i];
>>>> +
>>>> +        /* set ring's affinity_group bit if find it in affinity_group list */
>>>> +        if (strncmp(ring->name, temp_name, strlen(temp_name)) == 0) {
>>>> +            DRM_DEV_INFO(adev->dev, "set ring:%s in affinity_group\n",
>>>> +                         ring->name);
>>>> +            ring->sched.affinity_group = 1;
>>>> +            return 0;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    ring->sched.affinity_group = 0;
>>>> +    return 0;
>>>> +}
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 56acec1075ac..6b0d217e6f5a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -350,4 +350,5 @@ int amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>>>                               struct amdgpu_ring *ring);
>>>>  void amdgpu_debugfs_ring_fini(struct amdgpu_ring *ring);
>>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring);
>>>>  #endif
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index 1c815e0a14ed..589cbaea35dc 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
>>>>      atomic_t            _score;
>>>>      bool                ready;
>>>>      bool                free_guilty;
>>>> +    int                 affinity_group;
>>>>  };
>>>>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>