[RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

Andrey Grodzovsky andrey.grodzovsky at amd.com
Fri Feb 25 21:22:53 UTC 2022


Hey, patches attached - I applied the patches and resolved merge
conflicts but wasn't able to test them, as my onboard network card doesn't
work with the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I
need to check more).
The patches are on top of commit 'cababde192b2 Yifan Zhang, 2 days
ago, drm/amd/pm: fix mode2 reset fail for smu 13.0.5'.

Please test and let me know. Maybe by Monday I will be able to resolve 
the connectivity issue on 5.16.

Andrey

On 2022-02-24 22:13, JingWen Chen wrote:
> Hi Andrey,
>
> Sorry for the confusion, I meant the whole patch series. We are depending on this patch series to fix the concurrency issue within the SRIOV TDR sequence.
>
>
>
> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>> No problem if so but before I do,
>>
>>
>> JingWen - why do you think this patch is needed standalone now? It has no use without the
>> entire feature together with it. Are there some changes you want to make on top of that code?
>>
>>
>> Andrey
>>
>>
>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>> [Public]
>>>
>>>
>>> If it applies cleanly, feel free to drop it in.  I'll drop those patches for drm-next since they are already in drm-misc.
>>>
>>> Alex
>>>
>>> ------------------------------------------------------------------------
>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>> *To:* Chen, JingWen <JingWen.Chen2 at amd.com>; Christian König <ckoenig.leichtzumerken at gmail.com>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>> *Cc:* Liu, Monk <Monk.Liu at amd.com>; Chen, Horace <Horace.Chen at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; daniel at ffwll.ch <daniel at ffwll.ch>
>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready
>>> No, because the whole patch set, including this patch, has landed in
>>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>>> rebase, I guess.
>>>
>>> Andrey
>>>
>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>> Hi Andrey,
>>>>
>>>> Will you port this patch into amd-staging-drm-next?
>>>>
>>>> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>> All comments are addressed and the code is pushed. Thanks to everyone
>>>>> who helped with reviewing.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>> domain we are in - for a single device there is a single
>>>>>>> domain per device and so a single wq per device. For XGMI
>>>>>>> the reset domain spans the entire XGMI hive and so the
>>>>>>> reset wq is per hive.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>> One more comment below, with that fixed Reviewed-by: Christian König <christian.koenig at amd.com>.
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>>>>      3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device *adev)
>>>>>>>          return r;
>>>>>>>      }
>>>>>>>
>>>>>>> +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>>>>>> +{
>>>>>>> +    long timeout;
>>>>>>> +    int r, i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>>>>> +
>>>>>>> +        /* No need to setup the GPU scheduler for rings that don't need it */
>>>>>>> +        if (!ring || ring->no_scheduler)
>>>>>>> +            continue;
>>>>>>> +
>>>>>>> +        switch (ring->funcs->type) {
>>>>>>> +        case AMDGPU_RING_TYPE_GFX:
>>>>>>> +            timeout = adev->gfx_timeout;
>>>>>>> +            break;
>>>>>>> +        case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>> +            timeout = adev->compute_timeout;
>>>>>>> +            break;
>>>>>>> +        case AMDGPU_RING_TYPE_SDMA:
>>>>>>> +            timeout = adev->sdma_timeout;
>>>>>>> +            break;
>>>>>>> +        default:
>>>>>>> +            timeout = adev->video_timeout;
>>>>>>> +            break;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>> +                   ring->num_hw_submission, amdgpu_job_hang_limit,
>>>>>>> +                   timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
>>>>>>> +        if (r) {
>>>>>>> +            DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>> +                  ring->name);
>>>>>>> +            return r;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>      /**
>>>>>>>       * amdgpu_device_ip_init - run init for hardware IPs
>>>>>>>       *
>>>>>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>>>              }
>>>>>>>          }
>>>>>>>
>>>>>>> +    r = amdgpu_device_init_schedulers(adev);
>>>>>>> +    if (r)
>>>>>>> +        goto init_failed;
>>>>>>> +
>>>>>>>          /* Don't init kfd if whole hive need to be reset during init */
>>>>>>>          if (!adev->gmc.xgmi.pending_reset)
>>>>>>>              amdgpu_amdkfd_device_init(adev);
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 45977a72b5dd..fa302540c69a 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>                        atomic_t *sched_score)
>>>>>>>      {
>>>>>>>          struct amdgpu_device *adev = ring->adev;
>>>>>>> -    long timeout;
>>>>>>> -    int r;
>>>>>>>
>>>>>>>          if (!adev)
>>>>>>>              return -EINVAL;
>>>>>>> @@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>          spin_lock_init(&ring->fence_drv.lock);
>>>>>>>          ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
>>>>>>>                           GFP_KERNEL);
>>>>>>> -    if (!ring->fence_drv.fences)
>>>>>>> -        return -ENOMEM;
>>>>>>>
>>>>>>> -    /* No need to setup the GPU scheduler for rings that don't need it */
>>>>>>> -    if (ring->no_scheduler)
>>>>>>> -        return 0;
>>>>>>> +    ring->num_hw_submission = num_hw_submission;
>>>>>>> +    ring->sched_score = sched_score;
>>>>>> Let's move this into the caller and then use ring->num_hw_submission in the fence code as well.
>>>>>>
>>>>>> The maximum number of jobs on the ring is not really fence specific.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> -    switch (ring->funcs->type) {
>>>>>>> -    case AMDGPU_RING_TYPE_GFX:
>>>>>>> -        timeout = adev->gfx_timeout;
>>>>>>> -        break;
>>>>>>> -    case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>> -        timeout = adev->compute_timeout;
>>>>>>> -        break;
>>>>>>> -    case AMDGPU_RING_TYPE_SDMA:
>>>>>>> -        timeout = adev->sdma_timeout;
>>>>>>> -        break;
>>>>>>> -    default:
>>>>>>> -        timeout = adev->video_timeout;
>>>>>>> -        break;
>>>>>>> -    }
>>>>>>> -
>>>>>>> -    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>> -               num_hw_submission, amdgpu_job_hang_limit,
>>>>>>> -               timeout, NULL, sched_score, ring->name);
>>>>>>> -    if (r) {
>>>>>>> -        DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>> -              ring->name);
>>>>>>> -        return r;
>>>>>>> -    }
>>>>>>> +    if (!ring->fence_drv.fences)
>>>>>>> +        return -ENOMEM;
>>>>>>>
>>>>>>>          return 0;
>>>>>>>      }
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>> index fae7d185ad0d..7f20ce73a243 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>>>>>>>          bool has_compute_vm_bug;
>>>>>>>          bool            no_scheduler;
>>>>>>>          int            hw_prio;
>>>>>>> +    unsigned num_hw_submission;
>>>>>>> +    atomic_t        *sched_score;
>>>>>>>      };
>>>>>>>
>>>>>>>      #define amdgpu_ring_parse_cs(r, p, ib) ((r)->funcs->parse_cs((p), (ib)))
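
To make the ordering argument in the quoted commit message concrete, here is a minimal userspace sketch in C. It is not the driver code: every type and function name in it (struct device, struct ring, struct reset_domain, ring_timeout, device_reset_domain, init_schedulers) is a hypothetical stand-in, and the "hive domain or own domain" choice is an assumption drawn from the commit message, not from the patch itself, which simply passes adev->reset_domain.wq to drm_sched_init(). The sketch only illustrates why scheduler setup has to run after hive membership is known: each ring's scheduler captures the reset domain at init time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's real types. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_SDMA, RING_VIDEO };

/* Models a reset domain; wq_id stands in for its reset workqueue. */
struct reset_domain { int wq_id; };

struct ring {
	enum ring_type type;
	int no_scheduler;                        /* ring has no GPU scheduler */
	const struct reset_domain *sched_domain; /* captured at scheduler init */
	long sched_timeout;                      /* captured at scheduler init */
};

#define MAX_RINGS 4

struct device {
	long gfx_timeout, compute_timeout, sdma_timeout, video_timeout;
	struct reset_domain own_domain;    /* used by a standalone device */
	struct reset_domain *hive_domain;  /* assumed non-NULL only inside a hive */
	struct ring *rings[MAX_RINGS];
};

/* Same per-type timeout selection as the switch in the quoted patch. */
static long ring_timeout(const struct device *dev, enum ring_type type)
{
	switch (type) {
	case RING_GFX:
		return dev->gfx_timeout;
	case RING_COMPUTE:
		return dev->compute_timeout;
	case RING_SDMA:
		return dev->sdma_timeout;
	default:
		return dev->video_timeout;
	}
}

/* Assumption: a device in a hive shares the hive's domain, else uses its own. */
static const struct reset_domain *device_reset_domain(const struct device *dev)
{
	return dev->hive_domain ? dev->hive_domain : &dev->own_domain;
}

/*
 * Set up every ring that needs a scheduler. Must run only after hive
 * membership is known, because the domain is captured here.
 * Returns the number of rings that were set up.
 */
static int init_schedulers(struct device *dev)
{
	const struct reset_domain *domain;
	int i, count = 0;

	assert(dev != NULL);
	domain = device_reset_domain(dev);

	for (i = 0; i < MAX_RINGS; ++i) {
		struct ring *ring = dev->rings[i];

		/* Skip empty slots and rings that do not need a scheduler. */
		if (!ring || ring->no_scheduler)
			continue;

		ring->sched_timeout = ring_timeout(dev, ring->type);
		ring->sched_domain = domain;
		count++;
	}

	return count;
}
```

If init_schedulers() ran before hive_domain was set, every ring would capture the per-device domain instead of the shared one, which is the situation the patch avoids by moving the call into amdgpu_device_ip_init().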
Attachments (all of type text/x-patch, scrubbed by the list archive):

0012-drm-amdgpu-Fix-compile-error.patch (1471 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0012.bin>
0011-drm-amdgpu-Revert-drm-amdgpu-annotate-a-false-positi.patch (2970 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0013.bin>
0010-drm-amdgpu-Rework-amdgpu_device_lock_adev.patch (6172 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0014.bin>
0009-drm-amdgpu-Move-in_gpu_reset-into-reset_domain.patch (5851 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0015.bin>
0008-drm-amdgpu-Move-reset-sem-into-reset_domain.patch (16466 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0016.bin>
0007-drm-amdgpu-Rework-reset-domain-to-be-refcounted.patch (13262 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0017.bin>
0006-drm-amdgpu-Drop-concurrent-GPU-reset-protection-for-.patch (5888 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0018.bin>
0005-drm-amdgpu-Drop-hive-in_reset.patch (3317 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0019.bin>
0004-drm-amd-virt-For-SRIOV-send-GPU-reset-directly-to-TD.patch (4162 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0020.bin>
0003-drm-amdgpu-Serialize-non-TDR-gpu-recovery-with-TDRs.patch (4070 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0021.bin>
0002-drm-amdgpu-Move-scheduler-init-to-after-XGMI-is-read.patch (6766 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0022.bin>
0001-drm-amdgpu-Introduce-reset-domain.patch (4391 bytes)
<https://lists.freedesktop.org/archives/amd-gfx/attachments/20220225/f78dd40a/attachment-0023.bin>