[RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Thu Mar 3 16:36:50 UTC 2022
I pushed all the changes including your patch.
Andrey
On 2022-03-02 22:16, Andrey Grodzovsky wrote:
> OK, I will do a quick smoke test tomorrow and push all of it then.
>
> Andrey
>
> On 2022-03-02 21:59, Chen, JingWen wrote:
>> Hi Andrey,
>>
>> I don't have the bare metal environment, I can only test the SRIOV
>> cases.
>>
>> Best Regards,
>> JingWen Chen
>>
>>
>>
>>> On Mar 3, 2022, at 01:55, Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>
>>> The patch is acked-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>
>>> If you also smoke tested on bare metal, feel free to apply all the
>>> patches; if not, let me know.
>>>
>>> Andrey
>>>
>>> On 2022-03-02 04:51, JingWen Chen wrote:
>>>> Hi Andrey,
>>>>
>>>> Most of the patches are OK, but the code will introduce an IB
>>>> test failure on the disabled VCN of sienna_cichlid.
>>>>
>>>> In the SRIOV use case we disable one VCN on sienna_cichlid. I have
>>>> attached a patch to fix this issue; please check the attachment.
>>>>
>>>> Best Regards,
>>>>
>>>> Jingwen Chen
>>>>
>>>>
>>>> On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
>>>>> Hey, patches attached - I applied the patches and resolved the
>>>>> merge conflicts but wasn't able to test, as my board's onboard
>>>>> network card doesn't work with the 5.16 kernel (it does with 5.17;
>>>>> maybe it's a Kconfig issue and I need to check more).
>>>>> The patches are on top of the 'cababde192b2 Yifan Zhang 2 days
>>>>> ago drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.
>>>>>
>>>>> Please test and let me know. Maybe by Monday I will be able to
>>>>> resolve the connectivity issue on 5.16.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-02-24 22:13, JingWen Chen wrote:
>>>>>> Hi Andrey,
>>>>>>
>>>>>> Sorry for the confusion, I meant the whole patch series. We are
>>>>>> depending on this patch series to fix the concurrency issue
>>>>>> within the SRIOV TDR sequence.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>>>>>> No problem if so but before I do,
>>>>>>>
>>>>>>>
>>>>>>> JingWen - why do you think this patch is needed as a standalone
>>>>>>> now? It has no use without the entire feature together with it.
>>>>>>> Are there some changes you want to do on top of that code?
>>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>>>>>> [Public]
>>>>>>>>
>>>>>>>>
>>>>>>>> If it applies cleanly, feel free to drop it in. I'll drop those
>>>>>>>> patches for drm-next since they are already in drm-misc.
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>> behalf of Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>>>>>> *To:* Chen, JingWen <JingWen.Chen2 at amd.com>; Christian König
>>>>>>>> <ckoenig.leichtzumerken at gmail.com>;
>>>>>>>> dri-devel at lists.freedesktop.org
>>>>>>>> <dri-devel at lists.freedesktop.org>;
>>>>>>>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>>>>>>> *Cc:* Liu, Monk <Monk.Liu at amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>;
>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>; daniel at ffwll.ch
>>>>>>>> <daniel at ffwll.ch>
>>>>>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init
>>>>>>>> to after XGMI is ready
>>>>>>>> No, because the whole patch set, including this patch, landed in
>>>>>>>> drm-misc-next and will reach amd-staging-drm-next on the next
>>>>>>>> upstream rebase, I guess.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>>>>>> Hi Andrey,
>>>>>>>>>
>>>>>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>>>>>
>>>>>>>>> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>>>>>> All comments are fixed and code pushed. Thanks for everyone
>>>>>>>>>> who helped reviewing.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>>>>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>>>>>>> domain we are in - for a single device there is a single
>>>>>>>>>>>> domain per device, and so a single wq per device. For XGMI
>>>>>>>>>>>> the reset domain spans the entire XGMI hive, and so the
>>>>>>>>>>>> reset wq is per hive.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>>>> One more comment below, with that fixed Reviewed-by:
>>>>>>>>>>> Christian König <christian.koenig at amd.com>.
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45
>>>>>>>>>>>> ++++++++++++++++++++++
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 34
>>>>>>>>>>>> ++--------------
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +
>>>>>>>>>>>> 3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> @@ -2287,6 +2287,47 @@ static int
>>>>>>>>>>>> amdgpu_device_fw_loading(struct amdgpu_device *adev)
>>>>>>>>>>>> return r;
>>>>>>>>>>>> }
>>>>>>>>>>>> +static int amdgpu_device_init_schedulers(struct
>>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + long timeout;
>>>>>>>>>>>> + int r, i;
>>>>>>>>>>>> +
>>>>>>>>>>>> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>>>>> + struct amdgpu_ring *ring = adev->rings[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* No need to setup the GPU scheduler for rings
>>>>>>>>>>>> that don't need it */
>>>>>>>>>>>> + if (!ring || ring->no_scheduler)
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> +
>>>>>>>>>>>> + switch (ring->funcs->type) {
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>>>> + timeout = adev->gfx_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>>>> + timeout = adev->compute_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>>>> + timeout = adev->sdma_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + default:
>>>>>>>>>>>> + timeout = adev->video_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>>>> + ring->num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>>>> + timeout, adev->reset_domain.wq,
>>>>>>>>>>>> ring->sched_score, ring->name);
>>>>>>>>>>>> + if (r) {
>>>>>>>>>>>> + DRM_ERROR("Failed to create scheduler on ring
>>>>>>>>>>>> %s.\n",
>>>>>>>>>>>> + ring->name);
>>>>>>>>>>>> + return r;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + return 0;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +
>>>>>>>>>>>> /**
>>>>>>>>>>>> * amdgpu_device_ip_init - run init for hardware IPs
>>>>>>>>>>>> *
>>>>>>>>>>>> @@ -2419,6 +2460,10 @@ static int
>>>>>>>>>>>> amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> + r = amdgpu_device_init_schedulers(adev);
>>>>>>>>>>>> + if (r)
>>>>>>>>>>>> + goto init_failed;
>>>>>>>>>>>> +
>>>>>>>>>>>> /* Don't init kfd if whole hive need to be reset
>>>>>>>>>>>> during init */
>>>>>>>>>>>> if (!adev->gmc.xgmi.pending_reset)
>>>>>>>>>>>> amdgpu_amdkfd_device_init(adev);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> index 45977a72b5dd..fa302540c69a 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> @@ -457,8 +457,6 @@ int
>>>>>>>>>>>> amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>>> atomic_t *sched_score)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct amdgpu_device *adev = ring->adev;
>>>>>>>>>>>> - long timeout;
>>>>>>>>>>>> - int r;
>>>>>>>>>>>> if (!adev)
>>>>>>>>>>>> return -EINVAL;
>>>>>>>>>>>> @@ -478,36 +476,12 @@ int
>>>>>>>>>>>> amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>>> spin_lock_init(&ring->fence_drv.lock);
>>>>>>>>>>>> ring->fence_drv.fences =
>>>>>>>>>>>> kcalloc(num_hw_submission * 2, sizeof(void *),
>>>>>>>>>>>> GFP_KERNEL);
>>>>>>>>>>>> - if (!ring->fence_drv.fences)
>>>>>>>>>>>> - return -ENOMEM;
>>>>>>>>>>>> - /* No need to setup the GPU scheduler for rings
>>>>>>>>>>>> that don't need it */
>>>>>>>>>>>> - if (ring->no_scheduler)
>>>>>>>>>>>> - return 0;
>>>>>>>>>>>> + ring->num_hw_submission = num_hw_submission;
>>>>>>>>>>>> + ring->sched_score = sched_score;
>>>>>>>>>>> Let's move this into the caller and then use
>>>>>>>>>>> ring->num_hw_submission in the fence code as well.
>>>>>>>>>>>
>>>>>>>>>>> The maximum number of jobs on the ring is not really fence
>>>>>>>>>>> specific.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> - switch (ring->funcs->type) {
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>>>> - timeout = adev->gfx_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>>>> - timeout = adev->compute_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>>>> - timeout = adev->sdma_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - default:
>>>>>>>>>>>> - timeout = adev->video_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - }
>>>>>>>>>>>> -
>>>>>>>>>>>> - r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>>>> - num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>>>> - timeout, NULL, sched_score, ring->name);
>>>>>>>>>>>> - if (r) {
>>>>>>>>>>>> - DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>>>>>> - ring->name);
>>>>>>>>>>>> - return r;
>>>>>>>>>>>> - }
>>>>>>>>>>>> + if (!ring->fence_drv.fences)
>>>>>>>>>>>> + return -ENOMEM;
>>>>>>>>>>>> return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> index fae7d185ad0d..7f20ce73a243 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>>>>>>>>>>>> bool has_compute_vm_bug;
>>>>>>>>>>>> bool no_scheduler;
>>>>>>>>>>>> int hw_prio;
>>>>>>>>>>>> + unsigned num_hw_submission;
>>>>>>>>>>>> + atomic_t *sched_score;
>>>>>>>>>>>> };
>>>>>>>>>>>> #define amdgpu_ring_parse_cs(r, p, ib)
>>>>>>>>>>>> ((r)->funcs->parse_cs((p), (ib)))