[RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Thu Mar 3 16:36:50 UTC 2022
I pushed all the changes including your patch.
Andrey
On 2022-03-02 22:16, Andrey Grodzovsky wrote:
> OK, I will do a quick smoke test tomorrow and push all of it then.
>
> Andrey
>
> On 2022-03-02 21:59, Chen, JingWen wrote:
>> Hi Andrey,
>>
>> I don't have the bare metal environment, I can only test the SRIOV
>> cases.
>>
>> Best Regards,
>> JingWen Chen
>>
>>
>>
>>> On Mar 3, 2022, at 01:55, Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>
>>> The patch is acked-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>
>>> If you also smoke tested on bare metal, feel free to apply all the
>>> patches; if not, let me know.
>>>
>>> Andrey
>>>
>>> On 2022-03-02 04:51, JingWen Chen wrote:
>>>> Hi Andrey,
>>>>
>>>> Most of the patches are OK, but the code will introduce an IB
>>>> test failure on the disabled VCN of sienna_cichlid.
>>>>
>>>> In the SRIOV use case we disable one VCN on sienna_cichlid. I have
>>>> attached a patch to fix this issue; please check the attachment.
>>>>
>>>> Best Regards,
>>>>
>>>> Jingwen Chen
>>>>
>>>>
>>>> On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
>>>>> Hey, patches attached - I applied the patches and resolved the
>>>>> merge conflicts but wasn't able to test, as my board's onboard
>>>>> network card doesn't work with the 5.16 kernel (it does with 5.17;
>>>>> maybe it's a Kconfig issue and I need to check more).
>>>>> The patches are on top of the 'cababde192b2 Yifan Zhang 2 days
>>>>> ago drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.
>>>>>
>>>>> Please test and let me know. Maybe by Monday I will be able to
>>>>> resolve the connectivity issue on 5.16.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-02-24 22:13, JingWen Chen wrote:
>>>>>> Hi Andrey,
>>>>>>
>>>>>> Sorry for the confusion, I meant the whole patch series. We are
>>>>>> depending on this patch series to fix the concurrency issue
>>>>>> within the SRIOV TDR sequence.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>>>>>> No problem if so but before I do,
>>>>>>>
>>>>>>>
>>>>>>> JingWen - why do you think this patch is needed as a standalone
>>>>>>> now? It has no use without the entire feature together with it.
>>>>>>> Are there some changes you want to do on top of that code?
>>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>>>>>> [Public]
>>>>>>>>
>>>>>>>>
>>>>>>>> If it applies cleanly, feel free to drop it in. I'll drop those
>>>>>>>> patches for drm-next since they are already in drm-misc.
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>> behalf of Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>>>>>> *To:* Chen, JingWen <JingWen.Chen2 at amd.com>; Christian König
>>>>>>>> <ckoenig.leichtzumerken at gmail.com>;
>>>>>>>> dri-devel at lists.freedesktop.org
>>>>>>>> <dri-devel at lists.freedesktop.org>;
>>>>>>>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>>>>>>> *Cc:* Liu, Monk <Monk.Liu at amd.com>; Chen, Horace
>>>>>>>> <Horace.Chen at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>;
>>>>>>>> Koenig, Christian <Christian.Koenig at amd.com>; daniel at ffwll.ch
>>>>>>>> <daniel at ffwll.ch>
>>>>>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init
>>>>>>>> to after XGMI is ready
>>>>>>>> No, because the whole patch set, including this patch, landed in
>>>>>>>> drm-misc-next and will reach amd-staging-drm-next on the next
>>>>>>>> upstream rebase, I guess.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>>>>>> Hi Andrey,
>>>>>>>>>
>>>>>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>>>>>
>>>>>>>>> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>>>>>> All comments are fixed and code pushed. Thanks for everyone
>>>>>>>>>> who helped reviewing.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>>>>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>>>>>>> domain we are in - for a single device there is a single
>>>>>>>>>>>> domain per device, and so a single wq per device. For XGMI
>>>>>>>>>>>> the reset domain spans the entire XGMI hive, and so the
>>>>>>>>>>>> reset wq is per hive.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>>>> One more comment below, with that fixed Reviewed-by:
>>>>>>>>>>> Christian König <christian.koenig at amd.com>.
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45
>>>>>>>>>>>> ++++++++++++++++++++++
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 34
>>>>>>>>>>>> ++--------------
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +
>>>>>>>>>>>> 3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> @@ -2287,6 +2287,47 @@ static int
>>>>>>>>>>>> amdgpu_device_fw_loading(struct amdgpu_device *adev)
>>>>>>>>>>>> return r;
>>>>>>>>>>>> }
>>>>>>>>>>>> +static int amdgpu_device_init_schedulers(struct
>>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + long timeout;
>>>>>>>>>>>> + int r, i;
>>>>>>>>>>>> +
>>>>>>>>>>>> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>>>>> + struct amdgpu_ring *ring = adev->rings[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* No need to setup the GPU scheduler for rings
>>>>>>>>>>>> that don't need it */
>>>>>>>>>>>> + if (!ring || ring->no_scheduler)
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> +
>>>>>>>>>>>> + switch (ring->funcs->type) {
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>>>> + timeout = adev->gfx_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>>>> + timeout = adev->compute_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>>>> + timeout = adev->sdma_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + default:
>>>>>>>>>>>> + timeout = adev->video_timeout;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>>>> + ring->num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>>>> + timeout, adev->reset_domain.wq,
>>>>>>>>>>>> ring->sched_score, ring->name);
>>>>>>>>>>>> + if (r) {
>>>>>>>>>>>> + DRM_ERROR("Failed to create scheduler on ring
>>>>>>>>>>>> %s.\n",
>>>>>>>>>>>> + ring->name);
>>>>>>>>>>>> + return r;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + return 0;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +
>>>>>>>>>>>> /**
>>>>>>>>>>>> * amdgpu_device_ip_init - run init for hardware IPs
>>>>>>>>>>>> *
>>>>>>>>>>>> @@ -2419,6 +2460,10 @@ static int
>>>>>>>>>>>> amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> + r = amdgpu_device_init_schedulers(adev);
>>>>>>>>>>>> + if (r)
>>>>>>>>>>>> + goto init_failed;
>>>>>>>>>>>> +
>>>>>>>>>>>> /* Don't init kfd if whole hive need to be reset
>>>>>>>>>>>> during init */
>>>>>>>>>>>> if (!adev->gmc.xgmi.pending_reset)
>>>>>>>>>>>> amdgpu_amdkfd_device_init(adev);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> index 45977a72b5dd..fa302540c69a 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>>>> @@ -457,8 +457,6 @@ int
>>>>>>>>>>>> amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>>> atomic_t *sched_score)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct amdgpu_device *adev = ring->adev;
>>>>>>>>>>>> - long timeout;
>>>>>>>>>>>> - int r;
>>>>>>>>>>>> if (!adev)
>>>>>>>>>>>> return -EINVAL;
>>>>>>>>>>>> @@ -478,36 +476,12 @@ int
>>>>>>>>>>>> amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>>> spin_lock_init(&ring->fence_drv.lock);
>>>>>>>>>>>> ring->fence_drv.fences =
>>>>>>>>>>>> kcalloc(num_hw_submission * 2, sizeof(void *),
>>>>>>>>>>>> GFP_KERNEL);
>>>>>>>>>>>> - if (!ring->fence_drv.fences)
>>>>>>>>>>>> - return -ENOMEM;
>>>>>>>>>>>> - /* No need to setup the GPU scheduler for rings
>>>>>>>>>>>> that don't need it */
>>>>>>>>>>>> - if (ring->no_scheduler)
>>>>>>>>>>>> - return 0;
>>>>>>>>>>>> + ring->num_hw_submission = num_hw_submission;
>>>>>>>>>>>> + ring->sched_score = sched_score;
>>>>>>>>>>> Let's move this into the caller and then use
>>>>>>>>>>> ring->num_hw_submission in the fence code as well.
>>>>>>>>>>>
>>>>>>>>>>> The maximum number of jobs on the ring is not really fence
>>>>>>>>>>> specific.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> - switch (ring->funcs->type) {
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>>>> - timeout = adev->gfx_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>>>> - timeout = adev->compute_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>>>> - timeout = adev->sdma_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - default:
>>>>>>>>>>>> - timeout = adev->video_timeout;
>>>>>>>>>>>> - break;
>>>>>>>>>>>> - }
>>>>>>>>>>>> -
>>>>>>>>>>>> - r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>>>> - num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>>>> - timeout, NULL, sched_score, ring->name);
>>>>>>>>>>>> - if (r) {
>>>>>>>>>>>> - DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>>>>>> - ring->name);
>>>>>>>>>>>> - return r;
>>>>>>>>>>>> - }
>>>>>>>>>>>> + if (!ring->fence_drv.fences)
>>>>>>>>>>>> + return -ENOMEM;
>>>>>>>>>>>> return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> index fae7d185ad0d..7f20ce73a243 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>>>>>>>>>>>> bool has_compute_vm_bug;
>>>>>>>>>>>> bool no_scheduler;
>>>>>>>>>>>> int hw_prio;
>>>>>>>>>>>> + unsigned num_hw_submission;
>>>>>>>>>>>> + atomic_t *sched_score;
>>>>>>>>>>>> };
>>>>>>>>>>>> #define amdgpu_ring_parse_cs(r, p, ib)
>>>>>>>>>>>> ((r)->funcs->parse_cs((p), (ib)))