[PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV

Thu May 22 03:49:44 UTC 2025

On 2025/5/21 20:00, Christian König wrote:
> On 5/21/25 13:55, Zhang, GuoQing (Sam) wrote:
>> On 2025/5/21 16:06, Christian König wrote:
>>> On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>>>>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>>>>>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>>>>> +             */
>>>>>> +            amdgpu_gmc_vram_location(adev, mc, 0);
>>>>> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>>>>>
>>>>> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.
>>>> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
>>>> fail with following error logs.
>>>>
>>>> [  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M
>>>> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
>>>> [  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M
>>>> 0x0000018000000000 - 0x000001801FFFFFFF
>>>> [  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
>>>> [  329.314386] [drm] RAM width 8192bits HBM
>>>> [  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate
>>>> kernel bo
>>>> [  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
>>>> block <gmc_v9_0> failed -22
>>>> [  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
>>>>
>>>>
>>>> It seems like setting mc->visible_vram_size and mc->visible_vram_size
>>>> fields are also needed. In this case call amdgpu_gmc_vram_location() is
>>>> better than inline the logic, I think.
>>> Yeah, exactly that is not a good idea.
>>>
>>> The mc->visible_vram_size and mc->real_vram_size should have been initialized by gmc_v9_0_mc_init(). Why didn't that happen?
>>
>> [Sam] visible_vram_size is set to 0x4000000000 (256G) from
>> `pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`.
>> It is set to real_vram_size 0x2fec000000(192G) in
>> amdgpu_gmc_vram_location().
>>
>> Should I update the 3 variables inline and not call
>> amdgpu_gmc_vram_location()?
>>
>>           mc->vram_start = 0;
>>           mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
>>           if (mc->real_vram_size < mc->visible_vram_size)
>>               mc->visible_vram_size = mc->real_vram_size;
> Yeah that seems to make sense.
>
>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> index 84cde1239ee4..18e80aa78aff 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>>>>           top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>>>>           top <<= 24;
>>>>>>
>>>>>> -    adev->gmc.fb_start = base;
>>>>>> -    adev->gmc.fb_end = top;
>>>>>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +            adev->gmc.fb_start = base;
>>>>>> +            adev->gmc.fb_end = top;
>>>>>> +    }
>>>>> We should probably avoid calling this in the first place.
>>>>>
>>>>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
>>>> mmhub_v1_8_get_fb_location() is called by the new
>>>> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().
>>> Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should only rely on cached data.
>>
>> [Sam] Can I add new `fb_base` field in `struct amdgpu_gmc` to cache the
>> value of `get_fb_location()`?
> No, please try to avoid that.

OK. so "amdgpu_bo_fb_aper_addr() should only rely on cached data." is
not required and I don't need to change current amdgpu_bo_fb_aper_addr()
implementation, right?

>
>> Using this approach, we don't need to set fb_start and fb_end on resume
>> any more, since the reset of the 2 field is caused by
>> mmhub_v1_8_get_fb_location() calls from amdgpu_bo_fb_aper_addr().
>> Please see the code change below.
> What is wrong with setting fb_start and fb_end on resume?

It works. I have updated the patch in this way.

>>>> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
>>>> its name. having such side effect is very surprising.
>>>>
>>>> Another approach is set the right fb_start and fb_end in the new
>>>> amdgpu_virt_resume(), like updating vram_base_offset.
>>> That is probably better. And skip setting fb_start and fb_end in amdgpu_gmc_sysvm_location() for this use case.

setting fb_start and fb_end in amdgpu_gmc_sysvm_location() is needed for
normal KMD load, since amdgpu_virt_resume() is not called on normal KMD
load.

I have sent out v7 patch list. Please take another look. Thank you!

mail titles:
[PATCH v7 0/4] enable xgmi node migration support for hibernate on SRIOV
[PATCH v7 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume
[PATCH v7 2/4] drm/amdgpu: update GPU addresses for SMU and PSP
[PATCH v7 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
[PATCH v7 4/4] drm/amdgpu: fix fence fallback timer expired error

changes:
- remove pdb0_enabled and add gmc_v9_0_is_pdb0_enabled()
- remove amdgpu_gmc_vram_location() call in amdgpu_gmc_sysvm_location()
- remove check in mmhub_v1_8_get_fb_location() and update
fb_start/fb_end on resume

Regards
Sam

>>>
>>> That was done only because we re-program those registers on bare metal.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Which approach do you prefer? Or any better suggestions? Thank you.
>>>>
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>           return base;
>>>>>>       }
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250522/4ecff7ef/attachment-0001.htm>