[PATCH 6/7] drm/amdgpu: Check gmc requirement for reset on init

Lazar, Lijo lijo.lazar at amd.com
Fri Sep 27 04:39:51 UTC 2024



On 9/27/2024 9:26 AM, Xu, Feifei wrote:
>>>I guess you are referring to the below corner case
> 
>>>            1) Place NPS request
> 
>>>         2) Unload Driver
> 
>>>            3) Reinstall driver with a different TOS (possible but quite
> unlikely)
> 
>>>             4) Driver reload
> 
>>>             5) Driver checks TOS version first and goes for a reset
> 
>>>             6) reset_flag of GMC is not set, hence it doesn't refresh
> the NPS range.
> 
>  
> 
>  
> 
>>>I think changing the order in soc15_need_reset_on_init() to check for
> NPS request before TOS version check will solve this.
> 
> Yes, I was thinking of  reset_flag and tOS reloading
> (adev->init_lvl->level set to AMDGPU_INIT_LEVEL_MINIMAL_XGMI) changing
> at the same time. And NPS refresh will be ignored. Though might be
> likely in debugging or regression isolation cases which changing driver
> packaged with different TOS.  And yes making NPS refresh checking before
> TOS version checking will solve this.
> 
> And if we do not return ahead when checking NPS request before tOS
> version change in soc15_need_reset_on_init(), we can drop 
> (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) check in below
> refresh checking:
> 

This check is used when NPS request change is identified. During swinit
part it will be at MINIMAL_XGMI level, but at that point there is no
need to refresh this information as reset is pending. It is refreshed
after a reset when init level returns to default.

Thanks,
Lijo

> +     refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&
> +               (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
> 
> refresh = (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
> 
> Thanks
> Feifei
> 
> On 9/26/2024 5:01 PM, Xu, Feifei wrote:
>>>> +     refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&
>>>> +               (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
>> Is there a corner case that reloading with a different version tos and refreshing nps change co-exist?
>>
>> Thanks,
>> Feifei
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Lijo Lazar
>> Sent: Tuesday, September 24, 2024 1:57 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj at amd.com>; Errabolu, Ramesh <Ramesh.Errabolu at amd.com>
>> Subject: [PATCH 6/7] drm/amdgpu: Check gmc requirement for reset on init
>>
>> Add a callback to check if there is any condition detected by GMC block for reset on init. One case is if a pending NPS change request is detected. If reset is done because of NPS switch, refresh NPS info from discovery table.
>>
>> Signed-off-by: Lijo Lazar <lijo.lazar at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 13 ++++++++++++-  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  5 +++++
>>  drivers/gpu/drm/amd/amdgpu/soc15.c      |  2 ++
>>  3 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 21f1e65c9dc9..011fe3a847d0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -1261,12 +1261,15 @@ int amdgpu_gmc_get_nps_memranges(struct amdgpu_device *adev,
>>         struct amdgpu_gmc_memrange *ranges;
>>         int range_cnt, ret, i, j;
>>         uint32_t nps_type;
>> +       bool refresh;
>>
>>         if (!mem_ranges)
>>                 return -EINVAL;
>>
>> +       refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&
>> +                 (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
>>         ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges,
>> -                                           &range_cnt, false);
>> +                                           &range_cnt, refresh);
>>
>>         if (ret)
>>                 return ret;
>> @@ -1392,3 +1395,11 @@ void amdgpu_gmc_prepare_nps_mode_change(struct amdgpu_device *adev)
>>                         adev->dev,
>>                         "NPS mode change request done, reload driver to complete the change\n");  }
>> +
>> +bool amdgpu_gmc_need_reset_on_init(struct amdgpu_device *adev) {
>> +       if (adev->gmc.gmc_funcs->need_reset_on_init)
>> +               return adev->gmc.gmc_funcs->need_reset_on_init(adev);
>> +
>> +       return false;
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> index b13d6adb5efd..d4cd247fe574 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> @@ -78,6 +78,8 @@ enum amdgpu_memory_partition {
>>          BIT(AMDGPU_NPS3_PARTITION_MODE) | BIT(AMDGPU_NPS4_PARTITION_MODE) | \
>>          BIT(AMDGPU_NPS6_PARTITION_MODE) | BIT(AMDGPU_NPS8_PARTITION_MODE))
>>
>> +#define AMDGPU_GMC_INIT_RESET_NPS  BIT(0)
>> +
>>  /*
>>   * GMC page fault information
>>   */
>> @@ -169,6 +171,7 @@ struct amdgpu_gmc_funcs {
>>         /* Request NPS mode */
>>         int (*request_mem_partition_mode)(struct amdgpu_device *adev,
>>                                           int nps_mode);
>> +       bool (*need_reset_on_init)(struct amdgpu_device *adev);
>>  };
>>
>>  struct amdgpu_xgmi_ras {
>> @@ -314,6 +317,7 @@ struct amdgpu_gmc {
>>         const struct amdgpu_gmc_funcs   *gmc_funcs;
>>         enum amdgpu_memory_partition    requested_nps_mode;
>>         uint32_t supported_nps_modes;
>> +       uint32_t reset_flags;
>>
>>         struct amdgpu_xgmi xgmi;
>>         struct amdgpu_irq_src   ecc_irq;
>> @@ -468,5 +472,6 @@ int amdgpu_gmc_get_nps_memranges(struct amdgpu_device *adev,  int amdgpu_gmc_request_memory_partition(struct amdgpu_device *adev,
>>                                         int nps_mode);
>>  void amdgpu_gmc_prepare_nps_mode_change(struct amdgpu_device *adev);
>> +bool amdgpu_gmc_need_reset_on_init(struct amdgpu_device *adev);
>>
>>  #endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> index 619933f252aa..97ca4931a7ef 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>> @@ -833,6 +833,8 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev)
>>
>>         if (amdgpu_psp_tos_reload_needed(adev))
>>                 return true;
>> +       if (amdgpu_gmc_need_reset_on_init(adev))
>> +               return true;
>>         /* Just return false for soc15 GPUs.  Reset does not seem to
>>          * be necessary.
>>          */
>> --
>> 2.25.1
>>


More information about the amd-gfx mailing list