[PATCH v3 5/5] drm/amdgpu: switch workload context to/from compute

Lazar, Lijo lijo.lazar at amd.com
Thu Sep 29 13:37:32 UTC 2022



On 9/29/2022 6:50 PM, Sharma, Shashank wrote:
> 
> 
> On 9/29/2022 1:10 PM, Lazar, Lijo wrote:
>>
>>
>> On 9/29/2022 2:18 PM, Sharma, Shashank wrote:
>>>
>>>
>>> On 9/28/2022 11:51 PM, Alex Deucher wrote:
>>>> On Wed, Sep 28, 2022 at 4:57 AM Sharma, Shashank
>>>> <shashank.sharma at amd.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 9/27/2022 10:40 PM, Alex Deucher wrote:
>>>>>> On Tue, Sep 27, 2022 at 11:38 AM Sharma, Shashank
>>>>>> <shashank.sharma at amd.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/27/2022 5:23 PM, Felix Kuehling wrote:
>>>>>>>> On 2022-09-27 10:58, Sharma, Shashank wrote:
>>>>>>>>> Hello Felix,
>>>>>>>>>
>>>>>>>>> Thanks for the review comments.
>>>>>>>>>
>>>>>>>>> On 9/27/2022 4:48 PM, Felix Kuehling wrote:
>>>>>>>>>> On 2022-09-27 02:12, Christian König wrote:
>>>>>>>>>>> On 26.09.22 23:40, Shashank Sharma wrote:
>>>>>>>>>>>> This patch switches the GPU workload mode to/from
>>>>>>>>>>>> compute mode while submitting a compute workload.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>>>>>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>>>>>>>>>>
>>>>>>>>>>> Feel free to add my acked-by, but Felix should probably take 
>>>>>>>>>>> a look
>>>>>>>>>>> as well.
>>>>>>>>>>
>>>>>>>>>> This looks OK purely from a compute perspective. But I'm concerned
>>>>>>>>>> about the interaction of compute with graphics or multiple 
>>>>>>>>>> graphics
>>>>>>>>>> contexts submitting work concurrently. They would constantly 
>>>>>>>>>> override
>>>>>>>>>> or disable each other's workload hints.
>>>>>>>>>>
>>>>>>>>>> For example, you have an amdgpu_ctx with
>>>>>>>>>> AMDGPU_CTX_WORKLOAD_HINT_COMPUTE (maybe Vulkan compute) and a KFD
>>>>>>>>>> process that also wants the compute profile. Those could be 
>>>>>>>>>> different
>>>>>>>>>> processes belonging to different users. Say, KFD enables the 
>>>>>>>>>> compute
>>>>>>>>>> profile first. Then the graphics context submits a job. At the 
>>>>>>>>>> start
>>>>>>>>>> of the job, the compute profile is enabled. That's a no-op 
>>>>>>>>>> because
>>>>>>>>>> KFD already enabled the compute profile. When the job 
>>>>>>>>>> finishes, it
>>>>>>>>>> disables the compute profile for everyone, including KFD. That's
>>>>>>>>>> unexpected.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In this case, it will not disable the compute profile, as the
>>>>>>>>> reference counter will not be zero. The reset_profile() will 
>>>>>>>>> only act
>>>>>>>>> if the reference counter is 0.
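>>>>>>>>>
>>>>>>>>> The logic is roughly like this (simplified sketch, not the 
>>>>>>>>> exact code from the series; workload_refcount is a stand-in 
>>>>>>>>> name):
>>>>>>>>>
>>>>>>>>> /* enable the profile only for the first requester */
>>>>>>>>> if (atomic_inc_return(&workload_refcount) == 1)
>>>>>>>>>     amdgpu_dpm_switch_power_profile(adev, profile, true);
>>>>>>>>>
>>>>>>>>> /* reset only when the last requester drops its reference */
>>>>>>>>> if (atomic_dec_return(&workload_refcount) == 0)
>>>>>>>>>     amdgpu_dpm_switch_power_profile(adev, profile, false);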
>>>>>>>>
>>>>>>>> OK, I missed the reference counter.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> But I would be happy to get any input on a policy which is 
>>>>>>>>> more sustainable and gives better results, for example:
>>>>>>>>> - should we disallow a profile change if a PP mode is already 
>>>>>>>>> applied, i.e. keep it on an early-bird basis?
>>>>>>>>>
>>>>>>>>> For example: Policy A
>>>>>>>>> - Job A sets the profile to compute
>>>>>>>>> - Job B tries to set the profile to 3D, but we do not allow 
>>>>>>>>> it as job A has not finished yet.
>>>>>>>>>
>>>>>>>>> Or Policy B: Current one
>>>>>>>>> - Job A sets the profile to compute
>>>>>>>>> - Job B tries to set profile to 3D, and we allow it. Job A also 
>>>>>>>>> runs
>>>>>>>>> in PP 3D
>>>>>>>>> - Job B finishes, but does not reset PP as reference count is 
>>>>>>>>> not zero
>>>>>>>>> due to compute
>>>>>>>>> - Job A finishes, profile reset to NONE
>>>>>>>>
>>>>>>>> I think this won't work. As I understand it, the
>>>>>>>> amdgpu_dpm_switch_power_profile enables and disables individual
>>>>>>>> profiles. Disabling the 3D profile doesn't disable the compute 
>>>>>>>> profile
>>>>>>>> at the same time. I think you'll need one refcount per profile.
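>>>>>>>>
>>>>>>>> Something along these lines (hypothetical sketch; 
>>>>>>>> profile_refcount is a made-up name):
>>>>>>>>
>>>>>>>> /* one refcount per profile instead of a single shared one */
>>>>>>>> static atomic_t profile_refcount[PP_SMC_POWER_PROFILE_COUNT];
>>>>>>>>
>>>>>>>> /* disabling 3D then only touches the 3D refcount; compute
>>>>>>>>  * stays enabled as long as its own refcount is non-zero */
>>>>>>>> if (atomic_dec_return(&profile_refcount[profile]) == 0)
>>>>>>>>     amdgpu_dpm_switch_power_profile(adev, profile, false);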
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>      Felix
>>>>>>>
>>>>>>> Thanks, this is exactly what I was looking for. I think Alex's 
>>>>>>> initial idea was along these lines, but I was under the 
>>>>>>> assumption that there is only one HW profile in the SMU which 
>>>>>>> keeps getting overwritten. This can solve our problems, as I 
>>>>>>> can create an array of reference counters, and will disable 
>>>>>>> only the profile whose reference counter goes to 0.
>>>>>>
>>>>>> It's been a while since I paged any of this code into my head, but I
>>>>>> believe the actual workload message in the SMU is a mask where you 
>>>>>> can
>>>>>> specify multiple workload types at the same time and the SMU will
>>>>>> arbitrate between them internally.  E.g., the most aggressive one 
>>>>>> will
>>>>>> be selected out of the ones specified.  I think in the driver we just
>>>>>> set one bit at a time using the current interface.  It might be 
>>>>>> better
>>>>>> to change the interface and just ref count the hint types and then
>>>>>> when we call the set function look at the ref counts for each hint
>>>>>> type and set the mask as appropriate.
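>>>>>>
>>>>>> Roughly something like this (hypothetical sketch; hint_refcount 
>>>>>> and workload_bit are made-up names, the SMU message is real):
>>>>>>
>>>>>> u32 mask = 0;
>>>>>> int i;
>>>>>>
>>>>>> /* build the workload mask from the per-hint ref counts */
>>>>>> for (i = 0; i < PP_SMC_POWER_PROFILE_COUNT; i++) {
>>>>>>     if (atomic_read(&hint_refcount[i]) > 0)
>>>>>>         mask |= 1 << workload_bit[i];
>>>>>> }
>>>>>>
>>>>>> /* send the whole mask and let the SMU arbitrate internally */
>>>>>> smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask,
>>>>>>                                 mask, NULL);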
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>
>>>>> Hey Alex,
>>>>> Thanks for your comment. If that is the case, this current patch 
>>>>> series works as-is, and no changes would be required. Please let 
>>>>> me know if my understanding is correct:
>>>>>
>>>>> Assumption: Order of aggression: 3D > Media > Compute
>>>>>
>>>>> - Job 1: Requests mode compute: PP changed to compute, ref count 1
>>>>> - Job 2: Requests mode media: PP changed to media, ref count 2
>>>>> - Job 3: Requests mode 3D: PP changed to 3D, ref count 3
>>>>> - Job 1 finishes, downs ref count to 2, doesn't reset the PP as 
>>>>>   ref > 0, PP still 3D
>>>>> - Job 3 finishes, downs ref count to 1, doesn't reset the PP as 
>>>>>   ref > 0, PP still 3D
>>>>> - Job 2 finishes, downs ref count to 0, PP changed to NONE
>>>>>
>>>>> In this way, every job will be operating in the power profile of 
>>>>> the desired aggression level or higher, and this API guarantees 
>>>>> execution in at least the desired power profile.
>>>>
>>>> I'm not entirely sure of the relative levels of aggression, but I
>>>> believe the SMU prioritizes them by index.  E.g.
>>>> #define WORKLOAD_PPLIB_DEFAULT_BIT        0
>>>> #define WORKLOAD_PPLIB_FULL_SCREEN_3D_BIT 1
>>>> #define WORKLOAD_PPLIB_POWER_SAVING_BIT   2
>>>> #define WORKLOAD_PPLIB_VIDEO_BIT          3
>>>> #define WORKLOAD_PPLIB_VR_BIT             4
>>>> #define WORKLOAD_PPLIB_COMPUTE_BIT        5
>>>> #define WORKLOAD_PPLIB_CUSTOM_BIT         6
>>>>
>>>> 3D < video < VR < compute < custom
>>>>
>>>> VR and compute are the most aggressive.  Custom takes preference
>>>> because it's user customizable.
>>>>
>>>> Alex
>>>>
>>>
>>> Thanks, so this UAPI will guarantee the execution of the job in at 
>>> least the requested power profile, or a more aggressive one.
>>>
>>
>> Hi Shashank,
>>
>> This is not how the API works in the driver PM subsystem. In the 
>> final interface with PMFW, the driver sets only one profile bit and 
>> doesn't set any mask. So it doesn't work the way Felix explained. If 
>> there is more than one profile bit set, PMFW looks at the mask and 
>> picks the one with the highest priority. Note that for each update 
>> of the workload mask, PMFW should get a message.
>>
>> The driver currently sets only one bit, as Alex explained earlier. 
>> For our current driver implementation, you can check this as an 
>> example -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c#L1753 
> 
> If you see my last reply, we have been very clear on this point since 
> Alex's last message. And also, as PMFW is already picking the one 
> with the highest priority, we don't have to worry about blocking 
> profile change calls from different contexts. In this way, every job 
> will be executed in at least the requested power profile, or a more 
> aggressive one.
> 
> Current power profile P0.
> Job1 comes, requests power profile P2 => PMFW changes profile to P2.
> Job2 comes, requests power profile P3 => if (P3 > P2): profile 
> changes to P3, else it stays at P2. So Job2 will still execute at P2, 
> which is more aggressive than P3.
> 

To be clear on your understanding -

Nothing is automatic in PMFW. PMFW picks a priority based on the actual 
mask sent by the driver.

Assuming lower bits correspond to higher priority -

If the driver sends a mask with Bit3 and Bit0 set, PMFW will choose the 
profile that corresponds to Bit0. If the driver sends a mask with Bit4 
and Bit2 set and the rest unset, PMFW will choose the profile that 
corresponds to Bit2. However, if the driver sends a mask with only a 
single bit set, it chooses that profile regardless of whatever the 
previous profile was. It doesn't check if the existing profile > the 
newly requested one. That is the behavior.
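
Illustratively, under that assumption (this is just the selection rule 
stated above, not actual PMFW code):

    /* lower bit index = higher priority; the lowest set bit wins */
    chosen_bit = ffs(workload_mask) - 1;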

So if a job chooses a profile that corresponds to Bit0, the driver will 
send that. Next time, if another job chooses a profile that corresponds 
to Bit1, PMFW will receive that as the new profile and switch to it. It 
trusts the driver to send the proper workload mask.
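
For reference, the current single-bit path in 
sienna_cichlid_set_power_profile_mode() looks roughly like this 
(simplified from the sienna_cichlid example linked earlier):

    /* map the generic profile mode to the ASIC-specific workload bit */
    workload_type = smu_cmn_to_asic_specific_index(smu,
                        CMN2ASIC_MAPPING_WORKLOAD,
                        smu->power_profile_mode);

    /* exactly one bit is set, so PMFW has nothing to arbitrate */
    smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask,
                                    1 << workload_type, NULL);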

Hope that gives the picture.

Thanks,
Lijo

> So we don't have to block the PP change request at all.
> 
>>
>> Also, the PM layer already stores the current workload profile for a 
>> *get* API (which also means a new PM workload variable is not 
>> needed). But that API works only as long as the driver sets a single 
>> profile bit; that way the driver is sure of the current profile 
>> mode -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c#L1628 
>>
> 
> Yes, I had seen this API and its high-level wrapper while exploring 
> the code, and found that it was written to support sysfs-based reads 
> and writes, and is not useful for a context-based scenario.
> 
>>
>> When there is more than one, the driver is not sure of the internal 
>> priority of PMFW, though we can follow the bit order which Alex 
>> suggested (but sometimes the FW carries some workarounds inside, 
>> which means it doesn't necessarily follow the same order).
>>
>> There is an existing interface through sysfs which allows changing 
>> the profile mode and adding custom settings.
> 
> Same as above, this sysfs interface is very basic and good for 
> validating power profile changes, but not for job-level PP changes.
> 
>> In summary, any handling of the change from a single bit to a mask 
>> needs to be done at the lower layer.
>>
> 
> I still don't understand how this series handles and changes this 
> mask. This part is still being done in the 
> amdgpu_dpm_switch_power_profile() function, which is a dpm function 
> only. Code in this series is just calling/consuming this function 
> from the scheduler.
> 
>> The problem is that this behavior has been there throughout all 
>> legacy ASICs. Not sure how much effort it takes and what all needs 
>> to be modified.
>>
> 
> As mentioned above, we are just consuming the 
> amdgpu_dpm_switch_power_profile() function. So if this function is 
> valid for all these ASICs, I think this wrapper will also be fine.
> 
> - Shashank
> 
>> Thanks,
>> Lijo
>>
>>> I will do the one change required and send the updated one.
>>>
>>> - Shashank
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> - Shashank
>>>>>
>>>>>>
>>>>>>>
>>>>>>> - Shashank
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Or anything else?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Shashank
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Or you have multiple VCN contexts. When context1 finishes a 
>>>>>>>>>> job, it
>>>>>>>>>> disables the VIDEO profile. But context2 still has a job on 
>>>>>>>>>> the other
>>>>>>>>>> VCN engine and wants the VIDEO profile to still be enabled.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>      Felix
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 14 +++++++++++---
>>>>>>>>>>>>     1 file changed, 11 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>> index 5e53a5293935..1caed319a448 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>> @@ -34,6 +34,7 @@
>>>>>>>>>>>>     #include "amdgpu_ras.h"
>>>>>>>>>>>>     #include "amdgpu_umc.h"
>>>>>>>>>>>>     #include "amdgpu_reset.h"
>>>>>>>>>>>> +#include "amdgpu_ctx_workload.h"
>>>>>>>>>>>>
>>>>>>>>>>>>     /* Total memory size in system memory and all GPU VRAM. Used to
>>>>>>>>>>>>      * estimate worst case amount of memory to reserve for page tables
>>>>>>>>>>>> @@ -703,9 +704,16 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>>>>>>>>>>>>
>>>>>>>>>>>>     void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
>>>>>>>>>>>>     {
>>>>>>>>>>>> -    amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>>>> -                    PP_SMC_POWER_PROFILE_COMPUTE,
>>>>>>>>>>>> -                    !idle);
>>>>>>>>>>>> +    int ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if (idle)
>>>>>>>>>>>> +        ret = amdgpu_clear_workload_profile(adev, AMDGPU_CTX_WORKLOAD_HINT_COMPUTE);
>>>>>>>>>>>> +    else
>>>>>>>>>>>> +        ret = amdgpu_set_workload_profile(adev, AMDGPU_CTX_WORKLOAD_HINT_COMPUTE);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if (ret)
>>>>>>>>>>>> +        drm_warn(&adev->ddev, "Failed to %s power profile to compute mode\n",
>>>>>>>>>>>> +             idle ? "reset" : "set");
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
>>>>>>>>>>>

