[Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

Fri Jan 31 00:53:44 UTC 2020

[AMD Official Use Only - Internal Distribution Only]

Hi Felix,

See one inline comment

Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Felix Kuehling
Sent: Thursday, January 30, 2020 6:24 PM
To: Alex Deucher <alexdeucher at gmail.com>
Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj at amd.com>; amd-gfx list <amd-gfx at lists.freedesktop.org>
Subject: Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

On 2020-01-30 17:11, Alex Deucher wrote:
> On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling <felix.kuehling at amd.com> wrote:
>> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
>>> Hello Felix,
>>>
>>> Thanks for your time to review and for your feedback.
>>>
>>> On 1/29/2020 5:52 PM, Felix Kuehling wrote:
>>>> HI Rajneesh,
>>>>
>>>> See comments inline ...
>>>>
>>>> And a general question: Why do you need to set the 
>>>> autosuspend_delay in so many places? Amdgpu only has a single call 
>>>> to this function during initialization.
>>>
>>> We don't. I have called this out in cover letter since i too feel 
>>> its not the best way to do what we want to do. I have seen weird 
>>> issues on dual GPUs with KFDTest that sometimes results in system 
>>> hang and gpu hang etc.
>>>
>>> Even if i try with running kfdtest on just one node in a multi gpu 
>>> system, the baco exit seems to wake up all gpus and then one goes to 
>>> auto suspend again while other performs the desired kfdtest operation.
>>> I have never seen any issue with a single gpu card. So in current 
>>> approch, i am making sure that the GPUs are not quickly 
>>> auto-suspended while the kfd operations are ongoing but once the 
>>> kfdtest is finished, the runtime suspend call with ensure to reset 
>>> the timeout back to 5 seconds.
>> So by setting the timeout to 60s, you can fix applications that run 
>> for less than 60s without calling into KFD. What about applications 
>> that run for longer without calling into KFD? This doesn't sound like 
>> a solution, it's just a hack.
>>
>>
>>>
>>> I would like to avoid this and appreciate some pointers where i can 
>>> put get_sync calls while unbinding. I have tried a number of places 
>>> but saw some issues. So any help will be highly appreciated, to 
>>> identify the proper logical inverse of kfd_bind_process_to_device.
>> kfd_bind_process_to_device is called when KFD starts using a device. 
>> It only stops using that device when the process is destroyed. 
>> Therefore I recommended moving the cleanup code into 
>> kfd_process_destroy_pdds, which iterates over all devices (PDDs) during process termination.
>>
>> Note that kfd_bind_process_to_device can be called many times for the 
>> same device. If you need to keep usage counters balanced, you need 
>> extra checks to make sure you only do runtime-PM stuff the first time 
>> it's called for a particular device in a particular process.
>>
>> Also, in kfd_process_destroy_pdds you may need to check whether 
>> kfd_bind_process_to_device has been called for a particular PDD 
>> before releasing runtime PM.
> runtimepm already has reference counting.  You can use something like
> pm_runtime_get_sync() or pm_runtime_get_noresume() depending on 
> whether you want to make sure the GPU is powered up or not.  That will 
> increase the usage count on the device for runtime pm.  Then when you 
> want to decrement and possibly suspend the device, call
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is 
> basically what we do in amdgpu to handle device usage.

My point is, kfd_bind_process_to_device can be called many times. If we call pm_runtime_sync_get, we'll need to release an unpredictable number of usage counts later. Therefore I suggested ensuring we only do this the first time kfd_bind_process_to_device is called for a particular process-device.

Similarly, in kfd_process_destroy_pdds we should only release the usage count if we actually took it in kfd_bind_process_to_device.

>    We call
> pm_runtime_get_noresume() in our fence_emit function which is what
> gets called when we submit work to the GPU.  That increments the usage
> count for runpm.  Then we call pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() in our fence_process function.  That gets
> called as each job on the GPU is retired.  For kfd, I think you'll
> want to call pm_runtime_get_sync() in at the start of your ioctl to
> wake the device, then pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() at the end of your ioctl to mark to match
> the start.  Then in your actual ioctl work, you'll want to call
> pm_runtime_get_noresume() everytime a queue gets created and
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
> a queue is destroyed.  That should keep the device awake as long as
> there are any queues around.  Due to stuff like xgmi, it's probably
> best to just do this for all GPUs on the system just in case you have
> any p2p stuff going on.

OK, that's a completely different approach from what I had in mind. 
Basically tying device usage to both ioctls and the existence of user 
mode queues. P2P adds extra complication. Not just XGMI but also PCIe 
P2P where GPU A wants to access GPU B's memory but GPU B doesn't have 
any activity to keep it awake. To simplify this, we could also associate 
usage counters with allocating memory. That way any GPU that has memory 
that may be remotely accessed would be kept awake.

Always keeping all GPUs awake seems wasteful. I think being able to 
power down unused GPUs would be a very useful feature.

I still believe my idea of taking a usage counter in 
kfd_bind_process_to_device and dropping it in kfd_process_destroy_pdds 
is simpler and just as effective.

[Oak]: Kfd_bind_process_to_device is called from many ioctl (q creation, memory map/alloc etc). But destroy_pdd is called per process-device pair. So the pm usage counter is not balanced. To balance it, you can make the get/put usage counter either per process or per-pdd. For example, if you want to do it per pdd, you have 2 options:
1. to make sure only get_sync is called only at the first *actual* bind - I think in kfd_process_device_init_vm, before "return 0", is a place logically work.
2. call get sync at the pdd creation: kfd_create_prcess_device_data
In either option, call put at *each* destroy of pdd.

Regards,
   Felix

>
> Alex
>
>> Regards,
>>     Felix
>>
>>
>>>
>>>> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>>>>> So far the kfd driver implemented same routines for runtime and system
>>>>> wide suspend and resume (s2idle or mem). During system wide suspend the
>>>>> kfd aquires an atomic lock that prevents any more user processes to
>>>>> create queues and interact with kfd driver and amd gpu. This mechanism
>>>>> created problem when amdgpu device is runtime suspended with BACO
>>>>> enabled. Any application that relies on kfd driver fails to load
>>>>> because
>>>>> the driver reports a locked kfd device since gpu is runtime suspended.
>>>>>
>>>>> However, in an ideal case, when gpu is runtime  suspended the kfd
>>>>> driver
>>>>> should be able to:
>>>>>
>>>>>    - auto resume amdgpu driver whenever a client requests compute
>>>>> service
>>>>>    - prevent runtime suspend for amdgpu  while kfd is in use
>>>>>
>>>>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>>>>> runtime runtime power management.
>>>>>
>>>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31
>>>>> +++++++++++++---------
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>>>>>    6 files changed, 51 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 8609287620ea..314c4a2a0354 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct
>>>>> amdgpu_device *adev,
>>>>>            kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>    }
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        if (adev->kfd.dev)
>>>>> -        kgd2kfd_suspend(adev->kfd.dev);
>>>>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>>>>    }
>>>>>    -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        int r = 0;
>>>>>          if (adev->kfd.dev)
>>>>> -        r = kgd2kfd_resume(adev->kfd.dev);
>>>>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>>>>          return r;
>>>>>    }
>>>>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>>>>    {
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        return 0;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index 0fa898d30207..3dce4a51e522 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>>>>>    int amdgpu_amdkfd_init(void);
>>>>>    void amdgpu_amdkfd_fini(void);
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>>>>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>    void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>                const void *ih_ring_entry);
>>>>>    void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>                 struct drm_device *ddev,
>>>>>                 const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>>>>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>    int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>    int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>    void kgd2kfd_interrupt(struct kfd_dev *kfd, const void
>>>>> *ih_ring_entry);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index d9e5eac182d3..787d49e9f4de 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>            }
>>>>>        }
>>>>>    -    amdgpu_amdkfd_suspend(adev);
>>>>> +    amdgpu_amdkfd_suspend(adev, fbcon);
>>>> The logic seems inverted here. There are four different pmops
>>>> callbacks that use different values for this parameter:
>>>>
>>>> * amdgpu_pmops_suspend: true
>>>> * amdgpu_pmops_freeze: true
>>>> * amdgpu_pmops_poweroff: true
>>>> * amdgpu_pmops_runtime_suspend: false
>>>>
>>>> It looks like runtime_suspend uses false, so you should pass the
>>>> run_pm parameter as !fbcon.
>>> Ok. will fix in v2.
>>>
>>>
>>>>
>>>>>          amdgpu_ras_suspend(adev);
>>>>>    @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>                }
>>>>>            }
>>>>>        }
>>>>> -    r = amdgpu_amdkfd_resume(adev);
>>>>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
>>>>>        if (r)
>>>>>            return r;
>>>>>    diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 2a9e40131735..e9f00c3a067a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -23,6 +23,7 @@
>>>>>    #include <linux/bsearch.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_device_queue_manager.h"
>>>>>    #include "kfd_pm4_headers_vi.h"
>>>>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>>>    {
>>>>>        if (kfd->init_complete) {
>>>>> -        kgd2kfd_suspend(kfd);
>>>>> +        kgd2kfd_suspend(kfd, true);
>>>>>            device_queue_manager_uninit(kfd->dqm);
>>>>>            kfd_interrupt_exit(kfd);
>>>>>            kfd_topology_remove_device(kfd);
>>>>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>          kfd->dqm->ops.pre_reset(kfd->dqm);
>>>>>    -    kgd2kfd_suspend(kfd);
>>>>> +    kgd2kfd_suspend(kfd, true);
>>>> This should use false. This is not runtime PM.
>>>>
>>>>
>>>>>          kfd_signal_reset_event(kfd);
>>>>>        return 0;
>>>>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>>>>>        return  (atomic_read(&kfd_locked) > 0);
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        if (!kfd->init_complete)
>>>>>            return;
>>>>>    -    /* For first KFD device suspend all the KFD processes */
>>>>> -    if (atomic_inc_return(&kfd_locked) == 1)
>>>>> -        kfd_suspend_all_processes();
>>>>> +    /* for runtime suspend, skip locking kfd */
>>>>> +    if (run_pm) {
>>>> This should be done for the non-runtime PM case: if (!run_pm).
>>>>
>>>>
>>>>> +        /* For first KFD device suspend all the KFD processes */
>>>>> +        if (atomic_inc_return(&kfd_locked) == 1)
>>>>> +            kfd_suspend_all_processes();
>>>>> +    }
>>>>>    +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>>>>>        kfd->dqm->ops.stop(kfd->dqm);
>>>>> -
>>>>>        kfd_iommu_suspend(kfd);
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        int ret, count;
>>>>>    @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>>        if (ret)
>>>>>            return ret;
>>>>>    -    count = atomic_dec_return(&kfd_locked);
>>>>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> -    if (count == 0)
>>>>> -        ret = kfd_resume_all_processes();
>>>>> +    /* for runtime resume, skip unlocking kfd */
>>>>> +    if (run_pm) {
>>>> Same as above.
>>>>
>>>>
>>>>> +        count = atomic_dec_return(&kfd_locked);
>>>>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> +        if (count == 0)
>>>>> +            ret = kfd_resume_all_processes();
>>>>> +    }
>>>>>          return ret;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> index 8d871514671e..6301d77ed3d6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> @@ -25,6 +25,7 @@
>>>>>    #include <linux/slab.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/amd-iommu.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_dbgmgr.h"
>>>>>    #include "kfd_topology.h"
>>>>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct
>>>>> kfd_process *p)
>>>>>        struct kfd_process_device *pdd;
>>>>>          list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>>>>> -        if (pdd->bound == PDD_BOUND)
>>>>> +        if (pdd->bound == PDD_BOUND) {
>>>>>                amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
>>>>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
>>>> I don't think this is the right place. This code should only run on
>>>> APUs with IOMMUv2 enabled. Probably missing a check at the start of
>>>> the function. On kernels with IOMMU disabled, this source file is not
>>>> compiled at all.
>>>>
>>>> I think this code should go into kfd_process_destroy_pdds.
>>>
>>> Ok. will fix in v2. So can we consider this as logical inverse of
>>> bind_process_to_device?
>>>
>>>
>>>>
>>>>> +        }
>>>>>    }
>>>>>      /* Callback for process shutdown invoked by the IOMMU driver */
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 25b90f70aecd..d19d5e97405c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -31,6 +31,7 @@
>>>>>    #include <linux/compat.h>
>>>>>    #include <linux/mman.h>
>>>>>    #include <linux/file.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "amdgpu_amdkfd.h"
>>>>>    #include "amdgpu.h"
>>>>>    @@ -843,15 +844,27 @@ struct kfd_process_device
>>>>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>>>>            return ERR_PTR(-ENOMEM);
>>>>>        }
>>>>>    +    err = pm_runtime_get_sync(dev->ddev->dev);
>>>>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
>>>> Why are you using a different hard-coded delay (60s?) here?
>>>
>>> It needs to be fixed. Explained above the reason for such deviation.
>>>
>>>
>>>> Regards,
>>>>    Felix
>>>>
>>> Thanks
>>> Rajneesh
>>>
>>>
>>>>> +    if (err < 0)
>>>>> +        return ERR_PTR(err);
>>>>> +
>>>>>        err = kfd_iommu_bind_process_to_device(pdd);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>          err = kfd_process_device_init_vm(pdd, NULL);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>    -    return pdd;
>>>>> +out:
>>>>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>>>>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>>>>> +
>>>>> +    if (!err)
>>>>> +        return pdd;
>>>>> +    else
>>>>> +        return ERR_PTR(err);
>>>>>    }
>>>>>      struct kfd_process_device *kfd_get_first_process_device_data(
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Coak.zeng%40amd.com%7C64abe683b4ad4902cf1208d7a5db8dba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637160234717876492&sdata=0TO3FbOJc4xJwSSE9%2B1Wk6VJxg3GsB4gfYh0tTq0n%2FY%3D&reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Coak.zeng%40amd.com%7C64abe683b4ad4902cf1208d7a5db8dba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637160234717876492&sdata=0TO3FbOJc4xJwSSE9%2B1Wk6VJxg3GsB4gfYh0tTq0n%2FY%3D&reserved=0