[PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

Felix Kuehling felix.kuehling at amd.com
Wed Mar 8 16:17:27 UTC 2023


On 2023-03-08 04:07, Christian König wrote:
> Am 07.03.23 um 16:28 schrieb Belanger, David:
>> [AMD Official Use Only - General]
>>
>>
>> The test case is a python program that will load the driver, do some 
>> operations, then unload the driver.
>
> What do you mean with unloading the driver? Removing the module? Or 
> destroying the device?
>
>> When the driver exists, there is still the python process space 
>> around holding on the address space.
>> When the python process space exits, the mmu_notifier gets called but 
>> the driver has already been unloaded.
>>
>> The goal of the fix is to address case where there could be 
>> outstanding address space / worker threads for process
>> cleanup that needs to be cleared/completed at exit time.
>
> Yeah and when the module is unloaded this is a completely futile effort.
>
> The general upstream approach is to take references on the struct 
> device and module and prevent unloading as long as those references 
> exists.

That's not how it always works. In case of RCU callbacks, the documented 
strategy is to use rcu_barrier in the module exit function to ensure the 
grace period and all callbacks have completed 
(https://www.kernel.org/doc/html/latest/RCU/rcubarrier.html). 
mmu_notifier_synchronize is meant to do something similar for pending 
mmu_notifier_put work 
(https://elixir.bootlin.com/linux/v6.2.2/source/mm/mmu_notifier.c#L1116).

But this implies that we need to call mmu_notifier_put for all the MMU 
notifiers registered by the module first. I think closing /dev/kfd drops 
the module reference count, but the MMU notifiers we register for 
process cleanup persist until the address space is destroyed. We need to 
trigger that cleanup for any processes that still exist in that state 
when the module is unloaded. Or we need to find a way to increment the 
module refcount for every process that registers a KFD cleanup MMU notifier.

Regards,
   Felix


>
>
> The device might be non-functional any more (because for example of 
> hot plug), but the driver should never be unloaded before the python 
> program exits.
>
> Regards,
> Christian.
>
>>
>> Regards,
>> David B.
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Tuesday, March 7, 2023 2:05 AM
>>> To: Belanger, David <David.Belanger at amd.com>; amd-
>>> gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
>>> exit.
>>>
>>> Am 06.03.23 um 22:58 schrieb David Belanger:
>>>> Handle case when module is unloaded (kfd_exit) before a process space
>>>> (mm_struct) is released.
>>> Well that should never ever happen in the first place. It sounds 
>>> like we are
>>> missing grabbing module references.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: David Belanger <david.belanger at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57
>>> ++++++++++++++++++++++++
>>>>    2 files changed, 61 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
>>>> index 09b966dc3768..8ef4bd9e4f7d 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
>>>> @@ -26,6 +26,9 @@
>>>>    #include "kfd_priv.h"
>>>>    #include "amdgpu_amdkfd.h"
>>>>
>>>> +void kfd_cleanup_processes(void);
>>>> +
>>>> +
>>>>    static int kfd_init(void)
>>>>    {
>>>>        int err;
>>>> @@ -77,6 +80,7 @@ static int kfd_init(void)
>>>>
>>>>    static void kfd_exit(void)
>>>>    {
>>>> +    kfd_cleanup_processes();
>>>>        kfd_debugfs_fini();
>>>>        kfd_process_destroy_wq();
>>>>        kfd_procfs_shutdown();
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index ebabe92f7edb..b5b28a32639d 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1181,6 +1181,17 @@ static void kfd_process_notifier_release(struct
>>> mmu_notifier *mn,
>>>>            return;
>>>>
>>>>        mutex_lock(&kfd_processes_mutex);
>>>> +    /*
>>>> +     * Do early return if p is not in the table.
>>>> +     *
>>>> +     * This could potentially happen if this function is called 
>>>> concurrently
>>>> +     * by mmu_notifier and by kfd_cleanup_pocesses.
>>>> +     *
>>>> +     */
>>>> +    if (!hash_hashed(&p->kfd_processes)) {
>>>> +        mutex_unlock(&kfd_processes_mutex);
>>>> +        return;
>>>> +    }
>>>>        hash_del_rcu(&p->kfd_processes);
>>>>        mutex_unlock(&kfd_processes_mutex);
>>>>        synchronize_srcu(&kfd_processes_srcu);
>>>> @@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops
>>> kfd_process_mmu_notifier_ops = {
>>>>        .free_notifier = kfd_process_free_notifier,
>>>>    };
>>>>
>>>> +
>>>> +void kfd_cleanup_processes(void)
>>>> +{
>>>> +    struct kfd_process *p;
>>>> +    unsigned int temp;
>>>> +
>>>> +    /*
>>>> +     * Iterate over remaining processes in table, calling notifier 
>>>> release
>>>> +     * to free up notifier and process resources.
>>>> +     *
>>>> +     * This code handles the case when driver is unloaded before all
>>> mm_struct
>>>> +     * are released.
>>>> +     */
>>>> +    int idx = srcu_read_lock(&kfd_processes_srcu);
>>>> +
>>>> +    hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>> +        if (p) {
>>>> +            /*
>>>> +             * Obtain a reference on p to avoid a late
>>> mmu_notifier release
>>>> +             * call triggering freeing the process.
>>>> +             */
>>>> +
>>>> +            kref_get(&p->ref);
>>>> +
>>>> +            srcu_read_unlock(&kfd_processes_srcu, idx);
>>>> +
>>>> + kfd_process_notifier_release(&p->mmu_notifier, p-
>>>> mm);
>>>> +
>>>> +            kfd_unref_process(p);
>>>> +
>>>> +            idx = srcu_read_lock(&kfd_processes_srcu);
>>>> +        }
>>>> +    }
>>>> +    srcu_read_unlock(&kfd_processes_srcu, idx);
>>>> +
>>>> +    /*
>>>> +     * Must be called after all mmu_notifier_put are done and before
>>>> +     * kfd_process_wq is released.
>>>> +     *
>>>> +     * Ensures that all outstanding free_notifier gets called, 
>>>> triggering the
>>> release
>>>> +     * of the process.
>>>> +     */
>>>> +    mmu_notifier_synchronize();
>>>> +}
>>>> +
>>>> +
>>>>    static int kfd_process_init_cwsr_apu(struct kfd_process *p, 
>>>> struct file
>>> *filep)
>>>>    {
>>>>        unsigned long  offset;
>


More information about the amd-gfx mailing list