[PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

Belanger, David David.Belanger at amd.com
Wed Mar 8 15:04:36 UTC 2023


[AMD Official Use Only - General]



> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Wednesday, March 8, 2023 4:08 AM
> To: Belanger, David <David.Belanger at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> exit.
> 
> Caution: This message originated from an External Source. Use proper
> caution when opening attachments, clicking links, or responding.
> 
> 
> Am 07.03.23 um 16:28 schrieb Belanger, David:
> > [AMD Official Use Only - General]
> >
> >
> > The test case is a python program that will load the driver, do some
> operations, then unload the driver.
> 
> What do you mean with unloading the driver? Removing the module? Or
> destroying the device?
> 

The python program calls a shell script that does "modprobe amdgpu".
Calls some SMI operation to get some events.
Then it calls a shell scripts that does "modprobe -r amdgpu".
Then it exits.

There will be ref on the kfd_process that will remain, which will be released only when mmu_notifier ops->release is called.
This does not get called until the python process ends.

The test program is definitively not the typical use a general user would do.

> > When the driver exists, there is still the python process space around
> holding on the address space.
> > When the python process space exits, the mmu_notifier gets called but the
> driver has already been unloaded.
> >
> > The goal of the fix is to address case where there could be
> > outstanding address space / worker threads for process cleanup that needs
> to be cleared/completed at exit time.
> 
> Yeah and when the module is unloaded this is a completely futile effort.
> 
> The general upstream approach is to take references on the struct device and
> module and prevent unloading as long as those references exists.
> 
> The device might be non-functional any more (because for example of hot
> plug), but the driver should never be unloaded before the python program
> exits.

Thank you for your feedback, I will investigate that approach.

> 
> Regards,
> Christian.
> 
> >
> > Regards,
> > David B.
> >
> >> -----Original Message-----
> >> From: Koenig, Christian <Christian.Koenig at amd.com>
> >> Sent: Tuesday, March 7, 2023 2:05 AM
> >> To: Belanger, David <David.Belanger at amd.com>; amd-
> >> gfx at lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> >> exit.
> >>
> >> Am 06.03.23 um 22:58 schrieb David Belanger:
> >>> Handle case when module is unloaded (kfd_exit) before a process
> >>> space
> >>> (mm_struct) is released.
> >> Well that should never ever happen in the first place. It sounds like
> >> we are missing grabbing module references.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: David Belanger <david.belanger at amd.com>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
> >>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57
> >> ++++++++++++++++++++++++
> >>>    2 files changed, 61 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> index 09b966dc3768..8ef4bd9e4f7d 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> @@ -26,6 +26,9 @@
> >>>    #include "kfd_priv.h"
> >>>    #include "amdgpu_amdkfd.h"
> >>>
> >>> +void kfd_cleanup_processes(void);
> >>> +
> >>> +
> >>>    static int kfd_init(void)
> >>>    {
> >>>     int err;
> >>> @@ -77,6 +80,7 @@ static int kfd_init(void)
> >>>
> >>>    static void kfd_exit(void)
> >>>    {
> >>> +   kfd_cleanup_processes();
> >>>     kfd_debugfs_fini();
> >>>     kfd_process_destroy_wq();
> >>>     kfd_procfs_shutdown();
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> index ebabe92f7edb..b5b28a32639d 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> @@ -1181,6 +1181,17 @@ static void
> >>> kfd_process_notifier_release(struct
> >> mmu_notifier *mn,
> >>>             return;
> >>>
> >>>     mutex_lock(&kfd_processes_mutex);
> >>> +   /*
> >>> +    * Do early return if p is not in the table.
> >>> +    *
> >>> +    * This could potentially happen if this function is called concurrently
> >>> +    * by mmu_notifier and by kfd_cleanup_pocesses.
> >>> +    *
> >>> +    */
> >>> +   if (!hash_hashed(&p->kfd_processes)) {
> >>> +           mutex_unlock(&kfd_processes_mutex);
> >>> +           return;
> >>> +   }
> >>>     hash_del_rcu(&p->kfd_processes);
> >>>     mutex_unlock(&kfd_processes_mutex);
> >>>     synchronize_srcu(&kfd_processes_srcu);
> >>> @@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops
> >> kfd_process_mmu_notifier_ops = {
> >>>     .free_notifier = kfd_process_free_notifier,
> >>>    };
> >>>
> >>> +
> >>> +void kfd_cleanup_processes(void)
> >>> +{
> >>> +   struct kfd_process *p;
> >>> +   unsigned int temp;
> >>> +
> >>> +   /*
> >>> +    * Iterate over remaining processes in table, calling notifier release
> >>> +    * to free up notifier and process resources.
> >>> +    *
> >>> +    * This code handles the case when driver is unloaded before all
> >> mm_struct
> >>> +    * are released.
> >>> +    */
> >>> +   int idx = srcu_read_lock(&kfd_processes_srcu);
> >>> +
> >>> +   hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> >>> +           if (p) {
> >>> +                   /*
> >>> +                    * Obtain a reference on p to avoid a late
> >> mmu_notifier release
> >>> +                    * call triggering freeing the process.
> >>> +                    */
> >>> +
> >>> +                   kref_get(&p->ref);
> >>> +
> >>> +                   srcu_read_unlock(&kfd_processes_srcu, idx);
> >>> +
> >>> +                   kfd_process_notifier_release(&p->mmu_notifier,
> >>> + p-
> >>> mm);
> >>> +
> >>> +                   kfd_unref_process(p);
> >>> +
> >>> +                   idx = srcu_read_lock(&kfd_processes_srcu);
> >>> +           }
> >>> +   }
> >>> +   srcu_read_unlock(&kfd_processes_srcu, idx);
> >>> +
> >>> +   /*
> >>> +    * Must be called after all mmu_notifier_put are done and before
> >>> +    * kfd_process_wq is released.
> >>> +    *
> >>> +    * Ensures that all outstanding free_notifier gets called,
> >>> + triggering the
> >> release
> >>> +    * of the process.
> >>> +    */
> >>> +   mmu_notifier_synchronize();
> >>> +}
> >>> +
> >>> +
> >>>    static int kfd_process_init_cwsr_apu(struct kfd_process *p,
> >>> struct file
> >> *filep)
> >>>    {
> >>>     unsigned long  offset;


More information about the amd-gfx mailing list