[PATCH 1/1] amdgpu fix for gfx1103 queue evict/restore crash

Felix Kuehling felix.kuehling at amd.com
Fri Nov 29 17:21:12 UTC 2024


On 2024-11-28 21:51, Mika Laitio wrote:
> Thanks for the feedback. The problem is real anyway and breaks userspace
> apps if my patch is not in use. I have actually spent this day
> investigating and testing another GPU hang bug that was originally
> reported by others on gfx1010/AMD RX 5700. I originally thought that
> the bug was different because I was not able to trigger it with the test
> app that crashes the kernel on gfx1103.
>
> With gfx1010 I need to run the pytorch gpu benchmark, which does
> heavier calculation. On the kernel side the symptom is the same: the
> kernel fails to remove the queue in a similar evict/restore cycle that
> the kernel seems to do constantly. This bug has one annoying side effect:
> a regular user-level reboot will hang, requiring the power button to
> shut down the device. (echo b >/proc/sysrq-trigger works sometimes)
>
> Anyway, I have managed to get the gfx1010 to also stay stable and
> finish the benchmarks if I do a similar type of fix/workaround that
> prevents the queue remove/restore from happening in the evict and
> restore methods.
>
> It may or may not in reality be a firmware bug; that is hard to debug
> as I do not have access to the firmware code. But I think this should be
> fixed somehow anyway. (The kernel has tons of workarounds for other
> broken firmware and hw problems anyway.)
>
> I can however try to approach this in some other way as well. Would you
> have any suggestions? I have only played with the recent AMD gpu kernel
> driver stack for a couple of days, so I am probably missing something,
> but here are 2 observations/questions I have in mind:
>
> 1) Is it really necessary to evict/restore the queues in the firmware as
> well, before they really need to be deleted more permanently? I mean,
> would it be enough to just mark the queues disabled/enabled in the
> kernel structure when preemption happens?

The purpose of the preemption is to stop GPU access to memory. Just 
marking the queue as preempted does not accomplish this.

If we don't stop GPU access to memory, the GPU can corrupt physical 
memory that no longer belongs to the process. This will very likely lead 
to problems like random crashes, file system corruption, etc. That's not 
a valid solution to your problem.
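
To illustrate the difference with a toy model (plain userspace C, not 
driver code; the toy_* names are made up for this example and do not 
exist in amdgpu): the "GPU" below only looks at whether the queue is 
still mapped on the hardware, so a flag that lives only in driver 
bookkeeping does not stop the write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical, not amdgpu code. */
struct toy_queue {
	bool sw_evicted; /* bookkeeping flag that only the driver reads */
	bool hw_mapped;  /* whether the "hardware" still runs the queue */
};

/* The "GPU" never reads driver bookkeeping, only its own mapping state. */
static bool toy_gpu_write(const struct toy_queue *q, uint32_t *page,
			  uint32_t val)
{
	assert(q && page);
	if (!q->hw_mapped)
		return false;
	*page = val; /* lands in memory regardless of sw_evicted */
	return true;
}

/* Marking only, as proposed in question 1. */
static void toy_mark_evicted(struct toy_queue *q)
{
	q->sw_evicted = true;
}

/* Eviction that also takes the queue off the hardware. */
static void toy_evict(struct toy_queue *q)
{
	q->sw_evicted = true;
	q->hw_mapped = false;
}
```

After toy_mark_evicted() the write still goes through; only after 
toy_evict() is the page left alone. In the real driver the page may 
already belong to someone else by then.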


>
> 2) dqm_lock, which protects the queue lists that are removed/restored,
> uses memalloc_noreclaim_save/restore calls that, according to the
> documentation, can easily cause problems if fs calls or recursion
> happen. Could userspace trigger that problem by using some amdgpu
> specific sysfs interface calls? Or can the MES firmware somehow call
> back into kernel functions that cause a recursive loop while performing
> the queue remove method calls?

MES firmware does not call back into the kernel. The kernel mode driver 
handles the locking and is careful not to cause recursions with memory 
reclaim. That's what these memalloc_noreclaim_save/restore calls are there for. The 
kernel has some lock dependency debugging features that can prove 
locking correctness. We usually have those enabled in developer builds 
to ensure that we're handling this correctly.
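
As a rough userspace model of that pattern (a sketch only: the model_* 
names and MODEL_PF_MEMALLOC are stand-ins invented for this example, a 
bool replaces the mutex, and one global word replaces the per-task 
flags): the save call sets a "no reclaim" flag and returns the previous 
state, and the restore call puts that previous state back, so the scope 
nests correctly around the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only. All names below are hypothetical stand-ins. */
#define MODEL_PF_MEMALLOC 0x1u

static unsigned int model_task_flags; /* models the current task's flags */

/* Set the flag and return its previous state so callers can nest. */
static unsigned int model_noreclaim_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC;

	model_task_flags |= MODEL_PF_MEMALLOC;
	return old;
}

/* Put back whatever state the matching save call observed. */
static void model_noreclaim_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC) | old;
}

struct model_dqm {
	bool locked;              /* stands in for the real mutex */
	unsigned int saved_flags; /* state returned by the save call */
};

static void model_dqm_lock(struct model_dqm *dqm)
{
	assert(!dqm->locked); /* the model lock is not recursive */
	dqm->locked = true;
	dqm->saved_flags = model_noreclaim_save();
}

static void model_dqm_unlock(struct model_dqm *dqm)
{
	assert(dqm->locked);
	model_noreclaim_restore(dqm->saved_flags);
	dqm->locked = false;
}

/* Allocations made while this returns false must not enter reclaim. */
static bool model_may_reclaim(void)
{
	return !(model_task_flags & MODEL_PF_MEMALLOC);
}
```

While the model lock is held, model_may_reclaim() is false, so an 
allocation under the lock cannot recurse into reclaim paths that might 
want the same lock.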

The error messages you're showing do not point at a locking issue. 
Rather the firmware is reporting that it's unable to preempt the queue. 
The driver responds with a GPU reset.

Regards,
   Felix


>
> Below is the gfx1010 dmesg with added trace calls that reveals kernel 
> problems with queues while using that device.
> I have again added some extra trace calls to print out the function name
> when it starts and which method called it.
>
> [  884.437695] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  884.437704] amdgpu: evict_process_queues_cpsch started
> [  884.443511] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  884.443520] amdgpu: restore_process_queues_cpsch started
> [  907.375917] amdgpu: evict_process_queues_cpsch started
> [  907.375981] amdgpu: evict_process_worker Finished evicting pasid 0x8005
> [  907.483535] amdgpu: restore_process_queues_cpsch started
> [  909.013279] amdgpu: kgd2kfd_quiesce_mm called by svm_range_evict
> [  909.013286] amdgpu: evict_process_queues_cpsch started
> [  909.033675] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  909.033681] amdgpu: evict_process_queues_cpsch started
> [  909.059674] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  909.059680] amdgpu: restore_process_queues_cpsch started
> [  909.082565] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  909.082572] amdgpu: evict_process_queues_cpsch started
> [  909.295184] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  909.295190] amdgpu: restore_process_queues_cpsch started
> [  909.608840] amdgpu: kgd2kfd_resume_mm called by svm_range_restore_work
> [  909.608846] amdgpu: restore_process_queues_cpsch started
> [  966.354867] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  966.354876] amdgpu: evict_process_queues_cpsch started
> [  966.361293] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  966.361303] amdgpu: restore_process_queues_cpsch started
> [  984.457200] amdgpu: evict_process_queues_cpsch started
> [  984.457261] amdgpu: evict_process_worker Finished evicting pasid 0x8005
> [  984.562403] amdgpu: restore_process_queues_cpsch started
> [  984.628620] amdgpu: kgd2kfd_quiesce_mm called by svm_range_evict
> [  984.628627] amdgpu: evict_process_queues_cpsch started
> [  984.650436] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  984.650443] amdgpu: evict_process_queues_cpsch started
> [  984.718544] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  984.718550] amdgpu: restore_process_queues_cpsch started
> [  984.738360] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  984.738367] amdgpu: evict_process_queues_cpsch started
> [  984.765031] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  984.765038] amdgpu: restore_process_queues_cpsch started
> [  984.785180] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  984.785187] amdgpu: evict_process_queues_cpsch started
> [  984.907430] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  984.907435] amdgpu: restore_process_queues_cpsch started
> [  984.930399] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  984.930405] amdgpu: evict_process_queues_cpsch started
> [  984.956551] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  984.956561] amdgpu: restore_process_queues_cpsch started
> [  985.288614] amdgpu: kgd2kfd_resume_mm called by svm_range_restore_work
> [  985.288621] amdgpu: restore_process_queues_cpsch started
> [  998.410978] amdgpu: evict_process_queues_cpsch started
> [  998.411041] amdgpu: evict_process_worker Finished evicting pasid 0x8005
> [  998.513922] amdgpu: restore_process_queues_cpsch started
> [  998.531861] amdgpu: kgd2kfd_quiesce_mm called by svm_range_evict
> [  998.531867] amdgpu: evict_process_queues_cpsch started
> [  998.553650] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  998.553656] amdgpu: evict_process_queues_cpsch started
> [  998.581235] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  998.581241] amdgpu: restore_process_queues_cpsch started
> [  998.607168] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  998.607174] amdgpu: evict_process_queues_cpsch started
> [  998.700499] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  998.700506] amdgpu: restore_process_queues_cpsch started
> [  998.718179] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  998.718187] amdgpu: evict_process_queues_cpsch started
> [  998.810595] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  998.810603] amdgpu: restore_process_queues_cpsch started
> [  998.831776] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  998.831782] amdgpu: evict_process_queues_cpsch started
> [  998.858199] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  998.858205] amdgpu: restore_process_queues_cpsch started
> [  998.880604] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [  998.880611] amdgpu: evict_process_queues_cpsch started
> [  998.912335] amdgpu: kgd2kfd_resume_mm called by 
> amdgpu_amdkfd_restore_userptr_worker
> [  998.912343] amdgpu: restore_process_queues_cpsch started
> [  999.237449] amdgpu: kgd2kfd_resume_mm called by svm_range_restore_work
> [  999.237455] amdgpu: restore_process_queues_cpsch started
> [ 1058.513361] amdgpu: kgd2kfd_quiesce_mm called by 
> amdgpu_amdkfd_evict_userptr
> [ 1058.513373] amdgpu: evict_process_queues_cpsch started
> [ 1062.513487] amdgpu 0000:03:00.0: amdgpu: Queue preemption failed 
> for queue with doorbell_id: 80004008
> [ 1062.513500] amdgpu 0000:03:00.0: amdgpu: Failed to evict process 
> queue 0, caller: kgd2kfd_quiesce_mm
> [ 1062.513503] amdgpu: Failed to quiesce KFD
> [ 1062.513551] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [ 1062.513628] amdgpu: evict_process_queues_cpsch started
> [ 1062.513694] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> [ 1062.517229] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> [ 1062.866910] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper 
> [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
> [ 1062.867435] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
> [ 1062.915075] amdgpu 0000:03:00.0: amdgpu: BACO reset
> [ 1062.937902] amdgpu: kgd2kfd_quiesce_mm called by svm_range_evict
> [ 1062.937907] amdgpu: evict_process_queues_cpsch started
>
>
>
>
> On Wed, Nov 27, 2024 at 3:50 PM Felix Kuehling 
> <felix.kuehling at amd.com> wrote:
>
>
>     On 2024-11-27 06:51, Christian König wrote:
>     > Am 27.11.24 um 12:46 schrieb Mika Laitio:
>     >> AMD gfx1103 / M780 iGPU will crash eventually when used for
>     >> pytorch ML/AI operations on rocm sdk stack. After kernel error
>     >> the application exits on error and linux desktop can itself
>     >> sometimes either freeze or reset back to login screen.
>     >>
>     >> The error happens randomly when the kernel calls the
>     >> evict_process_queues_cpsch and
>     >> restore_process_queues_cpsch methods to remove and restore the
>     >> queues that have been created earlier.
>     >>
>     >> The fix is to remove the evict and restore calls when the device
>     >> used is an iGPU. The queues that have been added during the user
>     >> space application execution
>     >> time will still be removed when the application exits.
>     >
>     > As far as I can see that is absolutely not a fix but rather an
>     > obviously broken workaround.
>     >
>     > Evicting and restoring queues is usually mandatory for correct
>     > operation.
>     >
>     > So just ignoring that this doesn't work is not something you
>     > can do.
>
>     I agree. Eviction happens for example in MMU notifiers where we need to
>     assure the kernel that memory won't be accessed by the GPU once the
>     notifier returns, until the memory mappings in the GPU page tables can
>     be revalidated.
>
>     This looks like a crude workaround for an MES firmware problem or some
>     other kind of intermittent hang that needs to be root-caused. It's a
>     NACK from me as well.
>
>     Regards,
>        Felix
>
>
>     >
>     > Regards,
>     > Christian.
>     >
>     >>
>     >> On every test attempt the crash has happened at the
>     >> same location, while removing the 2nd queue of 3, with doorbell
>     >> id 0x1002.
>     >>
>     >> Below is the trace captured by adding more printouts to the problem
>     >> location, so that a message is also printed when a queue is evicted
>     >> or restored successfully.
>     >>
>     >> [  948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1202, queue: 2, caller:
>     >> restore_process_queues_cpsch
>     >> [  948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1002, queue: 1, caller:
>     >> restore_process_queues_cpsch
>     >> [  948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1000, queue: 0, caller:
>     >> restore_process_queues_cpsch
>     >> [  952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes
>     removed
>     >> hardware queue from MES, doorbell=0x1202, queue: 2, caller:
>     >> evict_process_queues_cpsch
>     >> [  952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes
>     removed
>     >> hardware queue from MES, doorbell=0x1002, queue: 1, caller:
>     >> evict_process_queues_cpsch
>     >> [  952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes
>     removed
>     >> hardware queue from MES, doorbell=0x1000, queue: 0, caller:
>     >> evict_process_queues_cpsch
>     >> [  952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1202, queue: 2, caller:
>     >> restore_process_queues_cpsch
>     >> [  952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1002, queue: 1, caller:
>     >> restore_process_queues_cpsch
>     >> [  952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added
>     >> hardware queue to MES, doorbell=0x1000, queue: 0, caller:
>     >> restore_process_queues_cpsch
>     >> [  952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes
>     removed
>     >> hardware queue from MES, doorbell=0x1202, queue: 2, caller:
>     >> evict_process_queues_cpsch
>     >> [  954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to
>     respond to
>     >> msg=REMOVE_QUEUE
>     >> [  954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes
>     failed
>     >> to remove hardware queue from MES, doorbell=0x1002, queue: 1,
>     caller:
>     >> evict_process_queues_cpsch
>     >> [  954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in
>     >> unrecoverable state, issue a GPU reset
>     >> [  954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
>     >> [  954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict
>     process
>     >> queues
>     >> [  954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
>     >> [  954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes:
>     >> Failed to remove queue 0 for dev 5257
>     >> [  954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
>     >> [  954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
>     Completed
>     >> [  955.865591] amdgpu: process_termination_cpsch started
>     >> [  955.866432] amdgpu: process_termination_cpsch started
>     >> [  955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove
>     queue 0
>     >> [  956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to
>     respond to
>     >> msg=REMOVE_QUEUE
>     >> [  956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]]
>     *ERROR*
>     >> failed to unmap legacy queue
>     >> [  958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to
>     respond to
>     >> msg=REMOVE_QUEUE
>     >> [  958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]]
>     *ERROR*
>     >> failed to unmap legacy queue
>     >> [  960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to
>     respond to
>     >> msg=REMOVE_QUEUE
>     >> [  960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]]
>     *ERROR*
>     >> failed to unmap legacy queue
>     >> [  960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to
>     >> halt cp gfx
>     >>
>     >> Signed-off-by: Mika Laitio <lamikr at gmail.com>
>     >> ---
>     >>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 24
>     ++++++++++++-------
>     >>   1 file changed, 16 insertions(+), 8 deletions(-)
>     >>
>     >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>     >> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>     >> index c79fe9069e22..96088d480e09 100644
>     >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>     >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>     >> @@ -1187,9 +1187,12 @@ static int
>     evict_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >>       struct kfd_process_device *pdd;
>     >>       int retval = 0;
>     >>   +    // gfx1103 APU can fail to remove queue on evict/restore
>     cycle
>     >> +    if (dqm->dev->adev->flags & AMD_IS_APU)
>     >> +        goto out;
>     >>       dqm_lock(dqm);
>     >>       if (qpd->evicted++ > 0) /* already evicted, do nothing */
>     >> -        goto out;
>     >> +        goto out_unlock;
>     >>         pdd = qpd_to_pdd(qpd);
>     >>   @@ -1198,7 +1201,7 @@ static int
>     evict_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >>        * Skip queue eviction on process eviction.
>     >>        */
>     >>       if (!pdd->drm_priv)
>     >> -        goto out;
>     >> +        goto out_unlock;
>     >>         pr_debug_ratelimited("Evicting PASID 0x%x queues\n",
>     >>                   pdd->process->pasid);
>     >> @@ -1219,7 +1222,7 @@ static int evict_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >>               if (retval) {
>     >>                   dev_err(dev, "Failed to evict queue %d\n",
>     >>                       q->properties.queue_id);
>     >> -                goto out;
>     >> +                goto out_unlock;
>     >>               }
>     >>           }
>     >>       }
>     >> @@ -1231,8 +1234,9 @@ static int evict_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >> KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
>     >> USE_DEFAULT_GRACE_PERIOD);
>     >>   -out:
>     >> +out_unlock:
>     >>       dqm_unlock(dqm);
>     >> +out:
>     >>       return retval;
>     >>   }
>     >>   @@ -1326,14 +1330,17 @@ static int
>     >> restore_process_queues_cpsch(struct device_queue_manager *dqm,
>     >>       uint64_t eviction_duration;
>     >>       int retval = 0;
>     >>   +    // gfx1103 APU can fail to remove queue on evict/restore
>     cycle
>     >> +    if (dqm->dev->adev->flags & AMD_IS_APU)
>     >> +        goto out;
>     >>       pdd = qpd_to_pdd(qpd);
>     >>         dqm_lock(dqm);
>     >>       if (WARN_ON_ONCE(!qpd->evicted)) /* already restored, do
>     >> nothing */
>     >> -        goto out;
>     >> +        goto out_unlock;
>     >>       if (qpd->evicted > 1) { /* ref count still > 0, decrement &
>     >> quit */
>     >>           qpd->evicted--;
>     >> -        goto out;
>     >> +        goto out_unlock;
>     >>       }
>     >>         /* The debugger creates processes that temporarily have
>     not
>     >> acquired
>     >> @@ -1364,7 +1371,7 @@ static int
>     restore_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >>               if (retval) {
>     >>                   dev_err(dev, "Failed to restore queue %d\n",
>     >>                       q->properties.queue_id);
>     >> -                goto out;
>     >> +                goto out_unlock;
>     >>               }
>     >>           }
>     >>       }
>     >> @@ -1375,8 +1382,9 @@ static int
>     restore_process_queues_cpsch(struct
>     >> device_queue_manager *dqm,
>     >>       atomic64_add(eviction_duration,
>     &pdd->evict_duration_counter);
>     >>   vm_not_acquired:
>     >>       qpd->evicted = 0;
>     >> -out:
>     >> +out_unlock:
>     >>       dqm_unlock(dqm);
>     >> +out:
>     >>       return retval;
>     >>   }
>     >
>

