<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body>
Will it looks cleaner if we keep a pre_reset flag in per device structure and check it in the function when talk to hw?<br>
<br>
Regards<br>
Shaoyun.liu
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Kuehling, Felix <Felix.Kuehling@amd.com><br>
<b>Sent:</b> December 19, 2019 9:21:08 PM<br>
<b>To:</b> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Liu, Shaoyun <Shaoyun.Liu@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com><br>
<b>Subject:</b> Re: [PATCH 1/1] drm/amdkfd: Don't touch the hardware in pre_reset callback</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">[+Andrey]<br>
<br>
Hi Shaoyun, Monk, and Andrey,<br>
<br>
I tested this on my bare-metal Vega10 system. GPU reset (using BACO) is <br>
flaky on this system with and without this patch. The first reset seems <br>
to work OK, the second one fails in different ways. In theory this <br>
change should be an improvement as it eliminated at least one failure <br>
mode (hanging before the actual reset). A least I don't think I'm making <br>
anything worse.<br>
<br>
My test is to run KFDTest and trigger a GPU reset while it's busy with <br>
the eviction test. This test runs multiple KFD processes and also <br>
performs graphics memory allocations and command submissions to force <br>
TTM to evict KFD processes. I trigger the reset by hanging the HWS <br>
through a debugfs hook:<br>
<br>
# cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id > <br>
/sys/kernel/debug/kfd/hang_hws<br>
<br>
Please give this patch a try and reinstate the pre_reset call before the <br>
FLR for SRIOV.<br>
<br>
Regards,<br>
Felix<br>
<br>
On 2019-12-19 9:09 p.m., Felix Kuehling wrote:<br>
> The intention of the pre_reset callback is to update the driver<br>
> state to reflect that all user mode queues are preempted and the<br>
> HIQ is destroyed. However we should not actually preempt any queues<br>
> or otherwise touch the hardware because it is presumably hanging.<br>
> The impending reset will take care of actually stopping all queues.<br>
><br>
> This should prevent KIQ register access hanging on SRIOV function<br>
> level reset (FLR). It should also speed up the reset by avoiding<br>
> unnecessary timeouts from a potentially hanging GPU scheduler.<br>
><br>
> CC: shaoyunl <shaoyun.liu@amd.com><br>
> CC: Liu Monk <Monk.Liu@amd.com><br>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com><br>
> ---<br>
> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 24 ++++++++++-------<br>
> .../drm/amd/amdkfd/kfd_device_queue_manager.c | 27 ++++++++++++-------<br>
> .../drm/amd/amdkfd/kfd_device_queue_manager.h | 5 ++--<br>
> drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 8 +++---<br>
> .../gpu/drm/amd/amdkfd/kfd_packet_manager.c | 4 +--<br>
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 8 +++---<br>
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 11 ++++----<br>
> .../amd/amdkfd/kfd_process_queue_manager.c | 2 +-<br>
> 8 files changed, 53 insertions(+), 36 deletions(-)<br>
><br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br>
> index c6b6901bbda3..796996a0d926 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c<br>
> @@ -486,6 +486,7 @@ static int kfd_gtt_sa_init(struct kfd_dev *kfd, unsigned int buf_size,<br>
> unsigned int chunk_size);<br>
> static void kfd_gtt_sa_fini(struct kfd_dev *kfd);<br>
> <br>
> +static void kfd_suspend(struct kfd_dev *kfd, bool pre_reset);<br>
> static int kfd_resume(struct kfd_dev *kfd);<br>
> <br>
> struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,<br>
> @@ -728,7 +729,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)<br>
> {<br>
> if (!kfd->init_complete)<br>
> return 0;<br>
> - kgd2kfd_suspend(kfd);<br>
> + kfd_suspend(kfd, true);<br>
> <br>
> kfd_signal_reset_event(kfd);<br>
> return 0;<br>
> @@ -767,13 +768,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd)<br>
> if (!kfd->init_complete)<br>
> return;<br>
> <br>
> - /* For first KFD device suspend all the KFD processes */<br>
> - if (atomic_inc_return(&kfd_locked) == 1)<br>
> - kfd_suspend_all_processes();<br>
> -<br>
> - kfd->dqm->ops.stop(kfd->dqm);<br>
> -<br>
> - kfd_iommu_suspend(kfd);<br>
> + kfd_suspend(kfd, false);<br>
> }<br>
> <br>
> int kgd2kfd_resume(struct kfd_dev *kfd)<br>
> @@ -795,6 +790,17 @@ int kgd2kfd_resume(struct kfd_dev *kfd)<br>
> return ret;<br>
> }<br>
> <br>
> +static void kfd_suspend(struct kfd_dev *kfd, bool pre_reset)<br>
> +{<br>
> + /* For first KFD device suspend all the KFD processes */<br>
> + if (atomic_inc_return(&kfd_locked) == 1)<br>
> + kfd_suspend_all_processes(pre_reset);<br>
> +<br>
> + kfd->dqm->ops.stop(kfd->dqm, pre_reset);<br>
> +<br>
> + kfd_iommu_suspend(kfd);<br>
> +}<br>
> +<br>
> static int kfd_resume(struct kfd_dev *kfd)<br>
> {<br>
> int err = 0;<br>
> @@ -877,7 +883,7 @@ int kgd2kfd_quiesce_mm(struct mm_struct *mm)<br>
> if (!p)<br>
> return -ESRCH;<br>
> <br>
> - r = kfd_process_evict_queues(p);<br>
> + r = kfd_process_evict_queues(p, false);<br>
> <br>
> kfd_unref_process(p);<br>
> return r;<br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c<br>
> index f7f6df40875e..3a49944164da 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c<br>
> @@ -592,7 +592,8 @@ static int update_queue(struct device_queue_manager *dqm, struct queue *q)<br>
> }<br>
> <br>
> static int evict_process_queues_nocpsch(struct device_queue_manager *dqm,<br>
> - struct qcm_process_device *qpd)<br>
> + struct qcm_process_device *qpd,<br>
> + bool pre_reset)<br>
> {<br>
> struct queue *q;<br>
> struct mqd_manager *mqd_mgr;<br>
> @@ -622,6 +623,8 @@ static int evict_process_queues_nocpsch(struct device_queue_manager *dqm,<br>
> <br>
> if (WARN_ONCE(!dqm->sched_running, "Evict when stopped\n"))<br>
> continue;<br>
> + if (pre_reset)<br>
> + continue;<br>
> <br>
> retval = mqd_mgr->destroy_mqd(mqd_mgr, q->mqd,<br>
> KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN,<br>
> @@ -639,7 +642,8 @@ static int evict_process_queues_nocpsch(struct device_queue_manager *dqm,<br>
> }<br>
> <br>
> static int evict_process_queues_cpsch(struct device_queue_manager *dqm,<br>
> - struct qcm_process_device *qpd)<br>
> + struct qcm_process_device *qpd,<br>
> + bool pre_reset)<br>
> {<br>
> struct queue *q;<br>
> struct kfd_process_device *pdd;<br>
> @@ -664,6 +668,10 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,<br>
> q->properties.is_active = false;<br>
> dqm->queue_count--;<br>
> }<br>
> +<br>
> + if (pre_reset)<br>
> + goto out;<br>
> +<br>
> retval = execute_queues_cpsch(dqm,<br>
> qpd->is_debug ?<br>
> KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES :<br>
> @@ -944,10 +952,10 @@ static int start_nocpsch(struct device_queue_manager *dqm)<br>
> return 0;<br>
> }<br>
> <br>
> -static int stop_nocpsch(struct device_queue_manager *dqm)<br>
> +static int stop_nocpsch(struct device_queue_manager *dqm, bool pre_reset)<br>
> {<br>
> if (dqm->dev->device_info->asic_family == CHIP_HAWAII)<br>
> - pm_uninit(&dqm->packets);<br>
> + pm_uninit(&dqm->packets, pre_reset);<br>
> dqm->sched_running = false;<br>
> <br>
> return 0;<br>
> @@ -1107,20 +1115,21 @@ static int start_cpsch(struct device_queue_manager *dqm)<br>
> return 0;<br>
> fail_allocate_vidmem:<br>
> fail_set_sched_resources:<br>
> - pm_uninit(&dqm->packets);<br>
> + pm_uninit(&dqm->packets, false);<br>
> fail_packet_manager_init:<br>
> return retval;<br>
> }<br>
> <br>
> -static int stop_cpsch(struct device_queue_manager *dqm)<br>
> +static int stop_cpsch(struct device_queue_manager *dqm, bool pre_reset)<br>
> {<br>
> dqm_lock(dqm);<br>
> - unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);<br>
> + if (!pre_reset)<br>
> + unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);<br>
> dqm->sched_running = false;<br>
> dqm_unlock(dqm);<br>
> <br>
> kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);<br>
> - pm_uninit(&dqm->packets);<br>
> + pm_uninit(&dqm->packets, pre_reset);<br>
> <br>
> return 0;<br>
> }<br>
> @@ -1891,7 +1900,7 @@ int kfd_process_vm_fault(struct device_queue_manager *dqm,<br>
> return -EINVAL;<br>
> pdd = kfd_get_process_device_data(dqm->dev, p);<br>
> if (pdd)<br>
> - ret = dqm->ops.evict_process_queues(dqm, &pdd->qpd);<br>
> + ret = dqm->ops.evict_process_queues(dqm, &pdd->qpd, false);<br>
> kfd_unref_process(p);<br>
> <br>
> return ret;<br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h<br>
> index a8c37e6da027..9f82f95f6a92 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h<br>
> @@ -103,7 +103,7 @@ struct device_queue_manager_ops {<br>
> <br>
> int (*initialize)(struct device_queue_manager *dqm);<br>
> int (*start)(struct device_queue_manager *dqm);<br>
> - int (*stop)(struct device_queue_manager *dqm);<br>
> + int (*stop)(struct device_queue_manager *dqm, bool pre_reset);<br>
> void (*uninitialize)(struct device_queue_manager *dqm);<br>
> int (*create_kernel_queue)(struct device_queue_manager *dqm,<br>
> struct kernel_queue *kq,<br>
> @@ -129,7 +129,8 @@ struct device_queue_manager_ops {<br>
> struct qcm_process_device *qpd);<br>
> <br>
> int (*evict_process_queues)(struct device_queue_manager *dqm,<br>
> - struct qcm_process_device *qpd);<br>
> + struct qcm_process_device *qpd,<br>
> + bool pre_reset);<br>
> int (*restore_process_queues)(struct device_queue_manager *dqm,<br>
> struct qcm_process_device *qpd);<br>
> <br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c<br>
> index 2d56dc534459..dbd2e8e9ae69 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c<br>
> @@ -195,9 +195,9 @@ static bool kq_initialize(struct kernel_queue *kq, struct kfd_dev *dev,<br>
> }<br>
> <br>
> /* Uninitialize a kernel queue and free all its memory usages. */<br>
> -static void kq_uninitialize(struct kernel_queue *kq)<br>
> +static void kq_uninitialize(struct kernel_queue *kq, bool pre_reset)<br>
> {<br>
> - if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ)<br>
> + if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && !pre_reset)<br>
> kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,<br>
> kq->queue->mqd,<br>
> KFD_PREEMPT_TYPE_WAVEFRONT_RESET,<br>
> @@ -337,9 +337,9 @@ struct kernel_queue *kernel_queue_init(struct kfd_dev *dev,<br>
> return NULL;<br>
> }<br>
> <br>
> -void kernel_queue_uninit(struct kernel_queue *kq)<br>
> +void kernel_queue_uninit(struct kernel_queue *kq, bool pre_reset)<br>
> {<br>
> - kq_uninitialize(kq);<br>
> + kq_uninitialize(kq, pre_reset);<br>
> kfree(kq);<br>
> }<br>
> <br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c<br>
> index 6cabed06ef5d..7732a3bbebd6 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c<br>
> @@ -264,10 +264,10 @@ int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm)<br>
> return 0;<br>
> }<br>
> <br>
> -void pm_uninit(struct packet_manager *pm)<br>
> +void pm_uninit(struct packet_manager *pm, bool pre_reset)<br>
> {<br>
> mutex_destroy(&pm->lock);<br>
> - kernel_queue_uninit(pm->priv_queue);<br>
> + kernel_queue_uninit(pm->priv_queue, pre_reset);<br>
> }<br>
> <br>
> int pm_send_set_resources(struct packet_manager *pm,<br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h<br>
> index 087e96838997..9bc4ced4acba 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h<br>
> @@ -762,9 +762,9 @@ struct kfd_process *kfd_get_process(const struct task_struct *);<br>
> struct kfd_process *kfd_lookup_process_by_pasid(unsigned int pasid);<br>
> struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm);<br>
> void kfd_unref_process(struct kfd_process *p);<br>
> -int kfd_process_evict_queues(struct kfd_process *p);<br>
> +int kfd_process_evict_queues(struct kfd_process *p, bool pre_reset);<br>
> int kfd_process_restore_queues(struct kfd_process *p);<br>
> -void kfd_suspend_all_processes(void);<br>
> +void kfd_suspend_all_processes(bool pre_reset);<br>
> int kfd_resume_all_processes(void);<br>
> <br>
> int kfd_process_device_init_vm(struct kfd_process_device *pdd,<br>
> @@ -883,7 +883,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_dev *dev);<br>
> void device_queue_manager_uninit(struct device_queue_manager *dqm);<br>
> struct kernel_queue *kernel_queue_init(struct kfd_dev *dev,<br>
> enum kfd_queue_type type);<br>
> -void kernel_queue_uninit(struct kernel_queue *kq);<br>
> +void kernel_queue_uninit(struct kernel_queue *kq, bool pre_reset);<br>
> int kfd_process_vm_fault(struct device_queue_manager *dqm, unsigned int pasid);<br>
> <br>
> /* Process Queue Manager */<br>
> @@ -974,7 +974,7 @@ extern const struct packet_manager_funcs kfd_vi_pm_funcs;<br>
> extern const struct packet_manager_funcs kfd_v9_pm_funcs;<br>
> <br>
> int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm);<br>
> -void pm_uninit(struct packet_manager *pm);<br>
> +void pm_uninit(struct packet_manager *pm, bool pre_reset);<br>
> int pm_send_set_resources(struct packet_manager *pm,<br>
> struct scheduling_resources *res);<br>
> int pm_send_runlist(struct packet_manager *pm, struct list_head *dqm_queues);<br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c<br>
> index 536a153ac9a4..7bcadd3a1e3b 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c<br>
> @@ -948,7 +948,7 @@ struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm)<br>
> * Eviction is reference-counted per process-device. This means multiple<br>
> * evictions from different sources can be nested safely.<br>
> */<br>
> -int kfd_process_evict_queues(struct kfd_process *p)<br>
> +int kfd_process_evict_queues(struct kfd_process *p, bool pre_reset)<br>
> {<br>
> struct kfd_process_device *pdd;<br>
> int r = 0;<br>
> @@ -956,7 +956,8 @@ int kfd_process_evict_queues(struct kfd_process *p)<br>
> <br>
> list_for_each_entry(pdd, &p->per_device_data, per_device_list) {<br>
> r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,<br>
> - &pdd->qpd);<br>
> + &pdd->qpd,<br>
> + pre_reset);<br>
> if (r) {<br>
> pr_err("Failed to evict process queues\n");<br>
> goto fail;<br>
> @@ -1026,7 +1027,7 @@ static void evict_process_worker(struct work_struct *work)<br>
> flush_delayed_work(&p->restore_work);<br>
> <br>
> pr_debug("Started evicting pasid 0x%x\n", p->pasid);<br>
> - ret = kfd_process_evict_queues(p);<br>
> + ret = kfd_process_evict_queues(p, false);<br>
> if (!ret) {<br>
> dma_fence_signal(p->ef);<br>
> dma_fence_put(p->ef);<br>
> @@ -1082,7 +1083,7 @@ static void restore_process_worker(struct work_struct *work)<br>
> pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);<br>
> }<br>
> <br>
> -void kfd_suspend_all_processes(void)<br>
> +void kfd_suspend_all_processes(bool pre_reset)<br>
> {<br>
> struct kfd_process *p;<br>
> unsigned int temp;<br>
> @@ -1092,7 +1093,7 @@ void kfd_suspend_all_processes(void)<br>
> cancel_delayed_work_sync(&p->eviction_work);<br>
> cancel_delayed_work_sync(&p->restore_work);<br>
> <br>
> - if (kfd_process_evict_queues(p))<br>
> + if (kfd_process_evict_queues(p, pre_reset))<br>
> pr_err("Failed to suspend process 0x%x\n", p->pasid);<br>
> dma_fence_signal(p->ef);<br>
> dma_fence_put(p->ef);<br>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c<br>
> index d3eacf72e8db..8fa856e6a03f 100644<br>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c<br>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c<br>
> @@ -374,7 +374,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)<br>
> /* destroy kernel queue (DIQ) */<br>
> dqm = pqn->kq->dev->dqm;<br>
> dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);<br>
> - kernel_queue_uninit(pqn->kq);<br>
> + kernel_queue_uninit(pqn->kq, false);<br>
> }<br>
> <br>
> if (pqn->q) {<br>
</div>
</span></font></div>
</body>
</html>