<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Arial;font-size:10pt;color:#0000FF;margin:5pt;" align="Left">
[AMD Official Use Only]<br>
</p>
<br>
<div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Hawking,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
GPU reset will also be called in <span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important">dev->kfd2kgd->ras_process_cb,
so this patch adds page retirement handling before the GPU reset.</span></div>
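<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
To illustrate the ordering, here is a minimal standalone sketch, not driver code. The mock_* types and functions are hypothetical stand-ins (assumptions) for kfd_dev, kfd2kgd_calls and the reset path. It only models the point above: when the callback is set, pages are retired first and the reset still happens; otherwise the handler falls back to a direct reset.</div>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu/KFD types. */
struct mock_dev;

struct mock_kfd2kgd_calls {
    /* Optional per-ASIC hook; NULL when the ASIC does not provide one. */
    int (*ras_process_cb)(struct mock_dev *dev);
};

struct mock_dev {
    const struct mock_kfd2kgd_calls *kfd2kgd;
    int pages_retired; /* counts page retirement passes */
    int resets;        /* counts GPU resets */
};

/* Stand-in for the direct GPU reset path. */
static void mock_gpu_reset(struct mock_dev *dev)
{
    dev->resets++;
}

/* Stand-in for the RAS callback: retire bad pages first, then reset. */
static int mock_ras_process_cb(struct mock_dev *dev)
{
    dev->pages_retired++;
    mock_gpu_reset(dev);
    return 0;
}

/* Models the dispatch the patch adds to the SQ/SDMA poison paths. */
static void mock_handle_poison(struct mock_dev *dev)
{
    assert(dev != NULL && dev->kfd2kgd != NULL);

    if (dev->kfd2kgd->ras_process_cb)
        dev->kfd2kgd->ras_process_cb(dev);
    else
        mock_gpu_reset(dev);
}
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
In this model, a device whose table sets ras_process_cb sees one page retirement pass followed by one reset, while a device with a NULL callback sees only the reset.</div>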
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important">unmap_queue
mode (reset or preemption) is another story; I'll write a new patch once the unmap_queue reset mode becomes functional.</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important"><br>
</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important">I
have another patch that adds a poison mode flag to amdgpu_ras; page retirement will be skipped in the DF irq handler if poison mode is true. That patch will be sent out once the RAS TA supports querying the poison status.</span></div>
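<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Roughly, the flag check could look like the sketch below. This is only an illustration under assumptions: struct mock_ras, the poison_mode field name and mock_df_irq_handler are hypothetical and do not come from the pending patch.</div>
<pre>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced RAS context; the real amdgpu_ras has many more fields. */
struct mock_ras {
    bool poison_mode;   /* hypothetical flag name */
    int df_retirements; /* page retirements done by the DF irq handler */
};

/*
 * Sketch of a DF irq handler that leaves page retirement to the
 * consuming module when poison mode is enabled.
 * Returns true when the handler retired pages itself.
 */
static bool mock_df_irq_handler(struct mock_ras *ras)
{
    assert(ras != NULL);

    if (ras->poison_mode)
        return false; /* the consumer's irq handler retires the page */

    ras->df_retirements++;
    return true;
}
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
With poison_mode false the handler retires pages as before; with poison_mode true it returns without touching the retirement count.</div>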
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important"><br>
</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important">Regards,</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color:rgb(32, 31, 30);font-family:"Segoe UI", "Segoe UI Web (West European)", "Segoe UI", -apple-system, BlinkMacSystemFont, Roboto, "Helvetica Neue", sans-serif;font-size:14.6667px;background-color:rgb(255, 255, 255);display:inline !important">Tao</span></div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Zhang, Hawking <Hawking.Zhang@amd.com><br>
<b>Sent:</b> Tuesday, August 24, 2021 5:10 PM<br>
<b>To:</b> Zhang, Hawking <Hawking.Zhang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Yang, Stanley <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com><br>
<b>Subject:</b> RE: [PATCH] amd/amdkfd: add ras page retirement handling for sq/sdma interrupt</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">
How about we add a new member to the RAS context (amdgpu_ras) to indicate the poison consumption handling mode/approach? That way, we can initialize that member per ASIC.<br>
<br>
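Something along these lines, as a rough standalone sketch rather than real driver code. All mock_* names, the enum values and the counters are hypothetical (assumptions); the sketch only shows a per-ASIC initialized member selecting between the mode 2 reset and unmap_queue approaches, with mode 2 reset as the default.<br>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handling modes; mode 2 reset is value 0 so it is the default. */
enum mock_poison_mode {
    MOCK_POISON_MODE2_RESET = 0,
    MOCK_POISON_UNMAP_QUEUE,
};

/* Hypothetical reduced RAS context carrying the new member. */
struct mock_ras {
    enum mock_poison_mode poison_mode;
    int mode2_resets;  /* times the mode 2 reset path was taken */
    int unmap_queues;  /* times the unmap_queue path was taken */
};

/* Each ASIC's init code would pass the mode it supports. */
static void mock_ras_init(struct mock_ras *ras, enum mock_poison_mode mode)
{
    assert(ras != NULL);
    ras->poison_mode = mode;
    ras->mode2_resets = 0;
    ras->unmap_queues = 0;
}

/* Consumers check the member instead of hard-coding one approach. */
static void mock_handle_poison_consumption(struct mock_ras *ras)
{
    assert(ras != NULL);

    switch (ras->poison_mode) {
    case MOCK_POISON_UNMAP_QUEUE:
        ras->unmap_queues++;
        break;
    case MOCK_POISON_MODE2_RESET:
    default:
        ras->mode2_resets++;
        break;
    }
}
```
</pre>
An ASIC initialized with MOCK_POISON_MODE2_RESET keeps today's behaviour, and one initialized with MOCK_POISON_UNMAP_QUEUE takes the other path without any change on the KFD side.<br>
<br>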
Regards,<br>
Hawking<br>
<br>
-----Original Message-----<br>
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Zhang, Hawking<br>
Sent: Tuesday, August 24, 2021 17:04<br>
To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org; Yang, Stanley <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com><br>
Subject: RE: [PATCH] amd/amdkfd: add ras page retirement handling for sq/sdma interrupt<br>
<br>
Hi Tao,<br>
<br>
This will break the mode 2 reset solution, right? But we have to keep the mode 2 reset solution as the default for now. I think we need a new interface to allow KFD to switch between the unmap_queue and mode 2 reset solutions.<br>
<br>
Regards,<br>
Hawking<br>
<br>
-----Original Message-----<br>
From: Zhou1, Tao <Tao.Zhou1@amd.com> <br>
Sent: Tuesday, August 24, 2021 16:43<br>
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com><br>
Cc: Zhou1, Tao <Tao.Zhou1@amd.com><br>
Subject: [PATCH] amd/amdkfd: add ras page retirement handling for sq/sdma interrupt<br>
<br>
In ras poison mode, page retirement will be handled by the irq handler of the module which consumes corrupted data.<br>
<br>
Signed-off-by: Tao Zhou <tao.zhou1@amd.com><br>
---<br>
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 13 ++++++++++++-<br>
drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 10 ++++++++--<br>
drivers/gpu/drm/amd/include/kgd_kfd_interface.h | 1 +<br>
3 files changed, 21 insertions(+), 3 deletions(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c<br>
index 46cd4ee6bafb..eb5e9c1b1073 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c<br>
@@ -23,6 +23,16 @@<br>
#include "amdgpu_amdkfd.h"<br>
#include "amdgpu_amdkfd_arcturus.h"<br>
#include "amdgpu_amdkfd_gfx_v9.h"<br>
+#include "amdgpu_ras.h"<br>
+#include "amdgpu_umc.h"<br>
+<br>
+int kgd_aldebaran_ras_process_cb(struct kgd_dev *kgd)<br>
+{<br>
+ struct amdgpu_device *adev = (struct amdgpu_device *)kgd;<br>
+ struct ras_err_data err_data = {0, 0, 0, NULL};<br>
+<br>
+ return amdgpu_umc_process_ras_data_cb(adev, &err_data, NULL);<br>
+}<br>
<br>
const struct kfd2kgd_calls aldebaran_kfd2kgd = {<br>
.program_sh_mem_settings = kgd_gfx_v9_program_sh_mem_settings,<br>
@@ -44,5 +54,6 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {<br>
.get_atc_vmid_pasid_mapping_info =<br>
kgd_gfx_v9_get_atc_vmid_pasid_mapping_info,<br>
.set_vm_context_page_table_base = kgd_gfx_v9_set_vm_context_page_table_base,<br>
- .program_trap_handler_settings = kgd_gfx_v9_program_trap_handler_settings<br>
+ .program_trap_handler_settings = kgd_gfx_v9_program_trap_handler_settings,<br>
+ .ras_process_cb = kgd_aldebaran_ras_process_cb<br>
};<br>
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c<br>
index 12d91e53556c..851b5120927a 100644<br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c<br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c<br>
@@ -231,7 +231,10 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,<br>
if (sq_intr_err != SQ_INTERRUPT_ERROR_TYPE_ILLEGAL_INST &&<br>
sq_intr_err != SQ_INTERRUPT_ERROR_TYPE_MEMVIOL) {<br>
kfd_signal_poison_consumed_event(dev, pasid);<br>
- amdgpu_amdkfd_gpu_reset(dev->kgd);<br>
+ if (dev->kfd2kgd->ras_process_cb)<br>
+ dev->kfd2kgd->ras_process_cb(dev->kgd);<br>
+ else<br>
+ amdgpu_amdkfd_gpu_reset(dev->kgd);<br>
return;<br>
}<br>
break;<br>
@@ -253,7 +256,10 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,<br>
kfd_signal_event_interrupt(pasid, context_id0 & 0xfffffff, 28);<br>
} else if (source_id == SOC15_INTSRC_SDMA_ECC) {<br>
kfd_signal_poison_consumed_event(dev, pasid);<br>
- amdgpu_amdkfd_gpu_reset(dev->kgd);<br>
+ if (dev->kfd2kgd->ras_process_cb)<br>
+ dev->kfd2kgd->ras_process_cb(dev->kgd);<br>
+ else<br>
+ amdgpu_amdkfd_gpu_reset(dev->kgd);<br>
return;<br>
}<br>
} else if (client_id == SOC15_IH_CLIENTID_VMC ||<br>
diff --git a/drivers/gpu/drm/amd/include/kgd_kfd_interface.h b/drivers/gpu/drm/amd/include/kgd_kfd_interface.h<br>
index c84bd7b2cf59..9e6525871ad4 100644<br>
--- a/drivers/gpu/drm/amd/include/kgd_kfd_interface.h<br>
+++ b/drivers/gpu/drm/amd/include/kgd_kfd_interface.h<br>
@@ -301,6 +301,7 @@ struct kfd2kgd_calls {<br>
int *max_waves_per_cu);<br>
void (*program_trap_handler_settings)(struct kgd_dev *kgd,<br>
uint32_t vmid, uint64_t tba_addr, uint64_t tma_addr);<br>
+ int (*ras_process_cb)(struct kgd_dev *kgd);<br>
};<br>
<br>
#endif /* KGD_KFD_INTERFACE_H_INCLUDED */<br>
--<br>
2.17.1<br>
</div>
</span></font></div>
</div>
</body>
</html>