<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div class="moz-cite-prefix">On 2022-03-08 12:20, Somalapuram,
Amaranath wrote:<br>
</div>
<blockquote type="cite" cite="mid:abf6d329-6a3f-26f0-1d5b-75b3ff55acfd@amd.com">
<div class="moz-cite-prefix">On 3/8/2022 10:00 PM, Sharma,
Shashank wrote:<br>
</div>
<blockquote type="cite" cite="mid:bc293ab7-db45-2b16-aeb8-291cffef8ba4@amd.com">Hello
Andrey <br>
<br>
On 3/8/2022 5:26 PM, Andrey Grodzovsky wrote: <br>
<blockquote type="cite"> <br>
On 2022-03-07 11:26, Shashank Sharma wrote: <br>
<blockquote type="cite">From: Shashank Sharma <a class="moz-txt-link-rfc2396E" href="mailto:shashank.sharma@amd.com" moz-do-not-send="true">&lt;shashank.sharma@amd.com&gt;</a>
<br>
<br>
This patch adds a work function, which will get scheduled <br>
in the event of a GPU reset, and will send a uevent to userspace with
<br>
some reset context information, like a PID and some flags. <br>
</blockquote>
<br>
<br>
Where is the actual scheduling of the work function?
Shouldn't <br>
there be a patch for that too? <br>
<br>
</blockquote>
<br>
Yes, Amar is working on that patch, on top of these patches.
It should be out soon. I thought it was a good idea to get
quick feedback on the basic patches before we build something on
top of them. <br>
<br>
</blockquote>
<p>schedule_work() will be called in the function
amdgpu_do_asic_reset() <br>
</p>
</blockquote>
<p>I haven't followed the requirements closely, so I may be missing
something, but what about<br>
a job timeout that was able to soft recover: do you need to cover
that case too? Or<br>
is there no need to restart the user application in that case, so
you don't care?</p>
<p>Andrey</p>
<blockquote type="cite" cite="mid:abf6d329-6a3f-26f0-1d5b-75b3ff55acfd@amd.com">
<p>After getting the vram_lost info:<br>
</p>
<p>vram_lost = amdgpu_device_check_vram_lost(tmp_adev);</p>
<p>update amdgpu_reset_event_ctx with the following and call schedule_work():</p>
<ul>
<li>vram_lost</li>
<li>reset_context->job->vm->task_info.process_name</li>
<li>reset_context->job->vm->task_info.pid</li>
</ul>
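<p>The sequence listed above (collect the values, fill the context,
then queue the work) can be sketched as a small userspace mock in C.
This is not driver code: every mock_* name and the
MOCK_RESET_FLAG_VRAM_LOST bit are hypothetical stand-ins, the patch
does not define any flag values, and handling a reset with no
associated job is an assumption of the sketch.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the patch does not define flag values. */
#define MOCK_RESET_FLAG_VRAM_LOST (1u << 0)

/* Same two fields as amdgpu_reset_event_ctx in the patch under review. */
struct mock_reset_event_ctx {
	uint64_t pid;
	uint32_t flags;
};

/* Stand-in for the task_info fields named in the list above. */
struct mock_task_info {
	int pid;
	char process_name[16];
};

/* Stand-in for the parts of amdgpu_device that matter here. */
struct mock_device {
	struct mock_reset_event_ctx reset_event_ctx;
	bool event_work_pending; /* models a queued gpu_reset_event_work */
};

/* Stand-in for schedule_work(&adev->gpu_reset_event_work). */
static void mock_schedule_work(struct mock_device *dev)
{
	dev->event_work_pending = true;
}

/*
 * Fill the context, then queue the work. ti may be NULL, modelling a
 * reset that was not triggered by a job (an assumption of this sketch).
 */
static void mock_report_reset(struct mock_device *dev,
			      const struct mock_task_info *ti,
			      bool vram_lost)
{
	assert(dev != NULL);

	dev->reset_event_ctx.pid = ti ? (uint64_t)ti->pid : 0;
	dev->reset_event_ctx.flags = vram_lost ? MOCK_RESET_FLAG_VRAM_LOST : 0;

	/* Queue only after the context is fully written, so the worker
	 * never reads a half-updated context. */
	mock_schedule_work(dev);
}
```

<p>Note that the context structure in the patch has only pid and
flags, so the sketch does not carry process_name; passing it along
would presumably need an extra field.</p>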
Regards,<br>
S.Amarnath<br>
<blockquote type="cite" cite="mid:bc293ab7-db45-2b16-aeb8-291cffef8ba4@amd.com">-
Shashank <br>
<br>
<blockquote type="cite">Andrey <br>
<br>
<br>
<blockquote type="cite"> <br>
The userspace can do some recovery and post-processing work
<br>
based on this event. <br>
<br>
V2: <br>
- Changed the name of the work to gpu_reset_event_work <br>
(Christian) <br>
- Added a structure to accommodate some additional
information <br>
(like a PID and some flags) <br>
<br>
Cc: Alexander Deucher <a class="moz-txt-link-rfc2396E" href="mailto:alexander.deucher@amd.com" moz-do-not-send="true">&lt;alexander.deucher@amd.com&gt;</a>
<br>
Cc: Christian Koenig <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com" moz-do-not-send="true">&lt;christian.koenig@amd.com&gt;</a>
<br>
Signed-off-by: Shashank Sharma <a class="moz-txt-link-rfc2396E" href="mailto:shashank.sharma@amd.com" moz-do-not-send="true">&lt;shashank.sharma@amd.com&gt;</a>
<br>
--- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 7 +++++++ <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19
+++++++++++++++++++ <br>
2 files changed, 26 insertions(+) <br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
index d8b854fcbffa..7df219fe363f 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
@@ -813,6 +813,11 @@ struct amd_powerplay { <br>
#define AMDGPU_RESET_MAGIC_NUM 64 <br>
#define AMDGPU_MAX_DF_PERFMONS 4 <br>
#define AMDGPU_PRODUCT_NAME_LEN 64 <br>
+struct amdgpu_reset_event_ctx { <br>
+ uint64_t pid; <br>
+ uint32_t flags; <br>
+}; <br>
+ <br>
struct amdgpu_device { <br>
struct device *dev; <br>
struct pci_dev *pdev; <br>
@@ -1063,6 +1068,7 @@ struct amdgpu_device { <br>
int asic_reset_res; <br>
struct work_struct xgmi_reset_work; <br>
+ struct work_struct gpu_reset_event_work; <br>
struct list_head reset_list; <br>
long gfx_timeout; <br>
@@ -1097,6 +1103,7 @@ struct amdgpu_device { <br>
pci_channel_state_t pci_channel_state; <br>
struct amdgpu_reset_control *reset_cntl; <br>
+ struct amdgpu_reset_event_ctx reset_event_ctx; <br>
uint32_t
ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE]; <br>
bool ram_is_direct_mapped; <br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
index ed077de426d9..c43d099da06d 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
@@ -73,6 +73,7 @@ <br>
#include &lt;linux/pm_runtime.h&gt; <br>
#include &lt;drm/drm_drv.h&gt; <br>
+#include &lt;drm/drm_sysfs.h&gt; <br>
MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin"); <br>
MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin"); <br>
@@ -3277,6 +3278,23 @@ bool
amdgpu_device_has_dc_support(struct amdgpu_device *adev) <br>
return
amdgpu_device_asic_has_dc_support(adev->asic_type); <br>
} <br>
+static void amdgpu_device_reset_event_func(struct
work_struct *__work) <br>
+{ <br>
+ struct amdgpu_device *adev = container_of(__work,
struct amdgpu_device, <br>
+ gpu_reset_event_work); <br>
+ struct amdgpu_reset_event_ctx *event_ctx =
&adev->reset_event_ctx; <br>
+ <br>
+ /* <br>
+ * A GPU reset has happened, indicate the userspace and
pass the <br>
+ * following information: <br>
+ * - pid of the process involved, <br>
+ * - if the VRAM is valid or not, <br>
+ * - indicate that userspace may want to collect the
ftrace event <br>
+ * data from the trace event. <br>
+ */ <br>
+ drm_sysfs_reset_event(&adev->ddev,
event_ctx->pid, event_ctx->flags); <br>
+} <br>
+ <br>
static void amdgpu_device_xgmi_reset_func(struct
work_struct *__work) <br>
{ <br>
struct amdgpu_device *adev = <br>
@@ -3525,6 +3543,7 @@ int amdgpu_device_init(struct
amdgpu_device *adev, <br>
amdgpu_device_delay_enable_gfx_off); <br>
INIT_WORK(&adev->xgmi_reset_work,
amdgpu_device_xgmi_reset_func); <br>
+ INIT_WORK(&adev->gpu_reset_event_work,
amdgpu_device_reset_event_func); <br>
adev->gfx.gfx_off_req_count = 1; <br>
adev->pm.ac_power =
power_supply_is_system_supplied() > 0; <br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</body>
</html>