<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 5/20/2022 7:52 PM, Sharma, Shashank
wrote:<br>
</div>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<br>
<br>
On 5/20/2022 3:49 PM, Somalapuram Amaranath wrote:
<br>
<blockquote type="cite">Added device coredump information:
<br>
- Kernel version
<br>
- Module
<br>
- Time
<br>
- VRAM status
<br>
- Guilty process name and PID
<br>
- GPU register dumps
<br>
<br>
Signed-off-by: Somalapuram Amaranath
<a class="moz-txt-link-rfc2396E" href="mailto:Amaranath.Somalapuram@amd.com"><Amaranath.Somalapuram@amd.com></a>
<br>
---
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 ++
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 59
++++++++++++++++++++++
<br>
2 files changed, 62 insertions(+)
<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
<br>
index c79d9992b113..f28d9c563f74 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
<br>
@@ -1044,6 +1044,9 @@ struct amdgpu_device {
<br>
uint32_t *reset_dump_reg_list;
<br>
uint32_t *reset_dump_reg_value;
<br>
int num_regs;
<br>
+ struct amdgpu_task_info reset_context_task_info;
<br>
+ bool reset_context_vram_lost;
<br>
</blockquote>
<br>
How about drop the 'context' from name and just reset_task_info
and reset_vram_lost ?
<br>
</blockquote>
OK.<br>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<br>
<blockquote type="cite">+ struct timespec64
reset_time;
<br>
struct amdgpu_reset_domain *reset_domain;
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
<br>
index 963c897a76e6..f9b710e741a7 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
<br>
@@ -32,6 +32,8 @@
<br>
#include <linux/slab.h>
<br>
#include <linux/iommu.h>
<br>
#include <linux/pci.h>
<br>
+#include <linux/devcoredump.h>
<br>
+#include <generated/utsrelease.h>
<br>
#include <drm/drm_atomic_helper.h>
<br>
#include <drm/drm_probe_helper.h>
<br>
@@ -4733,6 +4735,55 @@ static int amdgpu_reset_reg_dumps(struct
amdgpu_device *adev)
<br>
return 0;
<br>
}
<br>
+#ifdef CONFIG_DEV_COREDUMP
<br>
+static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t
offset,
<br>
+ size_t count, void *data, size_t datalen)
<br>
+{
<br>
+ struct drm_printer p;
<br>
+ struct amdgpu_device *adev = data;
<br>
+ struct drm_print_iterator iter;
<br>
+ int i;
<br>
+
<br>
</blockquote>
<br>
A NULL check for 'buffer' here could prevent a segfault later.
<br>
<br>
</blockquote>
Agreed.<br>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<blockquote type="cite">+ iter.data = buffer;
<br>
+ iter.offset = 0;
<br>
+ iter.start = offset;
<br>
+ iter.remain = count;
<br>
+
<br>
+ p = drm_coredump_printer(&iter);
<br>
+
<br>
+ drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
<br>
+ drm_printf(&p, "kernel: " UTS_RELEASE "\n");
<br>
+ drm_printf(&p, "module: " KBUILD_MODNAME "\n");
<br>
+ drm_printf(&p, "time: %lld.%09ld\n",
adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
<br>
+ if (adev->reset_context_task_info.pid)
<br>
+ drm_printf(&p, "process_name: %s PID: %d\n",
<br>
+
adev->reset_context_task_info.process_name,
<br>
+
adev->reset_context_task_info.pid);
<br>
</blockquote>
Please fix the alignment of print variables.
<br>
<br>
</blockquote>
I will cross check this.<br>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<blockquote type="cite">+
<br>
+ if (adev->reset_context_vram_lost)
<br>
+ drm_printf(&p, "VRAM is lost due to GPU reset!\n");
<br>
+ if (adev->num_regs) {
<br>
+ drm_printf(&p, "AMDGPU register dumps:\nOffset:
Value:\n");
<br>
+
<br>
+ for (i = 0; i < adev->num_regs; i++)
<br>
+ drm_printf(&p, "0x%08x: 0x%08x\n",
<br>
+ adev->reset_dump_reg_list[i],
<br>
+ adev->reset_dump_reg_value[i]);
<br>
+ }
<br>
+
<br>
+ return count - iter.remain;
<br>
+}
<br>
+
<br>
+static void amdgpu_reset_capture_coredumpm(struct amdgpu_device
*adev)
<br>
+{
<br>
+ struct drm_device *dev = adev_to_drm(adev);
<br>
+
<br>
+ ktime_get_ts64(&adev->reset_time);
<br>
+ dev_coredumpm(dev->dev, THIS_MODULE, adev, 0,
GFP_KERNEL,
<br>
+ amdgpu_devcoredump_read, NULL);
<br>
</blockquote>
instead of registering NULL as free function, I would prefer you
to have a dummy no_op free function registered, which we can
consume if something changes.
<br>
</blockquote>
you mean something like this (function without any code):<br>
<div style="color: #000000;background-color: #ffffff;font-family: Consolas, 'Courier New', monospace;font-weight: normal;font-size: 14px;line-height: 19px;white-space: pre;"><div><span style="color: #0000ff;">static</span><span style="color: #000000;"> </span><span style="color: #0000ff;">void</span><span style="color: #000000;"> amdgpu_devcoredump_free(</span><span style="color: #0000ff;">void</span><span style="color: #000000;"> *data)</span></div></div>
<div style="color: #000000;background-color: #ffffff;font-family: Consolas, 'Courier New', monospace;font-weight: normal;font-size: 14px;line-height: 19px;white-space: pre;"><div><span style="color: #000000;">{</span></div><div><span style="color: #000000;">}</span></div>
</div>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<blockquote type="cite">+}
<br>
+#endif
<br>
+
<br>
int amdgpu_do_asic_reset(struct list_head *device_list_handle,
<br>
struct amdgpu_reset_context *reset_context)
<br>
{
<br>
@@ -4817,6 +4868,14 @@ int amdgpu_do_asic_reset(struct list_head
*device_list_handle,
<br>
goto out;
<br>
vram_lost =
amdgpu_device_check_vram_lost(tmp_adev);
<br>
+#ifdef CONFIG_DEV_COREDUMP
<br>
+ tmp_adev->reset_context_vram_lost =
vram_lost;
<br>
+ tmp_adev->reset_context_task_info.pid = 0;
<br>
</blockquote>
why is the PID hardcoded to 0 ?
<br>
</blockquote>
in case of reset context reset_context->job->vm is null
(possibility that reset can be non VM related). <br>
If we don't set tmp_adev->reset_context_task_info.pid = 0, it
will show previous reset valid PID.
<p><br>
</p>
Regards,<br>
S.Amarnath<br>
<blockquote type="cite" cite="mid:588a0599-7d0c-0041-9877-4429b416e7ed@amd.com">
<blockquote type="cite">+ if (reset_context->job
&& reset_context->job->vm)
<br>
+ tmp_adev->reset_context_task_info =
<br>
+
reset_context->job->vm->task_info;
<br>
+ amdgpu_reset_capture_coredumpm(tmp_adev);
<br>
+#endif
<br>
if (vram_lost) {
<br>
DRM_INFO("VRAM is lost due to GPU
reset!\n");
<br>
<br>
</blockquote>
- Shashank
<br>
amdgpu_inc_vram_lost(tmp_adev);
<br>
</blockquote>
</body>
</html>