From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which notifies userspace about a GPU reset, and can also provide some related information:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags bitmap, to which more flags can be appended as and when required.

V2: Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit (rather than 64)
- set flag when VRAM is invalid (rather than valid)
- add process name as well (Amar)

Cc: Alexander Deucher alexander.deucher@amd.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Amaranath Somalapuram amaranath.somalapuram@amd.com
Signed-off-by: Shashank Sharma shashank.sharma@amd.com
---
 drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
 include/drm/drm_sysfs.h     | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_sysfs_hotplug_event);
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
+{
+	unsigned char pid_str[13];
+	unsigned char flags_str[15];
+	unsigned char pname_str[TASK_COMM_LEN + 6];
+	unsigned char reset_str[] = "RESET=1";
+	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
+
+	if (!reset_info) {
+		DRM_WARN("No reset info, not sending the event\n");
+		return;
+	}
+
+	DRM_DEBUG("generating reset event\n");
+
+	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
+	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
+	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
 /**
  * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
  * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _DRM_SYSFS_H_
 #define _DRM_SYSFS_H_
+#include <linux/sched.h>
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)

 struct drm_device;
 struct device;
 struct drm_connector;
 struct drm_property;

+struct drm_reset_event {
+	uint32_t pid;
+	uint32_t flags;
+	char pname[TASK_COMM_LEN];
+};
+
 int drm_class_device_register(struct device *dev);
 void drm_class_device_unregister(struct device *dev);

 void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
 void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
 void drm_sysfs_connector_status_event(struct drm_connector *connector,
 				      struct drm_property *property);
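For reference, a minimal userspace listener for this uevent could look like the sketch below. The property names (RESET, PID, NAME, FLAGS) follow the envp strings built in drm_sysfs_reset_event() above; everything else is illustrative.

/* reset-listen.c: watch for the proposed GPU reset uevent via libudev.
 * Build with: cc reset-listen.c -ludev */
#include <poll.h>
#include <stdio.h>
#include <libudev.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;

	/* only DRM devices are interesting here */
	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	while (poll(&pfd, 1, -1) > 0) {
		struct udev_device *dev = udev_monitor_receive_device(mon);

		if (!dev)
			continue;
		/* RESET=1 marks the event added by this patch */
		if (udev_device_get_property_value(dev, "RESET"))
			printf("GPU reset: pid=%s name=%s flags=%s\n",
			       udev_device_get_property_value(dev, "PID"),
			       udev_device_get_property_value(dev, "NAME"),
			       udev_device_get_property_value(dev, "FLAGS"));
		udev_device_unref(dev);
	}
	udev_unref(udev);
	return 0;
}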
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a work function, which sends a GPU reset uevent and some contextual information, like the PID and some status flags. This work should be scheduled during a GPU reset.

Userspace can do some recovery and post-processing work based on this event and information.

V2: Addressed review comments from Christian
- Changed the name of the work to gpu_reset_event_work
- Added a structure to accommodate some additional information (like a PID and some flags)
- Do not add new structure in amdgpu.h

Cc: Alexander Deucher alexander.deucher@amd.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Amaranath Somalapuram amaranath.somalapuram@amd.com
Signed-off-by: Shashank Sharma shashank.sharma@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d8b854fcbffa..6a97c585bdfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -61,6 +61,7 @@
 #include <drm/drm_gem.h>
 #include <drm/drm_ioctl.h>
 #include <drm/gpu_scheduler.h>
+#include <drm/drm_sysfs.h>
 #include <kgd_kfd_interface.h>
 #include "dm_pp_interface.h"
@@ -813,6 +814,7 @@ struct amd_powerplay {
 #define AMDGPU_RESET_MAGIC_NUM 64
 #define AMDGPU_MAX_DF_PERFMONS 4
 #define AMDGPU_PRODUCT_NAME_LEN 64
+
 struct amdgpu_device {
 	struct device *dev;
 	struct pci_dev *pdev;
@@ -1063,6 +1065,7 @@ struct amdgpu_device {
 	int asic_reset_res;
 	struct work_struct xgmi_reset_work;
+	struct work_struct gpu_reset_event_work;
 	struct list_head reset_list;

 	long gfx_timeout;
@@ -1097,6 +1100,7 @@ struct amdgpu_device {
 	pci_channel_state_t pci_channel_state;

 	struct amdgpu_reset_control *reset_cntl;
+	struct drm_reset_event reset_event_info;
 	uint32_t ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];

 	bool ram_is_direct_mapped;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ed077de426d9..1aef07fd0dff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -73,6 +73,7 @@
 #include <linux/pm_runtime.h>

 #include <drm/drm_drv.h>
+#include <drm/drm_sysfs.h>

 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
@@ -3277,6 +3278,17 @@ bool amdgpu_device_has_dc_support(struct amdgpu_device *adev)
 	return amdgpu_device_asic_has_dc_support(adev->asic_type);
 }

+static void amdgpu_device_reset_event_func(struct work_struct *__work)
+{
+	struct amdgpu_device *adev = container_of(__work, struct amdgpu_device,
+						  gpu_reset_event_work);
+	/*
+	 * A GPU reset has happened, inform the userspace and pass the
+	 * reset related information.
+	 */
+	drm_sysfs_reset_event(&adev->ddev, &adev->reset_event_info);
+}
+
 static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
 {
 	struct amdgpu_device *adev =
@@ -3525,6 +3537,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 			  amdgpu_device_delay_enable_gfx_off);

 	INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
+	INIT_WORK(&adev->gpu_reset_event_work, amdgpu_device_reset_event_func);

 	adev->gfx.gfx_off_req_count = 1;
 	adev->pm.ac_power = power_supply_is_system_supplied() > 0;
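Note that the patch does not show the call site which fills in reset_event_info and schedules the work; presumably the reset path does something along the following lines. This is a sketch only - the task_info source and the vram_lost flag are assumptions, not part of the patch:

	/* in the reset path, before kicking the worker (illustrative) */
	adev->reset_event_info.pid = task_info.pid;
	adev->reset_event_info.flags = vram_lost ?
			DRM_GPU_RESET_FLAG_VRAM_INVALID : 0;
	strscpy(adev->reset_event_info.pname, task_info.process_name,
		TASK_COMM_LEN);
	schedule_work(&adev->gpu_reset_event_work);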
Hi,
Maybe it would be a good idea to state the intended use-case in the commit message? And explain why the current (driver-specific IIRC) APIs aren't enough?
Since this introduces new uAPI, can you point to a user-space patch which uses the new uAPI? See this link for more info on DRM user-space requirements: https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace...
Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and DRM_WARN. This allows identifying on which device the uevent happens.
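For illustration, the two prints in the patch would then become something like:

	drm_warn(dev, "No reset info, not sending the event\n");
	drm_dbg(dev, "generating reset event\n");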
On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma contactshashanksharma@gmail.com wrote:
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
reset_info can be const.
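That is:

void drm_sysfs_reset_event(struct drm_device *dev,
			   const struct drm_reset_event *reset_info);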
On 3/9/2022 8:47 AM, Simon Ser wrote:
Hi,
Maybe it would be a good idea to state the intended use-case in the commit message?
It was added in the second patch, but yeah, it probably makes more sense to add a cover letter.
And explain why the current (driver-specific IIRC) APIs
aren't enough?
As you mentioned, we also started with a driver-specific API, and then realized that most of the drivers can benefit from this infrastructure, as most of them would be more or less interested in similar information.
Since this introduces new uAPI, can you point to a user-space patch which uses the new uAPI? See this link for more info on DRM user-space requirements: https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace...
The userspace work is in progress; this was posted to get general feedback from the community and gauge interest.
Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and DRM_WARN. This allows identifying on which device the uevent happens.
Well, I don't want to break the consistency across the file, as the other functions seem to be using the DRM_DEBUG family of prints.
On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma contactshashanksharma@gmail.com wrote:
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
reset_info can be const.
Sure.
- Shashank
Am 08.03.22 um 19:04 schrieb Shashank Sharma:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
V2: Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit (rather than 64)
- set flag when VRAM is invalid (rather than valid)
- add process name as well (Amar)

Cc: Alexander Deucher alexander.deucher@amd.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Amaranath Somalapuram amaranath.somalapuram@amd.com
Signed-off-by: Shashank Sharma shashank.sharma@amd.com

 drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
 include/drm/drm_sysfs.h     | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_sysfs_hotplug_event);
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
+{
+	unsigned char pid_str[13];
+	unsigned char flags_str[15];
+	unsigned char pname_str[TASK_COMM_LEN + 6];
+	unsigned char reset_str[] = "RESET=1";
+	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
In general we try to sort those longest first and maybe break envp on multiple lines.
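For illustration, that could look like this (sketch):

	unsigned char pname_str[TASK_COMM_LEN + 6];
	unsigned char reset_str[] = "RESET=1";
	unsigned char flags_str[15];
	unsigned char pid_str[13];
	char *envp[] = { reset_str, pid_str,
			 pname_str, flags_str, NULL };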
+	if (!reset_info) {
+		DRM_WARN("No reset info, not sending the event\n");
+		return;
+	}
Please don't do stuff like that, providing no reset_info is a coding error and should really result in a crash.
Regards, Christian.
+	DRM_DEBUG("generating reset event\n");
+
+	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
+	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
+	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
 /**
  * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
  * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _DRM_SYSFS_H_
 #define _DRM_SYSFS_H_
+#include <linux/sched.h>
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
 struct drm_device;
 struct device;
 struct drm_connector;
 struct drm_property;

+struct drm_reset_event {
+	uint32_t pid;
+	uint32_t flags;
+	char pname[TASK_COMM_LEN];
+};
+
 int drm_class_device_register(struct device *dev);
 void drm_class_device_unregister(struct device *dev);
void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
 void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
 void drm_sysfs_connector_status_event(struct drm_connector *connector,
 				      struct drm_property *property);
Hi Shashank,
On 08/03/2022 19:04, Shashank Sharma wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
Would it be possible to include the app parameters as well?
piglit has a shader_runner test application that can cause hangs, and it's run like this:
shader_runner 1.shader_test
Knowing that shader_runner caused the hang is not really useful without the "1.shader_test" part.
Thanks, Pierre-Eric
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
V2: Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit (rather than 64)
- set flag when VRAM is invalid (rather than valid)
- add process name as well (Amar)
Cc: Alexander Deucher alexander.deucher@amd.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Amaranath Somalapuram amaranath.somalapuram@amd.com
Signed-off-by: Shashank Sharma shashank.sharma@amd.com

 drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
 include/drm/drm_sysfs.h     | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_sysfs_hotplug_event);
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
+{
+	unsigned char pid_str[13];
+	unsigned char flags_str[15];
+	unsigned char pname_str[TASK_COMM_LEN + 6];
+	unsigned char reset_str[] = "RESET=1";
+	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
+
+	if (!reset_info) {
+		DRM_WARN("No reset info, not sending the event\n");
+		return;
+	}
+
+	DRM_DEBUG("generating reset event\n");
+
+	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
+	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
+	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
 /**
  * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
  * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _DRM_SYSFS_H_
 #define _DRM_SYSFS_H_
+#include <linux/sched.h>
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
 struct drm_device;
 struct device;
 struct drm_connector;
 struct drm_property;

+struct drm_reset_event {
+	uint32_t pid;
+	uint32_t flags;
+	char pname[TASK_COMM_LEN];
+};
+
 int drm_class_device_register(struct device *dev);
 void drm_class_device_unregister(struct device *dev);
 void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
 void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
 void drm_sysfs_connector_status_event(struct drm_connector *connector,
 				      struct drm_property *property);
On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com wrote:
Would it be possible to include the app parameters as well?
Can all processes read sysfs events?
There might be security implications here. The app parameters might contain sensitive information, like passwords or tokens.
Can't the uevent receiver query the kernel to get that info from the PID?
Am 09.03.22 um 11:10 schrieb Simon Ser:
On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com wrote:
Would it be possible to include the app parameters as well?
Can all processes read sysfs events?
No, but application parameters are usually not secret.
There might be security implications here. The app parameters might contain sensitive information, like passwords or tokens.
It's a well-known security vulnerability to provide passwords etc. as application parameters, since everybody can read them through procfs.
Can't the uevent receiver query the kernel to get that info from the PID?
I'm also leaning towards just providing the PID and not the application name. The information should be as short as possible.
Regards, Christian.
On Wednesday, March 9th, 2022 at 11:24, Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 11:10 schrieb Simon Ser:
On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com wrote:
Would it be possible to include the app parameters as well?
Can all processes read sysfs events?
No, but application parameters are usually not secret.
It's a bad practice, yes. But people still do it.
There might be security implications here. The app parameters might contain sensitive information, like passwords or tokens.
It's a well-known security vulnerability to provide passwords etc. as application parameters, since everybody can read them through procfs.
I was thinking mostly about Flatpak apps here. Flatpak apps are running in a process namespace so they can't query the CLI parameters of PIDs outside of the namespace. But they might still receive sysfs uevents.
On 09/03/2022 11:24, Christian König wrote:
Am 09.03.22 um 11:10 schrieb Simon Ser:
On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com wrote:
Would it be possible to include the app parameters as well?
Can all processes read sysfs events?
No, but application parameters are usually not secret.
There might be security implications here. The app parameters might contain sensitive information, like passwords or tokens.
It's a well-known security vulnerability to provide passwords etc. as application parameters, since everybody can read them through procfs.
Can't the uevent receiver query the kernel to get that info from the PID?
I'm also leaning towards just providing the PID and not the application name. The information should be as short as possible.
Indeed you're both right. The PID + reading procfs should be enough.
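For illustration, a receiver could recover the full command line from the PID with something like the sketch below (argv entries in /proc/<pid>/cmdline are NUL-separated; the process may already be gone by the time the event is handled):

#include <stdio.h>

static void print_cmdline(unsigned int pid)
{
	char path[64], buf[4096];
	size_t n, i;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%u/cmdline", pid);
	f = fopen(path, "r");
	if (!f)
		return;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	/* argv strings are NUL-separated; make them printable */
	for (i = 0; i < n; i++)
		if (buf[i] == '\0')
			buf[i] = ' ';
	buf[n] = '\0';
	printf("cmdline: %s\n", buf);
}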
Thanks, Pierre-Eric
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
BR, -R
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
Regards, Christian.
BR, -R
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
BR, -R
[1] One small change to allow-list the drm/msm devcore dumps was needed after a privacy review/audit of what is contained in the GPU devcored to ensure it does not contain any PII .. for CrOS on amd devices I'd be happy to walk whoever deals with amd CrOS devices through the process offline.
Regards, Christian.
BR, -R
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
- Shashank
BR, -R
[1] One small change to allow-list the drm/msm devcore dumps was needed after a privacy review/audit of what is contained in the GPU devcored to ensure it does not contain any PII .. for CrOS on amd devices I'd be happy to walk whoever deals with amd CrOS devices through the process offline.
Regards, Christian.
BR, -R
On 2022-03-10 11:21, Sharma, Shashank wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
- Shashank
Another point I raised in another thread is that it looks like we might want to use devcoredump during ASIC reset to dump more HW-related data which is useful for debugging. It means the user client will have to extract the PID and process name from a bigger data set - is that OK? We can probably put it at the beginning for easiest parsing.
Andrey
BR, -R
[1] One small change to allow-list the drm/msm devcore dumps was needed after a privacy review/audit of what is contained in the GPU devcored to ensure it does not contain any PII .. for CrOS on amd devices I'd be happy to walk whoever deals with amd CrOS devices through the process offline.
Regards, Christian.
BR, -R
On Thu, Mar 10, 2022 at 8:27 AM Andrey Grodzovsky andrey.grodzovsky@amd.com wrote:
On 2022-03-10 11:21, Sharma, Shashank wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
- Shashank
Another point I raised in another thread is that it looks like we might want to use devcoredump during ASIC reset to dump more HW-related data which is useful for debugging. It means the user client will have to extract the PID and process name from a bigger data set - is that OK? We can probably put it at the beginning for easiest parsing.
Yes, this is what we do for drm/msm.. the start of the devcore file has something like:
----
kernel: 5.14.0-rc3-debug+
module: msm
time: 1632763923.453207637
comm: deqp-gles3:sq0
cmdline: ./deqp-gles31 --deqp-surface-width=256 --deqp-surface-height=256 --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-visibility=hidden --deqp-caselist-file=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.caselist.txt --deqp-log-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.qpa --deqp-log-flush=disable --deqp-shadercache-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/t499826814672.shader_cache --deqp-shadercache-truncate=disable
revision: 618 (6.1.8.0)
----
We capture quite a lot of state, cmdstream that triggered the hang, register/state dumps, microcontroller state, etc. But we do go out of our way to not capture textures or caches that might contain texture data by default (for privacy reasons)
It has been hugely useful for debugging a few issues that happen rarely enough that they are difficult to reproduce. I guess that is crowd-sourced debugging ;-)
BR, -R
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
BR, -R
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
From: Shashank Sharma shashank.sharma@amd.com
This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)
This patch also introduces the first flag of the flags bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
- Shashank
BR, -R
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
contactshashanksharma@gmail.com wrote:
> From: Shashank Sharma shashank.sharma@amd.com
>
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
I don't think we need (or should encourage/allow) something drm specific when there is already an existing solution used by both drm and non-drm drivers. Userspace should not have to learn to support yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
BR, -R
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> contactshashanksharma@gmail.com wrote:
>> From: Shashank Sharma shashank.sharma@amd.com
>>
>> This patch adds a new sysfs event, which will indicate
>> the userland about a GPU reset, and can also provide
>> some information like:
>> - process ID of the process involved with the GPU reset
>> - process name of the involved process
>> - the GPU status info (using flags)
>>
>> This patch also introduces the first flag of the flags
>> bitmap, which can be appended as and when required.
> Why invent something new, rather than using the already existing devcoredump?
Yeah, that's a really valid question.
> I don't think we need (or should encourage/allow) something drm
> specific when there is already an existing solution used by both drm
> and non-drm drivers. Userspace should not have to learn to support
> yet another mechanism to do the same thing.
Question is how is userspace notified about new available core dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
Yes, that is correct. The uevent for devcoredump is from device_add().
https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L3340
BR, -R
On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König
ckoenig.leichtzumerken@gmail.com wrote:
> Am 09.03.22 um 19:12 schrieb Rob Clark:
>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>> contactshashanksharma@gmail.com wrote:
>>> From: Shashank Sharma shashank.sharma@amd.com
>>>
>>> This patch adds a new sysfs event, which will indicate
>>> the userland about a GPU reset, and can also provide
>>> some information like:
>>> - process ID of the process involved with the GPU reset
>>> - process name of the involved process
>>> - the GPU status info (using flags)
>>>
>>> This patch also introduces the first flag of the flags
>>> bitmap, which can be appended as and when required.
>> Why invent something new, rather than using the already existing
>> devcoredump?
>
> Yeah, that's a really valid question.
>
>> I don't think we need (or should encourage/allow) something drm
>> specific when there is already an existing solution used by both drm
>> and non-drm drivers. Userspace should not have to learn to support
>> yet another mechanism to do the same thing.
>
> Question is how is userspace notified about new available core
> dumps?
I haven't looked into it too closely, as the CrOS userspace crash-reporter already had support for devcoredump, so it "just worked" out of the box[1]. I believe a udev event is what triggers the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
Yes, that is correct. The uevent for devcoredump is from device_add().
Yes, I could confirm in the code that device_add() sends a uevent.
kobject_uevent(&dev->kobj, KOBJ_ADD);
I was trying to map ChromiumOS's udev event rules to the event being sent from device_add(); what I could see is that there is only one udev rule for any DRM subsystem events in ChromiumOS's 99-crash-reporter.rules:
ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \ RUN+="/sbin/crash_reporter --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
Can someone confirm that this is the rule which gets triggered when a devcoredump is generated ? I could not find an ERROR=1 string in the env[] while sending this event from dev_add();
- Shashank
https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L3340
BR, -R
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> ckoenig.leichtzumerken@gmail.com wrote:
>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>> contactshashanksharma@gmail.com wrote:
>>>> [snip]
>>> Why invent something new, rather than using the already existing
>>> devcoredump?
>>
>> Yeah, that's a really valid question.
>>
>>> I don't think we need (or should encourage/allow) something drm
>>> specific when there is already an existing solution used by both drm
>>> and non-drm drivers. Userspace should not have to learn to support
>>> yet another mechanism to do the same thing.
>>
>> Question is how is userspace notified about new available core
>> dumps?
>
> I haven't looked into it too closely, as the CrOS userspace
> crash-reporter already had support for devcoredump, so it "just
> worked" out of the box[1]. I believe a udev event is what triggers
> the crash-reporter to go read the devcore dump out of sysfs.
I had a quick look at the devcoredump code, and it doesn't look like that is sending an event to the user, so we still need an event to indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
Yes, that is correct. The uevent for devcoredump is from device_add().
Yes, I could confirm in the code that device_add() sends a uevent.
kobject_uevent(&dev->kobj, KOBJ_ADD);
I was trying to map ChromiumOS's udev event rules to the event being sent from device_add(); what I could see is that there is only one udev rule for any DRM subsystem events in ChromiumOS's 99-crash-reporter.rules:
ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \ RUN+="/sbin/crash_reporter --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
Can someone confirm that this is the rule which gets triggered when a devcoredump is generated ? I could not find an ERROR=1 string in the env[] while sending this event from dev_add();
I think it is actually this rule:
ACTION=="add", SUBSYSTEM=="devcoredump", \ RUN+="/sbin/crash_reporter --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
It is something non-drm specific because it supports devcore dumps from non drm drivers. I know at least some of the wifi and remoteproc drivers use it.
BR, -R
On 3/10/2022 8:35 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
shashank.sharma@amd.com wrote:
> On 3/10/2022 4:24 PM, Rob Clark wrote:
>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>> ckoenig.leichtzumerken@gmail.com wrote:
>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>> contactshashanksharma@gmail.com wrote:
>>>>> [snip]
>>>> Why invent something new, rather than using the already existing
>>>> devcoredump?
>>>
>>> Yeah, that's a really valid question.
>>>
>>>> I don't think we need (or should encourage/allow) something drm
>>>> specific when there is already an existing solution used by both drm
>>>> and non-drm drivers. Userspace should not have to learn to support
>>>> yet another mechanism to do the same thing.
>>>
>>> Question is how is userspace notified about new available core
>>> dumps?
>>
>> I haven't looked into it too closely, as the CrOS userspace
>> crash-reporter already had support for devcoredump, so it "just
>> worked" out of the box[1]. I believe a udev event is what triggers
>> the crash-reporter to go read the devcore dump out of sysfs.
>
> I had a quick look at the devcoredump code, and it doesn't look like
> that is sending an event to the user, so we still need an event to
> indicate a GPU reset.
There definitely is an event to userspace, I suspect somewhere down the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
Yes, that is correct. The uevent for devcoredump is from device_add().
Yes, I could confirm in the code that device_add() sends a uevent.
kobject_uevent(&dev->kobj, KOBJ_ADD);
I was trying to map ChromiumOS's udev event rules to the event being sent from device_add(); what I could see is that there is only one udev rule for any DRM subsystem events in ChromiumOS's 99-crash-reporter.rules:
ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \ RUN+="/sbin/crash_reporter --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
Can someone confirm that this is the rule which gets triggered when a devcoredump is generated ? I could not find an ERROR=1 string in the env[] while sending this event from dev_add();
I think it is actually this rule:
ACTION=="add", SUBSYSTEM=="devcoredump", \ RUN+="/sbin/crash_reporter --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
It is something non-drm specific because it supports devcore dumps from non drm drivers. I know at least some of the wifi and remoteproc drivers use it.
Ah, this seems like a problem for me. I understand it will work well for a reset/recovery app, but for a DRM client (like a compositor) that wants to listen only to DRM events (like a GPU reset), wouldn't this create a lot of noise? Every time any subsystem produces a coredump, there will be a change in the devcoredump subsystem, and the client will have to parse the core file and then decide whether it wants to react or ignore it.

Wouldn't a GPU reset event, specific to the DRM subsystem, serve better in such a case?
- Shashank
BR, -R
On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 8:35 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 6:10 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> shashank.sharma@amd.com wrote:
>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>> ckoenig.leichtzumerken@gmail.com wrote:
>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>> contactshashanksharma@gmail.com wrote:
>>>>>> [snip]
>>>>> Why invent something new, rather than using the already existing
>>>>> devcoredump?
>>>>
>>>> Yeah, that's a really valid question.
>>>>
>>>>> I don't think we need (or should encourage/allow) something drm
>>>>> specific when there is already an existing solution used by both drm
>>>>> and non-drm drivers. Userspace should not have to learn to support
>>>>> yet another mechanism to do the same thing.
>>>>
>>>> Question is how is userspace notified about new available core
>>>> dumps?
>>>
>>> I haven't looked into it too closely, as the CrOS userspace
>>> crash-reporter already had support for devcoredump, so it "just
>>> worked" out of the box[1]. I believe a udev event is what triggers
>>> the crash-reporter to go read the devcore dump out of sysfs.
>>
>> I had a quick look at the devcoredump code, and it doesn't look like
>> that is sending an event to the user, so we still need an event to
>> indicate a GPU reset.
>
> There definitely is an event to userspace, I suspect somewhere down
> the device_add() path?
Let me check that out as well, hope that is not due to a driver-private event for GPU reset, coz I think I have seen some of those in a few DRM drivers.
Definitely no driver private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from event triggered by device_add()
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter...
Yes, that is correct. The uevent for devcoredump is from device_add().
Yes, I could confirm in the code that device_add() sends a uevent.
kobject_uevent(&dev->kobj, KOBJ_ADD);
I was trying to map ChromiumOS's udev event rules to the event being sent from device_add(); what I could see is that there is only one udev rule for any DRM subsystem events in ChromiumOS's 99-crash-reporter.rules:
ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \ RUN+="/sbin/crash_reporter --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
Can someone confirm that this is the rule which gets triggered when a devcoredump is generated ? I could not find an ERROR=1 string in the env[] while sending this event from dev_add();
I think it is actually this rule:
ACTION=="add", SUBSYSTEM=="devcoredump", \ RUN+="/sbin/crash_reporter --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
It is something non-drm specific because it supports devcore dumps from non drm drivers. I know at least some of the wifi and remoteproc drivers use it.
Ah, this seems like a problem for me. I understand it will work well for a reset/recovery app, but for a DRM client (like a compositor) that wants to listen only to DRM events (like a GPU reset), wouldn't this create a lot of noise? Every time any subsystem produces a coredump, there will be a change in the devcoredump subsystem, and the client will have to parse the core file and then decide whether it wants to react or ignore it.

Wouldn't a GPU reset event, specific to the DRM subsystem, serve better in such a case?
So, I suppose there are two different use-cases.. for something like distro which has generic crash telemetry (ie. when users opt in to automated crash reporting), and in general for debugging gpu crashes, you want devcoredump, preferably with plenty of information about gpu state, etc, so you actually have a chance of debugging problems you can't necessarily reproduce locally. Note also that mesa CI has some limited support for collecting devcore dumps if a CI run triggers a GPU fault.
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
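To make that concrete, a sketch of the drm_event route, modeled on virtio_gpu_fence_event_create(), might look like the following. The event type value and payload struct are made-up names for illustration only; a real event type would have to be reserved in the uAPI headers:

#define DRM_EVENT_GPU_RESET 0x80000002	/* hypothetical type value */

struct drm_gpu_reset_event {
	struct drm_event base;
	__u32 pid;
	__u32 flags;
};

struct drm_gpu_reset_pending_event {
	struct drm_pending_event base;
	struct drm_gpu_reset_event event;
};

static int drm_reset_event_queue(struct drm_device *dev,
				 struct drm_file *file,
				 u32 pid, u32 flags)
{
	struct drm_gpu_reset_pending_event *e;
	int ret;

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return -ENOMEM;

	e->event.base.type = DRM_EVENT_GPU_RESET;
	e->event.base.length = sizeof(e->event);
	e->event.pid = pid;
	e->event.flags = flags;

	/* ties the event to this DRM file's event space */
	ret = drm_event_reserve_init(dev, file, &e->base, &e->event.base);
	if (ret) {
		kfree(e);
		return ret;
	}

	drm_send_event(dev, &e->base);
	return 0;
}

The client would then read the event from its DRM fd like any other DRM event (e.g. the way vblank events are consumed).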
BR, -R
On 3/10/2022 8:56 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 8:35 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank shashank.sharma@amd.com wrote:
On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
shashank.sharma@amd.com wrote:
> On 3/10/2022 6:10 PM, Rob Clark wrote:
>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>> shashank.sharma@amd.com wrote:
>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>> ckoenig.leichtzumerken@gmail.com wrote:
>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>> contactshashanksharma@gmail.com wrote:
>>>>>>> [snip]
>>>>>> Why invent something new, rather than using the already existing
>>>>>> devcoredump?
>>>>>
>>>>> Yeah, that's a really valid question.
>>>>>
>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>> specific when there is already an existing solution used by both drm
>>>>>> and non-drm drivers. Userspace should not have to learn to support
>>>>>> yet another mechanism to do the same thing.
>>>>>
>>>>> Question is how is userspace notified about new available core
>>>>> dumps?
>>>>
>>>> I haven't looked into it too closely, as the CrOS userspace
>>>> crash-reporter already had support for devcoredump, so it "just
>>>> worked" out of the box[1]. I believe a udev event is what triggers
>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>
>>> I had a quick look at the devcoredump code, and it doesn't look like
>>> that is sending an event to the user, so we still need an event to
>>> indicate a GPU reset.
>>
>> There definitely is an event to userspace, I suspect somewhere down
>> the device_add() path?
>
> Let me check that out as well, hope that is not due to a driver-private
> event for GPU reset, coz I think I have seen some of those in a few DRM
> drivers.
Definitely no driver-private event for drm/msm .. I haven't dug through it all but this is the collector for devcoredump, triggered somehow via udev. Most likely from an event triggered by device_add()
https://chromium.g...
Yes, that is correct. The uevent for devcoredump is from device_add().
Yes, I could confirm in the code that device_add() sends a uevent.
kobject_uevent(&dev->kobj, KOBJ_ADD);
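(For anyone who wants to double-check this locally, the KOBJ_ADD uevent can be watched from userspace while a dump is generated, e.g. with "udevadm monitor --kernel --subsystem-match=devcoredump".)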
I was trying to map ChromiumOS's udev event rules to the event being sent from device_add(). What I could see is that there is only one udev rule for any DRM subsystem events in ChromiumOS's 99-crash-reporter.rules:
ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \ RUN+="/sbin/crash_reporter --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
Can someone confirm that this is the rule which gets triggered when a devcoredump is generated? I could not find an ERROR=1 string in the env[] while sending this event from device_add().
I think it is actually this rule:
ACTION=="add", SUBSYSTEM=="devcoredump", \ RUN+="/sbin/crash_reporter --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
It is something non-drm-specific because it supports devcore dumps from non-drm drivers. I know at least some of the wifi and remoteproc drivers use it.
Ah, this seems like a problem for me. I understand it will work well for a reset/recovery app, but for a DRM client (like a compositor) that wants to listen only to DRM events (like a GPU reset), wouldn't this create a lot of noise? Every time any subsystem produces a coredump, there will be a change in the devcoredump subsystem, and the client will have to parse the core file and then decide whether it wants to react to it or ignore it.
Wouldn't a GPU reset event, specific to the DRM subsystem, serve better in such a case?
So, I suppose there are two different use-cases.. for something like a distro which has generic crash telemetry (ie. when users opt in to automated crash reporting), and in general for debugging gpu crashes, you want devcoredump, preferably with plenty of information about gpu state, etc, so you actually have a chance of debugging problems you can't necessarily reproduce locally. Note also that mesa CI has some limited support for collecting devcore dumps if a CI run triggers a GPU fault.
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Exactly; in fact, that should have been a part of my cover letter. Maybe I will send a V2 with better doc.
- Shashank
BR, -R
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
Thanks, pq
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong; meanwhile the GPU has stopped accepting work until the apps re-init their context, so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs. You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general, e.g. "is my context lost?". That is fine for restarting your context, but it doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, you'd need to re-init your context anyway.
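(For context, the application-side query Alex describes looks roughly like this in GL; a sketch assuming a context created with a reset notification strategy of GL_LOSE_CONTEXT_ON_RESET, using the glGetGraphicsResetStatus() entry point from GL 4.5 / KHR_robustness, with the recovery helper being a hypothetical app-side function:)

/* prototypes come from your GL loader; GL 4.5 core or KHR_robustness */
GLenum status = glGetGraphicsResetStatus();

switch (status) {
case GL_NO_ERROR:
        /* no reset observed, keep rendering */
        break;
case GL_GUILTY_CONTEXT_RESET:   /* this context caused the reset */
case GL_INNOCENT_CONTEXT_RESET: /* some other context did */
case GL_UNKNOWN_CONTEXT_RESET:
        /* context is lost either way: tear down and recreate the
         * context and all GPU resources (hypothetical app helper) */
        recreate_context_and_resources();
        break;
}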
Alex
Thanks, pq
On Mon, 14 Mar 2022 10:23:27 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong; meanwhile the GPU has stopped accepting work until the apps re-init their context, so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs.
Hi,
how would this new event help with that?
I mean, yeah, there could be a daemon that kills those GPU users, but then what? You still lose any unsaved work, and may need to manually restart them.
Is the idea that it is better to have the app crash and disappear than to look like it froze while it otherwise still runs?
If some daemon or compositor goes killing apps that trigger GPU resets, then how do we stop that for an app that actually does use the appropriate EGL or Vulkan APIs to detect and remedy that situation itself?
You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general, e.g. "is my context lost?". That is fine for restarting your context, but it doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, you'd need to re-init your context anyway.
Why should something like a compositor listen for this and kill apps that triggered GPU resets, instead of e.g. Mesa noticing that in the app and killing itself? Mesa in the app would know if robustness API is being used.
Would be really nice to have the answers to all these questions to be collected and reiterated in the next version of this proposal.
Thanks, pq
On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Mon, 14 Mar 2022 10:23:27 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong; meanwhile the GPU has stopped accepting work until the apps re-init their context, so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs.
Hi,
how would this new event help with that?
This event would provide notification that a GPU reset occurred.
I mean, yeah, there could be a daemon that kills those GPU users, but then what? You still lose any unsaved work, and may need to manually restart them.
Is the idea that it is better to have the app crash and disappear than to look like it froze while it otherwise still runs?
Yes. The daemon could also send the user some sort of notification that a GPU reset occurred.
If some daemon or compositor goes killing apps that trigger GPU resets, then how do we stop that for an app that actually does use the appropriate EGL or Vulkan APIs to detect and remedy that situation itself?
I guess the daemon could keep some sort of whitelist. OTOH, very few, if any, applications (especially games) actually support these extensions.
You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general, e.g. "is my context lost?". That is fine for restarting your context, but it doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, you'd need to re-init your context anyway.
Why should something like a compositor listen for this and kill apps that triggered GPU resets, instead of e.g. Mesa noticing that in the app and killing itself? Mesa in the app would know if robustness API is being used.
That's another possibility, but it doesn't handle the case where the compositor doesn't support any sort of robustness extension so if the GPU was reset, you'd lose your desktop anyway even if the app kept running.
Would be really nice to have the answers to all these questions to be collected and reiterated in the next version of this proposal.
The idea is to provide the notification of a GPU reset. What the various desktop environments or daemons do with it is up to them. I still think there is value in a notification even if you don't kill apps or anything like that. E.g., you can have a daemon running that gets notified and logs the error, collects debug info, sends an email, etc.
Alex
Thanks, pq
On Tue, 15 Mar 2022 10:54:38 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Mon, 14 Mar 2022 10:23:27 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong; meanwhile the GPU has stopped accepting work until the apps re-init their context, so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs.
Hi,
how would this new event help with that?
This event would provide notification that a GPU reset occurred.
I mean, yeah, there could be a daemon that kills those GPU users, but then what? You still lose any unsaved work, and may need to manually restart them.
Is the idea that it is better to have the app crash and disappear than to look like it froze while it otherwise still runs?
Yes.
Ok. That was just a wild guess.
The daemon could also send the user some sort of notification that a GPU reset occurred.
If some daemon or compositor goes killing apps that trigger GPU resets, then how do we stop that for an app that actually does use the appropriate EGL or Vulkan APIs to detect and remedy that situation itself?
I guess the daemon could keep some sort of whitelist. OTOH, very few, if any, applications (especially games) actually support these extensions.
I would think that a white-list is a non-starter due to the maintenance nightmare and the "wrong by default" design for well-behaved programs.
I am not a fan of optimising for broken software. I understand that with games this is routine, but we're talking about everything here, not just games. Games cannot be fixed, but the rest could if the problem was not swept under the rug. It's like the inverse of the platform problem.
You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general, e.g. "is my context lost?". That is fine for restarting your context, but it doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, you'd need to re-init your context anyway.
Why should something like a compositor listen for this and kill apps that triggered GPU resets, instead of e.g. Mesa noticing that in the app and killing itself? Mesa in the app would know if robustness API is being used.
That's another possibility, but it doesn't handle the case where the compositor doesn't support any sort of robustness extension so if the GPU was reset, you'd lose your desktop anyway even if the app kept running.
Why does that matter?
A GPU reset happens when it happens. If a compositor does not use robustness extensions, it's as good as dead anyway, right?
Killing a compositor from inside Mesa if it doesn't use robustness might be better than leaving the compositor running blind - assuming the compositor does not quit itself after seeing crucial EGL/Vulkan calls failing.
Would be really nice to have the answers to all these questions to be collected and reiterated in the next version of this proposal.
The idea is to provide the notification of a GPU reset. What the various desktop environments or daemons do with it is up to them. I still think there is value in a notification even if you don't kill apps or anything like that. E.g., you can have a daemon running that gets notified and logs the error, collects debug info, sends an email, etc.
With new UAPI comes the demand for userspace proof, not hand-waving. You would not be proposing this new interface if you didn't have use cases in mind, even just one. You have to document what you imagine the new thing to be used for, so that the appropriateness can be evaluated. If the use case is deemed inappropriate for the proposed UAPI, you need to find another use case to justify adding the new UAPI. If there is no use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping someone finds use for it seems backwards to me.
Thanks, pq
On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Tue, 15 Mar 2022 10:54:38 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Mon, 14 Mar 2022 10:23:27 -0400 Alex Deucher alexdeucher@gmail.com wrote:
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen ppaalanen@gmail.com wrote:
On Thu, 10 Mar 2022 11:56:41 -0800 Rob Clark robdclark@gmail.com wrote:
For something like just notifying a compositor that a gpu crash happened, perhaps drm_event is more suitable. See virtio_gpu_fence_event_create() for an example of adding new event types. Although maybe you want it to be an event which is not device specific. This isn't so much of a debugging use-case as simply notification.
Hi,
for this particular use case, are we now talking about the display device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing. How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to deliver that?
The above would be about device crashes that directly affect the display server. Is that the use case in mind here, or is it instead about notifying the display server that some application has caused a driver/hardware crash? If the latter, how should a display server react to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it explained.
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong; meanwhile the GPU has stopped accepting work until the apps re-init their context, so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs.
Hi,
how would this new event help with that?
This event would provide notification that a GPU reset occurred.
I mean, yeah, there could be a daemon that kills those GPU users, but then what? You still lose any unsaved work, and may need to manually restart them.
Is the idea that it is better to have the app crash and disappear than to look like it froze while it otherwise still runs?
Yes.
Ok. That was just a wild guess.
The daemon could also send the user some sort of notification that a GPU reset occurred.
If some daemon or compositor goes killing apps that trigger GPU resets, then how do we stop that for an app that actually does use the appropriate EGL or Vulkan APIs to detect and remedy that situation itself?
I guess the daemon could keep some sort of whitelist. OTOH, very few, if any, applications (especially games) actually support these extensions.
I would think that a white-list is a non-starter due to the maintenance nightmare and the "wrong by default" design for well-behaved programs.
I am not a fan of optimising for broken software. I understand that with games this is routine, but we're talking about everything here, not just games. Games cannot be fixed, but the rest could if the problem was not swept under the rug. It's like the inverse of the platform problem.
Fair enough, but it hasn't happened in the last 15 years or so.
You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general, e.g. "is my context lost?". That is fine for restarting your context, but it doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, you'd need to re-init your context anyway.
Why should something like a compositor listen for this and kill apps that triggered GPU resets, instead of e.g. Mesa noticing that in the app and killing itself? Mesa in the app would know if robustness API is being used.
That's another possibility, but it doesn't handle the case where the compositor doesn't support any sort of robustness extension so if the GPU was reset, you'd lose your desktop anyway even if the app kept running.
Why does that matter?
A GPU reset happens when it happens. If a compositor does not use robustness extensions, it's as good as dead anyway, right?
Right. Same with the application that supports robustness. If the app supports robustness, but the compositor does not, the app is dead anyway if the compositor dies. So while there may be an application today that supports robustness, it is kind of pointless because nothing else does and so if it manages to recover, nothing else it relies on does.
Killing a compositor from inside Mesa if it doesn't use robustness might be better than leaving the compositor running blind - assuming the compositor does not quit itself after seeing crucial EGL/Vulkan calls failing.
I'm not sure I follow the second part of this statement. I guess having mesa kill applications that don't support robustness is fine, but I don't really see it as much better than the status quo. Today most apps just continue to try and run, possibly dying eventually due to undefined state along the way. If mesa kills them, they are killed. The end result is the same, the user loses their desktop.
Would be really nice to have the answers to all these questions to be collected and reiterated in the next version of this proposal.
The idea is to provide the notification of a GPU reset. What the various desktop environments or daemons do with it is up to them. I still think there is value in a notification even if you don't kill apps or anything like that. E.g., you can have a daemon running that gets notified and logs the error, collects debug info, sends an email, etc.
With new UAPI comes the demand for userspace proof, not hand-waving. You would not be proposing this new interface if you didn't have use cases in mind, even just one. You have to document what you imagine the new thing to be used for, so that the appropriateness can be evaluated. If the use case is deemed inappropriate for the proposed UAPI, you need to find another use case to justify adding the new UAPI. If there is no use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping someone finds use for it seems backwards to me.
We do have a use case. It's what I described originally. There is a user space daemon (could be a compositor, could be something else) that runs and listens for GPU reset notifications. When it receives a notification, it takes action and kills the guilty app, restarts the compositor, and gathers any relevant data related to the GPU hang (if possible). We can revisit this discussion once we have the whole implementation complete. Other drivers seem to do similar things already today via different means (msm using devcoredump, i915 seems to have its own GPU reset notification mechanism, etc.). It just seemed like there was value in having a generic drm GPU reset notification, but maybe not yet.
Alex
Thanks, pq
On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen ppaalanen@gmail.com wrote:
[snip]
With new UAPI comes the demand for userspace proof, not hand-waving. You would not be proposing this new interface if you didn't have use cases in mind, even just one. You have to document what you imagine the new thing to be used for, so that the appropriateness can be evaluated. If the use case is deemed inappropriate for the proposed UAPI, you need to find another use case to justify adding the new UAPI. If there is no use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping someone finds use for it seems backwards to me.
We do have a use case. It's what I described originally. There is a user space daemon (could be a compositor, could be something else) that runs and listens for GPU reset notifications. When it receives a notification, it takes action and kills the guilty app, restarts the compositor, and gathers any relevant data related to the GPU hang (if possible). We can revisit this discussion once we have the whole implementation complete. Other drivers seem to do similar things already today via different means (msm using devcoredump, i915 seems to have its own GPU reset notification mechanism, etc.). It just seemed like there was value in having a generic drm GPU reset notification, but maybe not yet.
just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
BR, -R
On Wed, Mar 16, 2022 at 11:35 AM Rob Clark robdclark@gmail.com wrote:
On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen ppaalanen@gmail.com wrote:
[snip]
With new UAPI comes the demand for userspace proof, not hand-waving. You would not be proposing this new interface if you didn't have use cases in mind, even just one. You have to document what you imagine the new thing to be used for, so that the appropriateness can be evaluated. If the use case is deemed inappropriate for the proposed UAPI, you need to find another use case to justify adding the new UAPI. If there is no use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping someone finds use for it seems backwards to me.
We do have a use case. It's what I described originally. There is a user space daemon (could be a compositor, could be something else) that runs and listens for GPU reset notifications. When it receives a notification, it takes action and kills the guilty app, restarts the compositor, and gathers any relevant data related to the GPU hang (if possible). We can revisit this discussion once we have the whole implementation complete. Other drivers seem to do similar things already today via different means (msm using devcoredump, i915 seems to have its own GPU reset notification mechanism, etc.). It just seemed like there was value in having a generic drm GPU reset notification, but maybe not yet.
just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
Sure. You could use this proposed event for telemetry gathering as well. The important part is the event. What you do with it is up to the user.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Agreed. I think devcoredump makes sense and we'll ultimately enable support for that in amdgpu. I think there is value in a GPU specific reset event as well for the use case I outlined, but maybe the devcoredump one is enough. We'd just need to rely on the userspace side to filter as appropriate for events it cares about. I'm not sure how many other devcore dump events are commonly seen.
Alex
On Wed, Mar 16, 2022 at 8:48 AM Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Mar 16, 2022 at 11:35 AM Rob Clark robdclark@gmail.com wrote:
On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen ppaalanen@gmail.com wrote:
[snip]
With new UAPI comes the demand for userspace proof, not hand-waving. You would not be proposing this new interface if you didn't have use cases in mind, even just one. You have to document what you imagine the new thing to be used for, so that the appropriateness can be evaluated. If the use case is deemed inappropriate for the proposed UAPI, you need to find another use case to justify adding the new UAPI. If there is no use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping someone finds use for it seems backwards to me.
We do have a use case. It's what I described originally. There is a user space daemon (could be a compositor, could be something else) that runs and listens for GPU reset notifications. When it receives a notification, it takes action and kills the guilty app, restarts the compositor, and gathers any relevant data related to the GPU hang (if possible). We can revisit this discussion once we have the whole implementation complete. Other drivers seem to do similar things already today via different means (msm using devcoredump, i915 seems to have its own GPU reset notification mechanism, etc.). It just seemed like there was value in having a generic drm GPU reset notification, but maybe not yet.
just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
Sure. You could use this proposed event for telemetry gathering as well. The important part is the event. What you do with it is up to the user.
True, but for that purpose (telemetry) let's leverage something that userspace already has support for
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Agreed. I think devcoredump makes sense and we'll ultimately enable support for that in amdgpu. I think there is value in a GPU specific reset event as well for the use case I outlined, but maybe the devcoredump one is enough. We'd just need to rely on the userspace side to filter as appropriate for events it cares about. I'm not sure how many other devcore dump events are commonly seen.
They are, like CPU coredumps, not expected to be a frequent thing. There are a number of other non-GPU drivers which also use devcoredump (mostly wifi and remoteproc, ie. things like DSPs). We did also recently add msm devcoredump support for the display side of things, in addition to gpu, to dump display state if we hit underruns, link training failures, etc. Each "crash" creates a new transient devcore device so crashes from different devices can be reported in parallel. But we do rate-limit the GPU crash reports by not reporting another one until the previous one is read out and cleared by userspace. (You don't strictly need to do that, but it is probably a good idea.. usually the first crash is the interesting one.)
The ChromeOS crash-reporter does have an allow-list for which devcore dumps are sent back, mainly because in the case of wifi drivers it is hard to ensure that the dump does not include PII (like network names, etc). It would be pretty trivial to add amdgpu to the allow-list and set up a "magic signature" so that queries could be run on amdgpu devcore dumps. I can help out with that, since I went through the same process already for msm.
I'm not entirely sure what other distros do in this area.
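(For readers unfamiliar with the driver-side devcoredump API Rob is describing: it boils down to a single call. A minimal sketch, where the gpu struct and the snapshot helper are hypothetical placeholders:)

#include <linux/devcoredump.h>

void gpu_crash_report(struct my_gpu *gpu)       /* hypothetical driver struct */
{
        size_t len;
        /* capture hw/sw state into a vmalloc'd buffer; placeholder helper */
        void *state = my_gpu_snapshot(gpu, &len);

        if (!state)
                return;

        /*
         * devcoredump takes ownership of the buffer: it creates a transient
         * device under /sys/class/devcoredump/ and vfree()s the data once
         * userspace reads it or the timeout expires.
         */
        dev_coredumpv(gpu->dev, state, len, GFP_KERNEL);
}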
BR, -R
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
Thanks, Christian.
BR, -R
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
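(Schematically, the split Daniel describes might look like this; all of the my_gpu_* names and the crash_work/crash_regs fields are hypothetical placeholders:)

#include <linux/workqueue.h>

/* reset path: atomic context, no blocking allocations allowed */
static void my_gpu_handle_hang(struct my_gpu *gpu)
{
        my_gpu_capture_regs(gpu, gpu->crash_regs);       /* pre-allocated storage */
        queue_work(system_unbound_wq, &gpu->crash_work); /* defer the heavy part */
        my_gpu_hw_reset(gpu);
}

/*
 * worker: GFP_KERNEL is fine here; because the context is non-recoverable,
 * no new batches run on it, so the batchbuffer and indirect-state bos are
 * stable and can be copied out at leisure.
 */
static void my_gpu_crash_worker(struct work_struct *work)
{
        struct my_gpu *gpu = container_of(work, struct my_gpu, crash_work);

        my_gpu_copy_bos(gpu, GFP_KERNEL);
        my_gpu_publish_dump(gpu);       /* e.g. hand the result to devcoredump */
}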
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
Am 17.03.22 um 10:29 schrieb Daniel Vetter:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Ok, good to know that it's as simple as that.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens.
This is what vk wants
That's exactly what I've been telling an internal team for a couple of years now as well. Good to know that this is not that totally crazy.
and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
Interesting idea, so basically we initially capture only the state we need for the reset, and grab a reference on the killed application to gather the rest before we clean it up.
Going to keep that in mind as well.
Thanks, Christian.
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
We do one GEM obj allocation in the snapshot path (the hw has a mechanism to snapshot its own state into a gpu buffer.. not sure if nice debugging functionality like that is a commentary on the blob driver quality, but I'm not complaining)
I suppose we could pre-allocate this buffer up-front.. but it doesn't seem like a problem, ie. if allocation fails we just skip snapshotting stuff that needs the hw crashdumper. I guess since vram is not involved, perhaps that makes the situation a bit more straightforward.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
Note that the bo's associated with the batch are still pinned at this point, from the bo lifecycle the batch is still active. So from the point of view of shrinker, there should be no interaction. We aren't doing anything with mmu notifiers (yet), so not entirely sure offhand the concern there.
Currently we just use GFP_KERNEL and bail if allocation fails.
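(i.e., roughly the following shape, with all helper and type names being hypothetical placeholders:)

/* best effort: one GEM allocation for the hw crashdumper */
static void gpu_snapshot(struct my_gpu *gpu, struct my_gpu_state *state)
{
        state->dump_bo = my_gem_new(gpu, CRASHDUMP_SIZE); /* GFP_KERNEL inside */
        if (state->dump_bo)
                my_hw_crashdump(gpu, state->dump_bo);  /* hw dumps its own state */
        /* else: skip the hw-assisted part of the snapshot */

        my_capture_ring_and_bos(gpu, state);   /* sw-visible state regardless */
}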
BR, -R
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 17, 2022 at 08:34:21AM -0700, Rob Clark wrote:
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
We do one GEM obj allocation in the snapshot path (the hw has a mechanism to snapshot its own state into a gpu buffer.. not sure if nice debugging functionality like that is a commentary on the blob driver quality, but I'm not complaining)
I suppose we could pre-allocate this buffer up-front.. but it doesn't seem like a problem, ie. if allocation fails we just skip snapshotting stuff that needs the hw crashdumper. I guess since vram is not involved, perhaps that makes the situation a bit more straightforward.
The problem is that you need to allocate with GFP_ATOMIC, instead of GFP_KERNEL, or things go very bad.
The scheduler dma-fence annotations I've had (well still have them here) would catch this stuff, but thus far they got nowhere.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
Note that the bo's associated with the batch are still pinned at this point, from the bo lifecycle the batch is still active. So from the point of view of shrinker, there should be no interaction. We aren't doing anything with mmu notifiers (yet), so not entirely sure offhand the concern there.
Currently we just use GFP_KERNEL and bail if allocation fails.
Yeah you have a simple enough shrinker for this not to be a problem. The issue is that sooner or later things tend to not stay like that, and we're trying to have common rules for dma_fence to make sure everyone follows the same rules. -Daniel
BR, -R
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
fwiw, we snapshot everything (cmdstream and bo's marked with dump flag, in addition to hw state) before resuming the GPU, so there is no danger of things being trampled. After state is captured and GPU reset, we "replay" the submits that were written into the ringbuffer after the faulting submit. GPU crashes should be a thing you don't need to try to optimize.
(At some point, I'd like to use scheduler for the replay, and actually use drm_sched_stop()/etc.. but last time I looked there were still some sched bugs in that area which prevented me from deleting a bunch of code ;-))
BR, -R
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
fwiw, we snapshot everything (cmdstream and bo's marked with dump flag, in addition to hw state) before resuming the GPU, so there is no danger of things being trampled. After state is captured and GPU reset, we "replay" the submits that were written into the ringbuffer after the faulting submit. GPU crashes should be a thing you don't need to try to optimize.
Not sure why you think we optimize anything here?
(At some point, I'd like to use scheduler for the replay, and actually use drm_sched_stop()/etc.. but last time I looked there were still some sched bugs in that area which prevented me from deleting a bunch of code ;-))
Not sure about your hw, but at least on intel replaying tends to just result in follow-on fun. And that holds even more so the more complex a workload is. This is why vk just dies immediately and does not try to replay anything, offloading it to the app. Same with arb robustness. Afaik it's really only media and classic gl which insist that the driver stack somehow recover.
And recovering from a mess in userspace is a lot simpler than trying to pull off the same magic in the kernel. Plus it also helps with a few of the dma_fence rules, which is a nice bonus. -Daniel
BR, -R
And that way you can record the crashdump (or at least the big pieces like all the indirect state stuff) with GFP_KERNEL.
msm probably gets it wrong since embedded drivers have much less shrinker and generally no mmu notifiers going on :-)
I mean it is strictly forbidden to allocate any memory in the GPU reset path.
I would however *strongly* recommend devcoredump support in other GPU drivers (i915's thing pre-dates devcoredump by a lot).. I've used it to debug and fix a couple obscure issues that I was not able to reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
Am 16.03.22 um 16:36 schrieb Rob Clark:
[SNIP] just one point of clarification.. in the msm and i915 case it is purely for debugging and telemetry (ie. sending crash logs back to distro for analysis if user has crash reporting enabled).. it isn't used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Note that the fancy new plan for i915 discrete gpu is to only support gpu crash dumps on non-recoverable gpu contexts, i.e. those that do not continue to the next batch when something bad happens. This is what vk wants and also what iris now uses (we do context recovery in userspace in all cases), and non-recoverable contexts greatly simplify the crash dump gather: Only thing you need to gather is the register state from hw (before you reset it), all the batchbuffer bo and indirect state bo (in i915 you can mark which bo to capture in the CS ioctl) can be captured in a worker later on. Which for non-recoverable context is no issue, since subsequent batchbuffers won't trample over any of these things.
fwiw, we snapshot everything (cmdstream and bo's marked with dump flag, in addition to hw state) before resuming the GPU, so there is no danger of things being trampled. After state is captured and GPU reset, we "replay" the submits that were written into the ringbuffer after the faulting submit. GPU crashes should be a thing you don't need to try to optimize.
Not sure why you think we optimize anything here?
(At some point, I'd like to use scheduler for the replay, and actually use drm_sched_stop()/etc.. but last time I looked there were still some sched bugs in that area which prevented me from deleting a bunch of code ;-))
Not sure about your hw, but at least on intel replaying tends to just result in follow-on fun. And that holds even more so the more complex a workload is. This is why vk just dies immediately and does not try to replay anything, offloading it to the app. Same with arb robustness. Afaik it's really only media and classic gl which insist that the driver stack somehow recover.
At least for us, each submit must be self-contained (ie. not rely on previous GPU hw state), so in practice replay works out pretty well. The worst case is subsequent submits from the same process fail as well (if they depended on something the crashing submit failed to write back to memory).. but in that case they just crash as well and we move on to the next one.. The recent gens (a5xx+ at least) are pretty good about quickly detecting problems and giving us an error irq.
BR, -R
On 17.03.22 at 18:31, Rob Clark wrote:
On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
At least for us, each submit must be self-contained (ie. not rely on previous GPU hw state), so in practice replay works out pretty well. The worst case is subsequent submits from same process fail as well (if they depended on something that crashing submit failed to write back to memory.. but in that case they just crash as well and we move on to the next one.. the recent gens (a5xx+ at least) are pretty good about quickly detecting problems and giving us an error irq.
Well I absolutely agree with Daniel.
The whole replay thing AMD did in the scheduler is an absolute mess and should probably be killed with fire.
I strongly recommend not to make the same mistake in other drivers.
If you want to have some replay feature then please make it driver specific and don't use anything from the infrastructure in the DRM scheduler.
Thanks, Christian.
On Fri, Mar 18, 2022 at 12:42 AM Christian König christian.koenig@amd.com wrote:
On 17.03.22 at 18:31, Rob Clark wrote:
[SNIP]
Well I absolutely agree with Daniel.
The whole replay thing AMD did in the scheduler is an absolute mess and should probably be killed with fire.
I strongly recommend not to make the same mistake in other drivers.
If you want to have some replay feature then please make it driver specific and don't use anything from the infrastructure in the DRM scheduler.
hmm, perhaps I was not clear, but I'm only talking about re-emitting jobs *following* the faulting one (which could be from other contexts, etc).. not trying to restart the faulting job.
You *absolutely* need to replay jobs following the faulting one; they could be from unrelated contexts/processes. You can't just drop them on the floor.
Currently it is all driver specific, but I wanted to delete a lot of code and move to using scheduler to handle faults/timeouts (but blocked on that until [1] is resolved)
[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-...
BR, -R
On 18.03.22 at 16:12, Rob Clark wrote:
On Fri, Mar 18, 2022 at 12:42 AM Christian König christian.koenig@amd.com wrote:
[SNIP]
hmm, perhaps I was not clear, but I'm only talking about re-emitting jobs *following* the faulting one (which could be from other contexts, etc).. not trying to restart the faulting job.
You *absolutely* need to replay jobs following the faulting one; they could be from unrelated contexts/processes. You can't just drop them on the floor.
Well you can, it just means that their contexts are lost as well.
If you re-submit jobs which were already pushed to the hardware you absolutely need to make sure of a couple of things:
1. Don't race with your hardware. E.g. you need a way to stop processing in case of a timeout and then double check once more if things haven't finished in the meantime.
2. Make absolutely sure you never re-submit an operation when its dma-fence is already signaled. Otherwise you run into memory corruption.
3. When you have multiple engines it becomes really tricky because then even innocent jobs might have already been started on different queues which now hang.
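A minimal sketch of what point 2 means for a driver-specific resubmit pass (struct my_job and the pending list are hypothetical; dma_fence_is_signaled() is the real check):

#include <linux/dma-fence.h>
#include <linux/list.h>

struct my_job {
	struct list_head node;
	struct dma_fence *done_fence;
};

/* Hypothetical resubmit pass after a reset: only jobs whose fence has
 * not signaled may be pushed to the hardware again. */
static void my_resubmit_pending(struct list_head *pending,
				void (*hw_submit)(struct my_job *))
{
	struct my_job *job;

	list_for_each_entry(job, pending, node) {
		/* A signaled fence means its buffers may already have
		 * been reused; running the job again corrupts memory. */
		if (dma_fence_is_signaled(job->done_fence))
			continue;
		hw_submit(job);
	}
}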
Currently it is all driver specific, but I wanted to delete a lot of code and move to using scheduler to handle faults/timeouts (but blocked on that until [1] is resolved)
Please don't.
Especially don't use the pending_list or any of the scheduler infrastructure for GPU reset. We need to get rid of that again sooner or later.
This is extremely hardware dependent and pushing the amdgpu specific handling into the GPU scheduler was a mistake we shouldn't repeat for other drivers.
Regards, Christian.
[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-...
On Mon, Mar 21, 2022 at 2:30 AM Christian König christian.koenig@amd.com wrote:
On 18.03.22 at 16:12, Rob Clark wrote:
[SNIP]
Well you can, it just means that their contexts are lost as well.
Which is rather inconvenient when deqp-egl reset tests, for example, take down your compositor ;-)
(Which for even more lolz, in CrOS restarts the android container or vm.. which makes running android-cts deqp kinda funny)
If you re-submit jobs which were already pushed to the hardware you absolutely need to make sure of a couple of things:
1. Don't race with your hardware. E.g. you need a way to stop processing in case of a timeout and then double check once more if things haven't finished in the meantime.
2. Make absolutely sure you never re-submit an operation when its dma-fence is already signaled. Otherwise you run into memory corruption.
3. When you have multiple engines it becomes really tricky because then even innocent jobs might have already been started on different queues which now hang.
We force power-off/on the GPU to reset it, which is a pretty good way to make sure we aren't racing with the GPU.
It's worked like this since pretty much the beginning, and in the early days of bringing up mesa support for a new gen we tend to exercise the gpu hang/recovery path quite a lot.. so it at least seems pretty robust ;-)
BR, -R
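For drivers whose power domain really does kill the GPU, such a reset-by-power-cycle could look roughly like the sketch below; my_gpu_reset() is hypothetical, pm_runtime_force_suspend()/pm_runtime_force_resume() are real helpers, and whether they actually power-gate the GPU is entirely SoC-specific:

#include <linux/pm_runtime.h>

/* Hypothetical: reset the GPU by forcing its power domain off and on,
 * so no in-flight work can race with the recovery code. */
static int my_gpu_reset(struct device *dev)
{
	int ret;

	ret = pm_runtime_force_suspend(dev); /* GPU is now powered off */
	if (ret)
		return ret;

	/* ... capture crash state, rebuild the ringbuffer, etc ... */

	return pm_runtime_force_resume(dev); /* power on and reinit hw */
}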
Hi,
On Mon, 21 Mar 2022 at 16:02, Rob Clark robdclark@gmail.com wrote:
On Mon, Mar 21, 2022 at 2:30 AM Christian König christian.koenig@amd.com wrote:
Well you can, it just means that their contexts are lost as well.
Which is rather inconvenient when deqp-egl reset tests, for example, take down your compositor ;-)
Yeah. Or anything WebGL.
System-wide collateral damage is definitely a non-starter. If that means that the userspace driver has to do what iris does and ensure everything's recreated and resubmitted, that works too, just as long as the response to 'my adblocker didn't detect a crypto miner ad' is something better than 'shoot the entire user session'.
Cheers, Daniel
On Wed, 23 Mar 2022 at 15:07, Daniel Stone daniel@fooishbar.org wrote:
[SNIP]
System-wide collateral damage is definitely a non-starter. If that means that the userspace driver has to do what iris does and ensure everything's recreated and resubmitted, that works too, just as long as the response to 'my adblocker didn't detect a crypto miner ad' is something better than 'shoot the entire user session'.
Not sure where that idea came from, I thought at least I made it clear that legacy gl _has_ to recover. It's only vk and arb_robustness gl which should die without recovery attempt.
The entire discussion here is who should be responsible for replay and at least if you can decide the uapi, then punting that entirely to userspace is a good approach.
Ofc it'd be nice if the collateral damage is limited, i.e. requests not currently on the gpu, or on different engines and all that shouldn't be nuked, if possible.
Also ofc since the msm uapi is that the kernel tries to recover, there's not much we can do there; contexts cannot be shot. But still trying to replay them as much as possible feels a bit like overkill. -Daniel
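For reference, the vk side of "just dies immediately" is the application seeing VK_ERROR_DEVICE_LOST and rebuilding everything itself; a minimal sketch, with recreate_device() left as a hypothetical application callback:

#include <vulkan/vulkan.h>

/* On device loss the old VkDevice is unusable; the app must rebuild
 * device, queues, pipelines and resources from the VkPhysicalDevice. */
static int submit_or_recover(VkQueue queue, const VkSubmitInfo *submit,
			     void (*recreate_device)(void))
{
	VkResult res = vkQueueSubmit(queue, 1, submit, VK_NULL_HANDLE);

	if (res == VK_ERROR_DEVICE_LOST) {
		recreate_device(); /* no replay by the driver stack */
		return -1;
	}
	return res == VK_SUCCESS ? 0 : -1;
}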
[Adding Marek and Andrey as well]
On 23.03.22 at 16:14, Daniel Vetter wrote:
[SNIP]
Not sure where that idea came from, I thought at least I made it clear that legacy gl _has_ to recover. It's only vk and arb_robustness gl which should die without recovery attempt.
The entire discussion here is who should be responsible for replay and at least if you can decide the uapi, then punting that entirely to userspace is a good approach.
Yes, completely agree. We have the approach of re-submitting things in the kernel and that failed quite miserably.
In other words currently a GPU reset has something like a 99% chance to take down your whole desktop.
Daniel, can you briefly explain what exactly iris does when a lost context is detected without gl robustness?
It sounds like you guys got that working quite well.
Thanks, Christian.
amdgpu has 2 resets: soft reset and hard reset.
The soft reset is able to recover from an infinite loop and even some GPU hangs due to bad shaders or bad states. The soft reset uses a signal that kills all currently-running shaders of a certain process (VM context), which unblocks the graphics pipeline, so draws and command buffers finish but are not correct. This can then cause a hard hang if the shader was supposed to signal work completion through a shader store instruction and a non-shader consumer is waiting for it (skipping the store instruction by killing the shader won't signal the work, and thus the consumer will be stuck, requiring a hard reset).
The hard reset can recover from other hangs, which is great, but it may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory contents, but we should assume that any process that had running jobs on the GPU during a GPU reset has its memory resources in an inconsistent state, and thus following command buffers can cause another GPU hang. The shader store example above is enough to cause another hard hang due to incorrect content in memory resources, which can contain synchronization primitives that are used internally by the hardware.
Asking the driver to replay a command buffer that caused a hang is a sure way to hang it again. Unrelated processes can be affected due to lost VRAM or the misfortune of using the GPU while the GPU hang occurred. The window system should recreate GPU resources and redraw everything without affecting applications. If apps use GL, they should do the same. Processes that can't recover by redrawing content can be terminated or left alone, but they shouldn't be allowed to submit work to the GPU anymore.
dEQP only exercises the soft reset. I think WebGL is only able to trigger a soft reset at this point, but Vulkan can also trigger a hard reset.
Marek
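On the GL side Marek describes, an app (or compositor) with a robust context would poll for resets roughly like the sketch below; recreate_gl_context() is a hypothetical application callback, and glGetGraphicsResetStatus() is the real GL 4.5 / KHR_robustness entry point (assumed here to be resolved by a GL loader such as epoxy):

#include <epoxy/gl.h>

/* Hypothetical per-frame check: on any reset status the context is
 * gone; recreate it, re-upload all GPU resources, then redraw. */
static int check_reset_and_recover(void (*recreate_gl_context)(void))
{
	GLenum status = glGetGraphicsResetStatus();

	if (status == GL_NO_ERROR)
		return 0; /* context still valid */

	/* GL_GUILTY_CONTEXT_RESET, GL_INNOCENT_CONTEXT_RESET or
	 * GL_UNKNOWN_CONTEXT_RESET: either way, rebuild and redraw. */
	recreate_gl_context();
	return 1;
}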
My main question is what does the iris driver do better than radeonsi when the client doesn't support the robustness extension?
From Daniel's description it sounds like they have at least a partial recovery mechanism in place.
Apart from that I completely agree with what you said below.
Christian.
On 26.03.22 at 01:53, Olsak, Marek wrote:
I don't know what iris does, but I would guess that the same problems as with AMD GPUs apply, making GPU resets very fragile.
Marek
On Tue., Mar. 29, 2022, 08:14 Christian König, christian.koenig@amd.com wrote:
My main question is what does the iris driver do better than radeonsi when the client doesn't support the robustness extension?
[SNIP]
On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
I don't know what iris does, but I would guess that the same problems as with AMD GPUs apply, making GPU resets very fragile.
iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state
is I think the main call chain of how this is handled/detected. There's also a side-chain which handles -EIO from execbuf.
Also this is using non-recoverable contexts, i.e. any time they suffer from a gpu reset (either because guilty themselves, or collateral damage of a reset that shot more than just the guilty context) the context stops entirely and refuses any further execbuf with -EIO.
Cheers, Daniel
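The uapi shape of that, roughly: contexts are created with I915_CONTEXT_PARAM_RECOVERABLE cleared, and execbuf returning -EIO is the signal to rebuild. A compressed sketch against the real i915 uapi (error handling mostly elided; the submit wrapper and the replace_kernel_ctx callback are stand-ins for what iris does internally):

#include <errno.h>
#include <sys/ioctl.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Mark a context non-recoverable: after any GPU reset that affects it,
 * every further execbuf on it fails with -EIO. */
static int make_ctx_nonrecoverable(int fd, __u32 ctx_id)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_RECOVERABLE,
		.value = 0,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

/* Hypothetical submit wrapper: -EIO means the context was killed, so
 * userspace recreates context plus state and resubmits. */
static int submit(int fd, struct drm_i915_gem_execbuffer2 *eb,
		  void (*replace_kernel_ctx)(void))
{
	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, eb) == 0)
		return 0;
	if (errno == EIO)
		replace_kernel_ctx(); /* rebuild, then the caller retries */
	return -errno;
}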
On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
Ofc it'd be nice if the collateral damage is limited, i.e. requests not currently on the gpu, or on different engines and all that shouldn't be nuked, if possible.
Also ofc since msm uapi is that the kernel tries to recover there's not much we can do there, contexts cannot be shot. But still trying to replay them as much as possible feels a bit like overkill.
It would perhaps have been nice for older gens, which don't (yet) have per-process pgtables, to have gone with userspace replays (although that would require a lot more tracking in userspace than what is done currently).. but fortunately those older gens don't use "state objects" which could potentially be corrupted, and instead re-emit state in the cmdstream, so there is a lot less possibility for bad collateral damage. (On all gens we also use gpu read-only buffers whenever the gpu does not need to be able to write them.)
For newer stuff, the process isolation works pretty well. In fact we recently changed MSM_PARAM_FAULTS to only report faults/hangs in the same address space, so the compositor is not even aware (and doesn't need to be aware).
BR, -R
On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
On Fri, Mar 18, 2022 at 12:42 AM Christian König christian.koenig@amd.com wrote:
[SNIP]
hmm, perhaps I was not clear, but I'm only talking about re-emitting jobs *following* the faulting one (which could be from other contexts, etc).. not trying to restart the faulting job.
You absolutely can drop jobs on the floor, this is what both anv and iris expect. They use what we call non-recoverable contexts, meaning when any gpu hang happens and the context is affected (whether as the guilty one, or because it was a multi-engine reset and it was victimized) we kill it entirely. No replaying, and any further execbuf ioctl fails with -EIO.
Userspace then gets to sort out the mess, which for vk is VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for non-robust gl iris re-creates a pile of things.
Anything in-between _is_ dropped on the floor completely.
Also note that this is obviously uapi, if you have an userspace which expect contexts to survive, then replaying makes some sense.
You *absolutely* need to replay jobs following the faulting one; they could be from unrelated contexts/processes. You can't just drop them on the floor.
Currently it is all driver specific, but I wanted to delete a lot of code and move to using scheduler to handle faults/timeouts (but blocked on that until [1] is resolved)
Yeah for the drivers where the uapi is "you can safely replay after a hang, and you're supposed to", then sharing the code is ofc a good idea.
Just wanted to make it clear that this is only one of many uapi flavours you can pick from, dropping it all on the floor is a perfectly legit approach :-) And imo it's the more robust one, and also better fits with latest apis like gl_arb_robustness or vk.
Cheers, Daniel
[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-...
On Tue, 15 Mar 2022 at 00:23, Alex Deucher alexdeucher@gmail.com wrote:
[SNIP]
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong, meanwhile the GPU has stopped accepting work until the apps re-init their context so all of their command submissions just get rejected.
Just one thing comes to mind reading this, racy PID reuse.
process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid.
new process 1234 gets destroyed by receiver of sysfs notification.
Dave.
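For concreteness, the consumer side of the proposed uevent would look roughly like this libudev sketch; the RESET/PID/NAME/FLAGS property names come from the patch, and the "do something useful" policy is exactly where the PID-reuse race above bites:

#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon =
		udev_monitor_new_from_netlink(udev, "kernel");
	struct pollfd pfd;

	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (dev && udev_device_get_property_value(dev, "RESET"))
			printf("GPU reset: pid=%s name=%s flags=%s\n",
			       udev_device_get_property_value(dev, "PID"),
			       udev_device_get_property_value(dev, "NAME"),
			       udev_device_get_property_value(dev, "FLAGS"));
		if (dev)
			udev_device_unref(dev);
	}
}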
On Tuesday, March 15th, 2022 at 08:13, Dave Airlie airlied@gmail.com wrote:
Just one thing comes to mind reading this, racy PID reuse.
process 1234 does something bad to GPU. process 1234 dies in parallel to sysfs notification being sent. other process 1234 reuses the pid new process 1234 gets destroyed by receiver of sysfs notification.
This is a more general problem with PIDs. One solution is pidfd.
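A minimal userspace sketch of that: take a pidfd for the reported PID and signal through it, so the signal can never land on a process that merely recycled the number. Note the fd still has to be taken while the PID is known to refer to the right process (ideally the kernel would hand out the pidfd itself); raw syscalls are used since the libc wrappers are recent:

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int kill_via_pidfd(pid_t pid)
{
	int ret;
	int pidfd = syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1; /* already gone: nothing to kill, no race */

	/* A careful receiver would verify identity here (e.g. process
	 * start time) before trusting that the PID was not reused. */
	ret = syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	close(pidfd);
	return ret;
}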
On 15.03.22 at 08:13, Dave Airlie wrote:
[SNIP]
Just one thing comes to mind reading this, racy PID reuse.
process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid.
new process 1234 gets destroyed by receiver of sysfs notification.
That's a well-known problem inherent to the use of PIDs.
IIRC because of this the kernel only reuses PIDs when /proc/sys/kernel/pid_max is reached and then wraps around.
Regards, Christian.
On Mon, Mar 14, 2022 at 10:23:27AM -0400, Alex Deucher wrote:
[SNIP]
The idea is that a support daemon or compositor would listen for GPU reset notifications and do something useful with them (kill the guilty app, restart the desktop environment, etc.). Today when the GPU resets, most applications just continue assuming nothing is wrong, meanwhile the GPU has stopped accepting work until the apps re-init their context so all of their command submissions just get rejected.
Btw. somewhat relatedly, there has been work aiming to allow graceful hot-unplug of DRM devices. There is a kernel doc outlining how the various APIs should react towards userspace when a DRM device suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness APIs. You could use something like that in the compositor, but those APIs tend to be focused more on the application itself rather than the GPU in general. E.g., is my context lost? Which is fine for restarting your context, but doesn't really help if you want to try and do something with another application (i.e., the likely guilty app). Also, on dGPU at least, when you reset the GPU, vram is usually lost (either due to the memory controller being reset, or vram being zero'd on init due to ECC support), so even if you are not the guilty process, in that case you'd need to re-init your context anyway.
Isn't that what arb robustness and all that stuff is for? Doing that through a sysfs event sounds very wrong, since in general apps just don't have access to that. Also the vk equivalent is VK_ERROR_DEVICE_LOST. Iirc both have information like whether the app was the guilty one causing the hang, or whether it was just victimized because the gpu can't do anything else than a full gpu reset which nukes everything (like amdgpu currently has, aside from the thread unblock trick in the first attempt).
And if your app/compositor doesn't use robust contexts then the userspace driver gets to do a best effort attempt at recovery, or exit(). Whatever you can do really.
Also note that you don't actually want an event, but a query ioctl (plus maybe a specific errno on your CS ioctl). Neither of the above flows supports events for gpu resets. RESET_STATS ioctl is the i915 implementation of this stuff.
For the core dump aspect yes pls devcoredump and not reinvented wheels (and i915 is a bad example here, but in defence the i915 sysfs hang event predates devcoredump).
Cheers, Daniel
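The RESET_STATS query mentioned above, for reference; this is existing i915 uapi, and it is what mesa consumes to implement the GL robustness queries:

#include <sys/ioctl.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Per-context reset query: batch_active counts hangs where this
 * context was running on the hardware (guilty), batch_pending hangs
 * where it was merely queued behind one (victimized). */
static int ctx_was_reset(int fd, __u32 ctx_id)
{
	struct drm_i915_reset_stats stats = { .ctx_id = ctx_id };

	if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return -1;

	return stats.batch_active || stats.batch_pending;
}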
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
[SNIP]
+struct drm_reset_event {
uint32_t pid;
One side note, unrelated to devcoredump vs this..
AFAIU you probably want to be passing around a `struct pid *`, and then somehow use pid_vnr() in the context of the process reading the event to get the numeric pid. Otherwise things will not do what you expect if the process triggering the crash is in a different pid namespace from the compositor.
BR, -R
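A sketch of what that could mean for the struct in this patch; drm_reset_event_ns and the helper are hypothetical, get_pid()/pid_vnr() are the real kernel helpers, and pid_vnr() resolves the number relative to the calling task's pid namespace, which is why it has to run in the reader's context (and why a broadcast uevent, which has no single reader, makes this awkward):

#include <linux/pid.h>
#include <linux/sched.h>

/* Hypothetical variant: keep a struct pid reference instead of a raw
 * number captured in whatever namespace the fault happened in. */
struct drm_reset_event_ns {
	struct pid *pid;	/* taken with get_pid(task_pid(task)) at fault time */
	uint32_t flags;
	char pname[TASK_COMM_LEN];
};

/* Must run in the context of the process consuming the event, so the
 * returned number is meaningful in *its* pid namespace. */
static pid_t drm_reset_event_pid(struct drm_reset_event_ns *ev)
{
	return pid_vnr(ev->pid);
}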
On 3/16/2022 10:50 PM, Rob Clark wrote:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
[snip]
+struct drm_reset_event {
+	uint32_t pid;
One side note, unrelated to devcoredump vs this..
AFAIU you probably want to be passing around a `struct pid *`, and then somehow use pid_vnr() in the context of the process reading the event to get the numeric pid. Otherwise things will not do what you expect if the process triggering the crash is in a different pid namespace from the compositor.
I am not sure it is a good idea to add the pid-extraction complexity in here; it is left up to the driver to extract this information and pass it to the work queue. In the case of amdgpu, it is extracted from the GPU VM. That also keeps things more flexible for the drivers.
- Shashank
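[Editor's note: the second patch in this series adds a work function along these lines for amdgpu. Purely as a schematic sketch of the split described here (the driver extracts the info, a work item sends the event, since kobject_uevent_env() can sleep), the `my_gpu_*` names below are hypothetical; only `struct drm_reset_event`, `drm_sysfs_reset_event()` and `DRM_GPU_RESET_FLAG_VRAM_INVALID` come from the patch.]

```c
/* Hypothetical driver-side glue. INIT_WORK(&gdev->reset_event_work,
 * my_gpu_reset_event_func) is assumed to run once at device init. */
#include <drm/drm_device.h>
#include <drm/drm_sysfs.h>
#include <linux/string.h>
#include <linux/workqueue.h>

struct my_gpu_device {
	struct drm_device ddev;
	struct drm_reset_event reset_info;
	struct work_struct reset_event_work;
};

/* Runs in process context, where sending a uevent is safe. */
static void my_gpu_reset_event_func(struct work_struct *work)
{
	struct my_gpu_device *gdev =
		container_of(work, struct my_gpu_device, reset_event_work);

	drm_sysfs_reset_event(&gdev->ddev, &gdev->reset_info);
}

/* Called from the reset path with whatever the driver extracted
 * about the offending context (on amdgpu, from the GPU VM). */
static void my_gpu_schedule_reset_event(struct my_gpu_device *gdev,
					u32 pid, const char *pname,
					bool vram_lost)
{
	gdev->reset_info.pid = pid;
	gdev->reset_info.flags = vram_lost ?
				 DRM_GPU_RESET_FLAG_VRAM_INVALID : 0;
	strscpy(gdev->reset_info.pname, pname, TASK_COMM_LEN);
	schedule_work(&gdev->reset_event_work);
}
```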
On 17.03.22 at 09:42, Sharma, Shashank wrote:
On 3/16/2022 10:50 PM, Rob Clark wrote:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma contactshashanksharma@gmail.com wrote:
[snip]
+struct drm_reset_event {
+	uint32_t pid;
One side note, unrelated to devcoredump vs this..
AFAIU you probably want to be passing around a `struct pid *`, and then somehow use pid_vnr() in the context of the process reading the event to get the numeric pid. Otherwise things will not do what you expect if the process triggering the crash is in a different pid namespace from the compositor.
I am not sure it is a good idea to add the pid-extraction complexity in here; it is left up to the driver to extract this information and pass it to the work queue. In the case of amdgpu, it is extracted from the GPU VM. That also keeps things more flexible for the drivers.
Yeah, but that is just used for debugging.
If we want to use the pid for housekeeping, like for a daemon which kills/restarts processes, we absolutely need that, or otherwise we won't be able to work with containers.
Regards, Christian.
Hi,
On Thu, 17 Mar 2022 at 09:21, Christian König christian.koenig@amd.com wrote:
On 17.03.22 at 09:42, Sharma, Shashank wrote:
AFAIU you probably want to be passing around a `struct pid *`, and then somehow use pid_vnr() in the context of the process reading the event to get the numeric pid. Otherwise things will not do what you expect if the process triggering the crash is in a different pid namespace from the compositor.
I am not sure it is a good idea to add the pid-extraction complexity in here; it is left up to the driver to extract this information and pass it to the work queue. In the case of amdgpu, it is extracted from the GPU VM. That also keeps things more flexible for the drivers.
Yeah, but that is just used for debugging.
If we want to use the pid for housekeeping, like for a daemon which kills/restarts processes, we absolutely need that, or otherwise we won't be able to work with containers.
100% this.
Pushing back to the compositor is a red herring. The compositor is just a service which tries to handle window management and input. If you're looking to kill the offending process or whatever, that should go through the session manager, be it systemd or something container-centric. At least that way it can deal with cgroups at the same time, unlike the compositor, which is not really aware of what the thing on the other end of the socket is doing. This ties in with the support session managers already have for things like coredump analysis, and would also be useful for other devices.
Some environments combine the compositor and the session manager, and many keep them closely related, but they are very definitely not the same thing.
Cheers, Daniel
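[Editor's note: on the consumer side, whether it lands in a compositor or a session manager, listening for the event is straightforward. A minimal sketch using libudev follows; the env keys (RESET, PID, NAME, FLAGS) come from the patch, the rest is an assumption, and the pid-namespace caveat above applies to the PID value.]

```c
/* Minimal uevent listener sketch; build with -ludev. Error handling
 * is mostly omitted for brevity. */
#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon;
	struct pollfd pfd;

	if (!udev)
		return 1;

	mon = udev_monitor_new_from_netlink(udev, "udev");
	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;
		const char *reset, *pid, *name, *flags;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;

		reset = udev_device_get_property_value(dev, "RESET");
		if (reset) {
			pid = udev_device_get_property_value(dev, "PID");
			name = udev_device_get_property_value(dev, "NAME");
			flags = udev_device_get_property_value(dev, "FLAGS");
			/* PID is numeric in the sender's pid namespace; a
			 * session manager could map it to a cgroup via
			 * /proc/<pid>/cgroup and restart that unit. */
			printf("GPU reset: pid=%s name=%s flags=%s\n",
			       pid ? pid : "?", name ? name : "?",
			       flags ? flags : "?");
		}
		udev_device_unref(dev);
	}
}
```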