[PATCH v2] drm/amdkfd: Add GPU reset SMI event

Thu Aug 27 03:36:20 UTC 2020

Am 2020-08-26 um 4:01 p.m. schrieb Mukul Joshi:
> Add support for reporting GPU reset events through SMI. KFD
> would report both pre and post GPU reset events.
>
> Signed-off-by: Mukul Joshi <mukul.joshi at amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c     |  4 +++
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h       |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 30 +++++++++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  1 +
>  include/uapi/linux/kfd_ioctl.h              |  2 ++
>  5 files changed, 39 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index e1cd6599529f..aad1ecfa1239 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -812,6 +812,8 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>  	if (!kfd->init_complete)
>  		return 0;
>  
> +	kfd_smi_event_update_gpu_reset(kfd, false);
> +
>  	kfd->dqm->ops.pre_reset(kfd->dqm);
>  
>  	kgd2kfd_suspend(kfd, false);
> @@ -833,6 +835,8 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>  	if (!kfd->init_complete)
>  		return 0;
>  
> +	kfd_smi_event_update_gpu_reset(kfd, true);
> +

You should generate the message only after successfully resuming KFD.

>  	ret = kfd_resume(kfd);
>  	if (ret)
>  		return ret;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 18bc711f97ae..b1a2979e086f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -312,6 +312,8 @@ struct kfd_dev {
>  	/* Clients watching SMI events */
>  	struct list_head smi_clients;
>  	spinlock_t smi_lock;
> +
> +	uint64_t reset_seq_num;

64-bit may be overkill. If you're resetting your GPU so frequently that
a 32-bit counter overflows, nobody wants to buy that GPU. ;)

>  };
>  
>  enum kfd_mempool {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> index 4d4b6e3ab697..4f0590bcb1a3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> @@ -174,6 +174,36 @@ static void add_event_to_kfifo(struct kfd_dev *dev, unsigned int smi_event,
>  	rcu_read_unlock();
>  }
>  
> +void kfd_smi_event_update_gpu_reset(struct kfd_dev *dev, bool post_reset)
> +{
> +	/*
> +	 * GpuReset msg = Reset seq number (incremented for
> +	 * every reset message sent before GPU reset).
> +	 * 1 byte event + 1 byte space + 16 bytes seq num +
> +	 * 1 byte \n + 1 byte \0 = 20
> +	 */
> +	char fifo_in[20];
> +	int len;
> +	unsigned int event;
> +
> +	if (list_empty(&dev->smi_clients)) {
> +		return;
> +	}
> +
> +	memset(fifo_in, 0x0, sizeof(fifo_in));
> +
> +	if (post_reset) {
> +		event = KFD_SMI_EVENT_GPU_POST_RESET;
> +	} else {
> +		event = KFD_SMI_EVENT_GPU_PRE_RESET;
> +		++(dev->reset_seq_num);

This sequence number is per-device. When multiple GPUs in an XGMI hive
get reset together, each GPU would get its own uncorrelated reset
sequence number. It may happen to be the same number, but I don't think
it can be guaranteed.

On the other hand, when two non-XGMI GPUs are reset at different times,
they could get the same sequence number, although the two resets were
caused by different unrelated events, and happened at different times.

It would be better to have a global reset sequence number that can be
used to correlate GPU resets that happen to multiple GPUs (in an XGMI
hive) at the same time, caused by the same event. The easiest way to
achieve this would be, to use a global reset_seq_num and increment it
only in kgd2kfd_pre_reset if atomic_read(&kfd_locked) is 0 (i.e. on the
first GPU that's getting reset at the same time).

> +	}
> +
> +	len = snprintf(fifo_in, 20, "%x %llx\n", event, dev->reset_seq_num);

A safer, more maintainable way would be to write "sizeof(fifo_in)"
instead of hard-coding 20.

Regards,
  Felix

> +
> +	add_event_to_kfifo(dev, event, fifo_in, len);
> +}
> +
>  void kfd_smi_event_update_thermal_throttling(struct kfd_dev *dev,
>  					     uint32_t throttle_bitmask)
>  {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
> index 15537b2cccb5..b9b0438202e2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
> @@ -27,5 +27,6 @@ int kfd_smi_event_open(struct kfd_dev *dev, uint32_t *fd);
>  void kfd_smi_event_update_vmfault(struct kfd_dev *dev, uint16_t pasid);
>  void kfd_smi_event_update_thermal_throttling(struct kfd_dev *dev,
>  					     uint32_t throttle_bitmask);
> +void kfd_smi_event_update_gpu_reset(struct kfd_dev *dev, bool post_reset);
>  
>  #endif
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index cb1f963a84e0..8b7368bfbd84 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -453,6 +453,8 @@ enum kfd_smi_event {
>          KFD_SMI_EVENT_NONE = 0, /* not used */
>          KFD_SMI_EVENT_VMFAULT = 1, /* event start counting at 1 */
>          KFD_SMI_EVENT_THERMAL_THROTTLE = 2,
> +	KFD_SMI_EVENT_GPU_PRE_RESET = 3,
> +	KFD_SMI_EVENT_GPU_POST_RESET = 4,
>  };
>  
>  #define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))