[PATCH v6] drm/amd: Add Stream Performance Counter Monitors Driver -v6

Felix Kuehling felix.kuehling at amd.com
Thu Aug 20 05:18:10 UTC 2020


[+amd-gfx]

Am 2020-08-19 um 12:15 p.m. schrieb Ba, Gang:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Felix,
>
> For write point update, what is the best way to do:

Do you mean read pointer update? The hardware updates the write pointer,
the driver updates the read pointer.


> 	In V5, the wptr is always updated no matter whether there is a user buffer, so interrupts keep being generated. When the user buffer is set up, data is copied directly. This way may waste system time.
> 	In V6, the wptr is only updated if the user buffer is set up. This way needs to update the rptr and wptr ("spm_update_rptr(pdd);") to generate the interrupt first and then copy data; after the user buffer is full, stop updating the wptr.

Maybe I'm missing some details about when SPM generates interrupts. Does
it generate interrupts regularly, or only when a certain fill-level is
reached?

If interrupts are only generated when the ring buffer is nearly full,
then what happens depends on the size of the user buffers in relation to
the ring buffer size.

 1. If the ring buffer is very large compared to user buffers, then you
    would only copy a small part of the ring buffer to a user buffer and
    you'd never really empty the ring buffer. That means the data that
    user mode gets is maybe a few seconds old because it never gets the
    most recently generated data.
 2. If the ring buffer is very small compared to the user buffer, then
    you will fill the user buffer in small increments with several
    interrupts until it is full. This is somewhat better because user
    mode gets more up-to-date data. But you risk losing data if the
    ring-buffer is too small. And you need to handle more interrupts,
    which is less efficient.

Ideally, you'd want to program the threshold at which the interrupt
triggers to match the user buffer size (or some reasonable fraction of
the ring buffer size, whichever is smaller). So when you get an
interrupt, there is either just enough data in the ring to fill the next
user buffer or the ring buffer is nearly full and should be emptied.

When the user sets a new user buffer, after you program the interrupt
threshold, you should also schedule a worker to read the ring buffer, in
case the write pointer was already above the threshold that would have
triggered an interrupt.
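
A rough sketch of that sequence (illustrative only; it reuses the patch's
fields and kfd_spm_work worker, but spm_program_interrupt_threshold is a
hypothetical helper, not something the patch provides):

    /* Threshold: enough data for the next user buffer, but leave
     * headroom in the ring (illustrative policy).
     */
    u32 threshold = min(spm->ubuf.ubufsize, spm->buf_size / 2);

    spm_program_interrupt_threshold(spm, threshold); /* hypothetical */

    /* The wptr may already be past the threshold, so drain once now */
    schedule_work(&pdd->spm_work);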


>
> Also, I'd like to confirm: when calling "ret = wait_event_interruptible_timeout(spm->spm_buf_wq,
> 				 (spm->is_user_buf_filled == true),
> 				 timeout);"
> 	use user_spm_data->timeout = jiffies_to_msecs(timeout); when ret == -ERESTARTSYS;
> 	use user_spm_data->timeout = jiffies_to_msecs(ret); when ret > 0;

Sorry, I was wrong about this. wait_event_interruptible_timeout doesn't
tell you how much time elapsed if it returns -ERESTARTSYS. But you can
calculate it by saving the current jiffies before you start waiting.

    long timeout = msecs_to_jiffies(user_spm_data->timeout);
    long start_jiffies = jiffies;

    ret = wait_event_interruptible_timeout(...);
    ...
    case -ERESTARTSYS:
    	/* Subtract elapsed time from timeout so we wait that much
    	 * less when the call gets restarted.
    	 */
    	timeout -= (jiffies - start_jiffies);
    	if (timeout <= 0) {
    		ret = -ETIME;
    		timeout = 0;
    	}
    	user_spm_data->timeout = jiffies_to_msecs(timeout);
    	break;
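
Putting that together with the other fixes discussed below (updating the
timeout on the normal path and returning 0 on success), the whole helper
could look something like this (untested sketch):

    static int spm_wait_for_fill_awake(struct kfd_spm_cntr *spm,
    				   struct kfd_ioctl_spm_args *user_spm_data)
    {
    	long timeout = msecs_to_jiffies(user_spm_data->timeout);
    	long start_jiffies = jiffies;
    	long ret;

    	ret = wait_event_interruptible_timeout(spm->spm_buf_wq,
    					       spm->is_user_buf_filled,
    					       timeout);
    	switch (ret) {
    	case 0:
    		/* timed out, no time left */
    		ret = -ETIME;
    		user_spm_data->timeout = 0;
    		break;
    	case -ERESTARTSYS:
    		/* interrupted by a signal; wait less when restarted */
    		timeout -= (jiffies - start_jiffies);
    		if (timeout <= 0) {
    			ret = -ETIME;
    			timeout = 0;
    		}
    		user_spm_data->timeout = jiffies_to_msecs(timeout);
    		break;
    	default:
    		/* buffer filled; report time remaining and success */
    		user_spm_data->timeout = jiffies_to_msecs(ret);
    		ret = 0;
    		break;
    	}
    	return ret;
    }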

Regards,
  Felix


>
> Thank you very much!
>
> Gang
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com> 
> Sent: Monday, August 17, 2020 5:22 PM
> To: Ba, Gang <Gang.Ba at amd.com>; brahma_sw_dev <brahma_sw_dev at amd.com>
> Subject: Re: [PATCH v6] drm/amd: Add Stream Performance Counter Monitors Driver -v6
>
> See comments inline.
>
> Am 2020-08-12 um 9:41 p.m. schrieb Gang Ba:
>> Add Driver code for user to control GPU Stream Performance Counter 
>> Monitor and dump the Sample data to measure the GPU performance.
>>
>> Signed-off-by: Gang Ba <gaba at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/Makefile                |   4 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h         |   8 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_rlc_spm.c | 104 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h            |  12 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c             | 125 +++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c              | 127 ++++++
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c              | 114 ++++++
>>  drivers/gpu/drm/amd/amdkfd/Makefile                |   3 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c           |  18 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h              |  13 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c           |   1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_spm.c               | 439 +++++++++++++++++++++
>>  include/uapi/linux/kfd_ioctl.h                     |  51 ++-
>>  13 files changed, 1015 insertions(+), 4 deletions(-)
>>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_rlc_spm.c
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_spm.c
>>
> [snip]
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_spm.c b/drivers/gpu/drm/amd/amdkfd/kfd_spm.c
>> new file mode 100644
>> index 0000000..1596ebd
>> --- /dev/null
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_spm.c
>> @@ -0,0 +1,439 @@
>> +/*
>> + * Copyright 2020 Advanced Micro Devices, Inc.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
>> + */
>> +
>> +#include <linux/device.h>
>> +#include "kfd_priv.h"
>> +#include "amdgpu_amdkfd.h"
>> +#include "amdgpu_irq.h"
>> +#include "ivsrcid/gfx/irqsrcs_gfx_9_0.h"
>> +#include "ivsrcid/ivsrcid_vislands30.h"
>> +#include <linux/mmu_context.h> // for use_mm()
>> +#include <linux/wait.h>
>> +
>> +struct user_buf {
>> +	uint64_t __user *user_addr;
>> +	u32 ubufsize;
>> +};
>> +
>> +struct kfd_spm_cntr {
>> +	struct kgd_dev *kgd;
> This is unnecessary. You can get this with pdd->dev->kgd. So you can remove this variable and also the kgd parameter from all SPM functions that have a pdd parameter.
>
>
>> +	struct user_buf ubuf;
>> +	struct task_struct *thread;
> Thread is unused.
>
>
>> +	u64    gpu_addr;
>> +	u32    buf_size;
>> +	u32    buf_mark;
> I think you mean buf_mask here. Maybe rename these to ring_size and ring_mask to distinguish them from the user buffer.
>
>
>> +	u32    ring_rptr;
>> +	u32    size_copied;
>> +	u32    buf_count;
> I think the buf_count is only ever 0 or 1. Maybe turn this into a bool has_user_buf or similar.
>
>
>> +	u32    has_data_loss;
>> +	u32    *cpu_addr;
>> +	void   *spm_obj;
>> +	char   *name;
> Name isn't used anywhere.
>
>
>> +	wait_queue_head_t spm_buf_wq;
>> +	bool   is_user_buf_filled;
>> +	bool   is_set_timeout;
>> +	bool   is_spm_started;
>> +};
>> +
>> +static void kfd_spm_wakeup(struct kfd_process_device *pdd)
>> +{
>> +	mutex_lock(&pdd->spm_mutex);
>> +	wake_up(&pdd->spm_cntr->spm_buf_wq);
>> +	mutex_unlock(&pdd->spm_mutex);
>> +}
>> +
>> +static int kfd_spm_data_copy(struct kfd_process_device *pdd, u32 size_to_copy)
>> +{
>> +	struct kfd_spm_cntr *spm = pdd->spm_cntr;
>> +	uint64_t __user *user_address;
>> +	uint64_t *ring_buf;
>> +	u32 user_buf_space_left;
>> +	int ret = 0;
>> +
>> +	if (spm->ubuf.user_addr == NULL)
>> +		return -EFAULT;
>> +
>> +	user_address = (uint64_t *)((uint64_t)spm->ubuf.user_addr + spm->size_copied);
>> +	ring_buf =  (uint64_t *)((uint64_t)spm->cpu_addr + spm->ring_rptr);
>> +
>> +	if (user_address == NULL)
>> +		return -EFAULT;
>> +
>> +	user_buf_space_left = spm->ubuf.ubufsize - spm->size_copied;
>> +
>> +	if (size_to_copy < user_buf_space_left) {
>> +		ret = copy_to_user(user_address, ring_buf, size_to_copy);
>> +		if (ret) {
>> +			spm->has_data_loss = true;
>> +			return -EFAULT;
>> +		}
>> +		spm->size_copied += size_to_copy;
>> +	} else {
>> +		ret = copy_to_user(user_address, ring_buf, user_buf_space_left);
>> +		if (ret) {
>> +			spm->has_data_loss = true;
>> +			return -EFAULT;
>> +		}
>> +
>> +		spm->size_copied = spm->ubuf.ubufsize;
>> +		spm->is_user_buf_filled = true;
>> +		if (spm->is_set_timeout == true)
>> +			kfd_spm_wakeup(pdd);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int kfd_spm_read_ring_buffer(struct kfd_process_device *pdd)
>> +{
>> +	struct kfd_spm_cntr *spm = pdd->spm_cntr;
>> +	u32 size_to_copy;
>> +	int ret = 0;
>> +	u32 ring_wptr;
>> +
>> +	ring_wptr = READ_ONCE(spm->cpu_addr[0]) & spm->buf_mark;
> I think this entire function needs to hold the spm mutex. Otherwise a concurrent set_dest_buffer could remove the buffer while you're writing to it.
>
>
>> +
>> +	/* keep SPM ring buffer running */
>> +	if (spm->is_user_buf_filled == true) {
>> +		spm->has_data_loss = true;
>> +		goto exit;
>> +	}
>> +
>> +	if (!spm->buf_count) {
>> +		spm->has_data_loss = true;
>> +		goto exit;
>> +	}
>> +
>> +	if (spm->ring_rptr == ring_wptr)
>> +		goto exit;
>> +
>> +	if ((spm->ring_rptr >= 0) &&  (spm->ring_rptr  < 0x20)) {
>> +		/*
>> +		 * First 8DW, only use for WritePtr, it is not Counter data
>> +		 */
>> +		spm->ring_rptr = 0x20;
>> +	}
>> +
>> +	if (ring_wptr > spm->ring_rptr) {
>> +		size_to_copy = ring_wptr - spm->ring_rptr;
>> +		ret = kfd_spm_data_copy(pdd, size_to_copy);
>> +	} else {
>> +		size_to_copy = spm->buf_size - spm->ring_rptr;
>> +		ret = kfd_spm_data_copy(pdd, size_to_copy);
>> +
>> +		/* correct counter start point */
>> +		spm->ring_rptr = 0x20;
>> +		size_to_copy = spm->buf_size - spm->ring_rptr;
>> +		if (!ret)
>> +			ret = kfd_spm_data_copy(pdd, size_to_copy);
>> +	}
>> +
>> +	spm->ring_rptr = ring_wptr;
> This is still not correct. kfd_spm_data_copy may not copy everything out of the ring. It only copies until the user buffer is full. If you update rptr to wptr here, you will lose all the data that is in the ring buffer and didn't fit into the user buffer.
>
> It would probably be best if spm->ring_rptr were updated in kfd_spm_data_copy to reflect exactly how much data was copied.
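>
> For example (illustrative sketch; it ignores the 8-DW metadata area at
> the start of the ring for brevity):
>
>     u32 copied = min(size_to_copy, user_buf_space_left);
>
>     if (copy_to_user(user_address, ring_buf, copied)) {
>     	spm->has_data_loss = true;
>     	return -EFAULT;
>     }
>     spm->size_copied += copied;
>     if (spm->size_copied == spm->ubuf.ubufsize)
>     	spm->is_user_buf_filled = true;
>     /* advance the rptr by what was actually consumed */
>     spm->ring_rptr = (spm->ring_rptr + copied) & spm->buf_mark;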
>
>
>> +	amdgpu_amdkfd_rlc_spm_set_rdptr(spm->kgd, spm->ring_rptr);
>> +exit:
>> +	return ret;
>> +}
>> +
>> +static void kfd_spm_work(struct work_struct *work)
>> +{
>> +	struct kfd_process_device *pdd = container_of(work, struct kfd_process_device, spm_work);
>> +	struct mm_struct *mm = NULL; // referenced
>> +
>> +	mm = get_task_mm(pdd->process->lead_thread);
>> +	if (mm) {
>> +		use_mm(mm);
>> +		{ /* attach mm */
>> +			kfd_spm_read_ring_buffer(pdd);
>> +		} /* detach mm */
>> +		unuse_mm(mm);
>> +		/* release the mm structure */
>> +		mmput(mm);
>> +	}
>> +}
>> +
>> +void kfd_spm_init_process_device(struct kfd_process_device *pdd)
>> +{
>> +	mutex_init(&pdd->spm_mutex);
>> +	pdd->spm_cntr = NULL;
>> +}
>> +
>> +static int kfd_acquire_spm(struct kfd_process_device *pdd, struct kgd_dev *kgd)
>> +{
>> +	int ret = 0;
>> +
>> +	mutex_lock(&pdd->spm_mutex);
>> +
>> +	if (pdd->spm_cntr) {
>> +		mutex_unlock(&pdd->spm_mutex);
>> +		return -EINVAL;
>> +	}
>> +
>> +	pdd->spm_cntr = kzalloc(sizeof(struct kfd_spm_cntr), GFP_KERNEL);
>> +	if (!pdd->spm_cntr) {
>> +		mutex_unlock(&pdd->spm_mutex);
>> +		return -ENOMEM;
>> +	}
>> +	mutex_unlock(&pdd->spm_mutex);
>> +
>> +	/* get spm ring buffer 128KB */
>> +	pdd->spm_cntr->buf_size = order_base_2(128 * 1024/4);
>> +	pdd->spm_cntr->buf_size = (1 << pdd->spm_cntr->buf_size) * 4;
>> +	pdd->spm_cntr->buf_mark = pdd->spm_cntr->buf_size - 1;
>> +
>> +	ret = amdgpu_amdkfd_alloc_gtt_mem(kgd,
>> +			pdd->spm_cntr->buf_size, &pdd->spm_cntr->spm_obj,
>> +			&pdd->spm_cntr->gpu_addr, (void *)&pdd->spm_cntr->cpu_addr,
>> +			false, false);
>> +
>> +	if (ret)
>> +		goto alloc_gtt_mem_failure;
>> +
>> +	ret =  amdgpu_amdkfd_rlc_spm_acquire(kgd, pdd->vm,
>> +			pdd->spm_cntr->gpu_addr, pdd->spm_cntr->buf_size);
>> +
>> +	/*
>> +	 * By definition, the last 8 DWs of the buffer are not part of the rings
>> +	 *  and are instead part of the Meta data area.
>> +	 */
>> +	pdd->spm_cntr->buf_size -= 0x20;
>> +
>> +	if (ret)
>> +		goto acquire_spm_failure;
>> +
>> +	init_waitqueue_head(&pdd->spm_cntr->spm_buf_wq);
>> +	INIT_WORK(&pdd->spm_work, kfd_spm_work);
>> +
>> +	spin_lock_init(&pdd->spm_irq_lock);
>> +
>> +	pdd->spm_cntr->kgd = kgd;
>> +
>> +	goto out;
>> +
>> +acquire_spm_failure:
>> +	amdgpu_amdkfd_free_gtt_mem(kgd, pdd->spm_cntr->spm_obj);
>> +
>> +alloc_gtt_mem_failure:
>> +	mutex_lock(&pdd->spm_mutex);
>> +	kfree(pdd->spm_cntr);
>> +	pdd->spm_cntr = NULL;
>> +	mutex_unlock(&pdd->spm_mutex);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int kfd_release_spm(struct kfd_process_device *pdd, struct kgd_dev *kgd)
>> +{
>> +	unsigned long flags;
>> +
>> +	mutex_lock(&pdd->spm_mutex);
>> +	if (!pdd->spm_cntr) {
>> +		mutex_unlock(&pdd->spm_mutex);
>> +		return -EINVAL;
>> +	}
>> +	mutex_unlock(&pdd->spm_mutex);
>> +
>> +	spin_lock_irqsave(&pdd->spm_irq_lock, flags);
>> +	pdd->spm_cntr->is_spm_started = false;
>> +	wake_up_all(&pdd->spm_cntr->spm_buf_wq);
> To avoid race conditions with the interrupt handler, also set
> pdd->spm_cntr = NULL here, and save it in a local variable for the rest
> of this function.
>
> Also keep holding the spm_mutex until after the spin_unlock_irqrestore.
> You can hold both locks at once.
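>
> Something like this (sketch):
>
>     struct kfd_spm_cntr *spm;
>     unsigned long flags;
>
>     mutex_lock(&pdd->spm_mutex);
>     spin_lock_irqsave(&pdd->spm_irq_lock, flags);
>     spm = pdd->spm_cntr;
>     pdd->spm_cntr = NULL;	/* the interrupt handler now sees NULL */
>     if (spm) {
>     	spm->is_spm_started = false;
>     	wake_up_all(&spm->spm_buf_wq);
>     }
>     spin_unlock_irqrestore(&pdd->spm_irq_lock, flags);
>     mutex_unlock(&pdd->spm_mutex);
>     if (!spm)
>     	return -EINVAL;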
>
>
>> +	spin_unlock_irqrestore(&pdd->spm_irq_lock, flags);
>> +
>> +	flush_work(&pdd->spm_work);
>> +	amdgpu_amdkfd_rlc_spm_release(kgd, pdd->vm);
>> +	amdgpu_amdkfd_free_gtt_mem(kgd, pdd->spm_cntr->spm_obj);
>> +
>> +	mutex_lock(&pdd->spm_mutex);
>> +	kfree(pdd->spm_cntr);
>> +	pdd->spm_cntr = NULL;
>> +	mutex_unlock(&pdd->spm_mutex);
>> +	return 0;
>> +}
>> +
>> +static void spm_copy_data_to_usr(struct kfd_ioctl_spm_args *user_spm_data,
>> +				 struct kfd_process_device *pdd)
>> +{
>> +	user_spm_data->bytes_copied = pdd->spm_cntr->size_copied;
>> +	user_spm_data->has_data_loss = pdd->spm_cntr->has_data_loss;
>> +	pdd->spm_cntr->buf_count--;
>> +}
>> +
>> +static void spm_set_dest_info(struct kfd_process_device *pdd,
>> +			      struct kfd_ioctl_spm_args *user_spm_data)
>> +{
>> +	pdd->spm_cntr->ubuf.user_addr = (uint64_t *)user_spm_data->dest_buf;
>> +	pdd->spm_cntr->ubuf.ubufsize = user_spm_data->buf_size;
>> +	pdd->spm_cntr->has_data_loss = false;
>> +	pdd->spm_cntr->size_copied = 0;
>> +	pdd->spm_cntr->is_user_buf_filled = false;
>> +	pdd->spm_cntr->buf_count++;
>> +}
>> +
>> +static int spm_wait_for_fill_awake(struct kfd_spm_cntr *spm,
>> +				   struct kfd_ioctl_spm_args *user_spm_data)
>> +{
>> +	int ret = 0;
>> +
>> +	long timeout = msecs_to_jiffies(user_spm_data->timeout);
>> +	ret = wait_event_interruptible_timeout(spm->spm_buf_wq,
>> +				 (spm->is_user_buf_filled == true),
>> +				 timeout);
>> +
>> +	switch (ret) {
>> +	case 0:
>> +		/* timeout */
>> +		ret = -ETIME;
>> +		user_spm_data->timeout = 0;
>> +		pr_debug("[%s] timeout\n", __func__);
>> +		break;
>> +	case -ERESTARTSYS:
>> +		/* interrupted by signal */
>> +		pr_debug("[%s] interrupted by signal\n", __func__);
> You still need to update user_spm_data->timeout here:
>
>     user_spm_data->timeout = jiffies_to_msecs(timeout);
>
> The timeout variable is updated by the wait_event_interruptible_timeout macro. This fact is poorly documented and it takes some digging through the kernel headers to see this.
>
>
>> +		break;
>> +	default:
>> +		user_spm_data->timeout = jiffies_to_msecs(ret);
> It's probably safer to set ret = 0 here.
>
>
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +static void spm_update_rptr(struct kfd_process_device *pdd)
>> +{
>> +	u32 ring_wptr;
>> +
>> +	ring_wptr = READ_ONCE(pdd->spm_cntr->cpu_addr[0]) & pdd->spm_cntr->buf_mark;
>> +	if (ring_wptr == pdd->spm_cntr->ring_rptr && ring_wptr)
>> +		pdd->spm_cntr->ring_rptr = 0;
>> +	else
>> +		pdd->spm_cntr->ring_rptr = ring_wptr;
>> +
>> +	amdgpu_amdkfd_rlc_spm_set_rdptr(pdd->spm_cntr->kgd, pdd->spm_cntr->ring_rptr);
>> +}
>> +
>> +static int kfd_set_dest_buffer(struct kfd_process_device *pdd,
>> +			       struct kgd_dev *kgd, void *data)
>> +{
>> +	struct kfd_ioctl_spm_args *user_spm_data =
>> +			(struct kfd_ioctl_spm_args *) data;
>> +	struct kfd_spm_cntr *spm = pdd->spm_cntr;
>> +	int ret = 0;
>> +
>> +	mutex_lock(&pdd->spm_mutex);
>> +	if (spm == NULL) {
>> +		mutex_unlock(&pdd->spm_mutex);
>> +		return -EINVAL;
>> +	}
>> +	mutex_unlock(&pdd->spm_mutex);
>> +
>> +	if (user_spm_data->timeout != 0)
>> +		spm->is_set_timeout = true;
>> +	else
>> +		spm->is_set_timeout = false;
>> +
>> +	if (user_spm_data->timeout && spm->buf_count && !spm->is_user_buf_filled) {
>> +		/* wait for previous buffer to fill */
>> +		ret = spm_wait_for_fill_awake(spm, user_spm_data);
>> +		if (ret == -ERESTARTSYS)
>> +			return ret;
>> +	}
>> +
>> +	mutex_lock(&pdd->spm_mutex);
>> +	if (spm->buf_count) {
>> +		/* get info about filled space in previous output buffer */
>> +		spm_copy_data_to_usr(user_spm_data, pdd);
>> +	}
>> +
>> +	if (user_spm_data->dest_buf) {
>> +		/* setup new dest buf, start streaming if necessary */
>> +		spm_set_dest_info(pdd, user_spm_data);
>> +
>> +		/* Start SPM  */
>> +		if (spm->is_spm_started == false) {
>> +			amdgpu_amdkfd_rlc_spm_cntl(kgd, 1);
>> +			spm->is_spm_started = true;
>> +		}
>> +		spm_update_rptr (pdd);
> There is an extra space before (pdd).
>
> Why do you need to update the RPTR here? This function doesn't do any reading from the SPM ring buffer?
>
> I think what would make sense here, is to schedule work to write any available SPM data to the new destination buffer in case the old buffer filled up before we were able to empty the ring buffer. That worker would then update the RPTR.
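>
> Roughly (sketch):
>
>     /* drain data that accumulated before this buffer was set;
>      * the worker updates the RPTR as it copies
>      */
>     schedule_work(&pdd->spm_work);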
>
>
>> +	} else {
>> +		amdgpu_amdkfd_rlc_spm_cntl(kgd, 0);
>> +		spm->is_spm_started = false;
>> +	}
>> +	mutex_unlock(&pdd->spm_mutex);
>> +
>> +	return ret;
>> +}
>> +
>> +int kfd_rlc_spm(struct kfd_process *p, void *data,
>> +		uint32_t buf_size, __u32 operation)
>> +{
>> +	struct kfd_ioctl_spm_args *args = data;
>> +	struct kfd_dev *dev;
>> +	struct kfd_process_device *pdd;
>> +
>> +	dev = kfd_device_by_id(args->gpu_id);
>> +	if (!dev) {
>> +		pr_debug("Could not find gpu id 0x%x\n", args->gpu_id);
>> +		return -EINVAL;
>> +	}
>> +
>> +	pdd = kfd_get_process_device_data(dev, p);
>> +	if (!pdd)
>> +		return -EINVAL;
>> +
>> +	switch (operation) {
>> +	case KFD_IOCTL_SPM_OP_ACQUIRE:
>> +		dev->spm_pasid = p->pasid;
>> +		return  kfd_acquire_spm(pdd, dev->kgd);
>> +
>> +	case KFD_IOCTL_SPM_OP_RELEASE:
>> +		return  kfd_release_spm(pdd, dev->kgd);
>> +
>> +	case KFD_IOCTL_SPM_OP_SET_DEST_BUF:
>> +		return  kfd_set_dest_buffer(pdd, dev->kgd, data);
>> +
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	return -EINVAL;
>> +}
>> +
>> +void kgd2kfd_spm_interrupt(struct kfd_dev *dev)
>> +{
>> +	struct kfd_process_device *pdd;
>> +	uint16_t pasid = dev->spm_pasid;
>> +
>> +	struct kfd_process *p = kfd_lookup_process_by_pasid(pasid);
>> +	unsigned long flags;
>> +
>> +	if (!p) {
>> +		pr_debug("kfd_spm_interrupt p = %p\n", p);
>> +		return; /* Presumably process exited. */
>> +	}
>> +
>> +	pdd = kfd_get_process_device_data(dev, p);
>> +	if (!pdd)
>> +		return;
>> +
>> +	spin_lock_irqsave(&pdd->spm_irq_lock, flags);
>> +
>> +	if (pdd->spm_cntr && pdd->spm_cntr->is_spm_started)
>> +		schedule_work(&pdd->spm_work);
>> +	spin_unlock_irqrestore(&pdd->spm_irq_lock, flags);
> OK, this spin_lock looks like the right way to protect the pdd->spm_cntr access in the interrupt handler. With the fix I pointed out in kfd_release_spm, this should be good.
>
> Regards,
>   Felix
>
>
>> +
>> +	kfd_unref_process(p);
>> +}
>> +
>> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
>> index 6704384..564e657 100644
>> --- a/include/uapi/linux/kfd_ioctl.h
>> +++ b/include/uapi/linux/kfd_ioctl.h
>> @@ -652,6 +652,50 @@ struct kfd_ioctl_smi_events_args {
>>  	__u32 anon_fd;	/* from KFD */
>>  };
>>  
>> +/**
>> + * kfd_ioctl_spm_op - SPM ioctl operations
>> + *
>> + * @KFD_IOCTL_SPM_OP_ACQUIRE: acquire exclusive access to SPM
>> + * @KFD_IOCTL_SPM_OP_RELEASE: release exclusive access to SPM
>> + * @KFD_IOCTL_SPM_OP_SET_DEST_BUF: set or unset destination buffer for SPM streaming
>> + */
>> +enum kfd_ioctl_spm_op {
>> +	KFD_IOCTL_SPM_OP_ACQUIRE,
>> +	KFD_IOCTL_SPM_OP_RELEASE,
>> +	KFD_IOCTL_SPM_OP_SET_DEST_BUF
>> +};
>> +
>> +
>> +/**
>> + * kfd_ioctl_spm_args - Arguments for SPM ioctl
>> + *
>> + * @op:     specifies the operation to perform
>> + * @dest_buf: used for the address of the destination buffer in @KFD_IOCTL_SPM_SET_DEST_BUFFER
>> + * @buf_size: size of the destination buffer in @KFD_IOCTL_SPM_SET_DEST_BUFFER
>> + * @timeout: timeout to wait for the buffer to get filled
>> + * @gpu_id: gpu ID
>> + * @bytes_copied: this is the size in bytes copied to the previous buffer
>> + * @has_data_loss: if set, indicates that not all data was copied to the user buffer
>> + *
>> + * If @dest_buf is NULL, the destination buffer address is unset and
>> + * copying of counters is stopped.
>> + * The timeout will be updated with the time remaining.
>> + *
>> + * Returns negative error code on failure. On success,
>> + * @KFD_IOCTL_SPM_OP_ACQUIRE and @KFD_IOCTL_SPM_OP_RELEASE return 0,
>> + * @KFD_IOCTL_SPM_OP_SET_DEST_BUF returns the fill level of the previous buffer.
>> + */
>> +struct kfd_ioctl_spm_args {
>> +	__u64 dest_buf;
>> +	__u32 buf_size;
>> +	__u32 op;
>> +	__u32 timeout;
>> +	__u32 gpu_id;	/* to KFD */
>> +	/* from KFD: Total amount of bytes copied */
>> +	__u32 bytes_copied;
>> +	__u32 has_data_loss;
>> +};
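>
> For context, a user-space caller of this ioctl might look roughly like
> this (hypothetical sketch; kfd_fd, gpu_id, buf and buf_len are assumed,
> error handling omitted):
>
>     struct kfd_ioctl_spm_args args = {0};
>
>     args.op = KFD_IOCTL_SPM_OP_ACQUIRE;
>     args.gpu_id = gpu_id;
>     ioctl(kfd_fd, AMDKFD_IOC_RLC_SPM, &args);
>
>     args.op = KFD_IOCTL_SPM_OP_SET_DEST_BUF;
>     args.dest_buf = (__u64)(uintptr_t)buf;
>     args.buf_size = buf_len;
>     args.timeout = 1000;	/* ms */
>     ioctl(kfd_fd, AMDKFD_IOC_RLC_SPM, &args);
>     /* on return, args.bytes_copied and args.has_data_loss describe the
>      * previously set buffer
>      */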
>> +
>>  /* Register offset inside the remapped mmio page
>>   */
>>  enum kfd_mmio_remap {
>> @@ -806,6 +850,7 @@ struct kfd_ioctl_cross_memory_copy_args {
>>  #define AMDKFD_IOC_SMI_EVENTS			\
>>  		AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args)
>>  
>> +
>>  #define AMDKFD_COMMAND_START		0x01
>>  #define AMDKFD_COMMAND_END		0x20
>>  
>> @@ -822,7 +867,11 @@ struct kfd_ioctl_cross_memory_copy_args {
>>  #define AMDKFD_IOC_CROSS_MEMORY_COPY		\
>>  		AMDKFD_IOWR(0x83, struct kfd_ioctl_cross_memory_copy_args)
>>  
>> +#define AMDKFD_IOC_RLC_SPM		\
>> +		AMDKFD_IOWR(0x84, struct kfd_ioctl_spm_args)
>> +
>> +
>>  #define AMDKFD_COMMAND_START_2		0x80
>> -#define AMDKFD_COMMAND_END_2		0x84
>> +#define AMDKFD_COMMAND_END_2		0x85
>>  
>>  #endif

