[PATCH] drm/amdkfd: drop process ref count when xnack disable
Felix Kuehling
felix.kuehling at amd.com
Wed Sep 1 17:54:08 UTC 2021
Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan:
>
> [Public]
>
>
> [Public]
>
>
> I wouldn’t know if it was another bug elsewhere.
>
> From what I was seeing, the leak was coming from !p->xnack_enable on
> the svm_range_restore_pages call.
>
> If it helps, I saw this on Aldebaran where a shader does some bad
> memory access on purpose on a debugged ptraced child process.
>
On Aldebaran the XNACK mode can be changed per process. But the page
fault interrupts are retry faults (until they get turned into no-retry
faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
go into svm_range_restore_pages before they realize that the process in
question doesn't use XNACK.
The patch is
Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
> The vm fault prompt pops up in dmesgs and a stale KFD process appends
> per run without this fix.
>
> I’m just assuming at this point that the IV retry bit is set but I
> never confirmed that.
>
>
>
> Thanks,
>
>
>
> Jon
>
> *From:* Yang, Philip <Philip.Yang at amd.com>
> *Sent:* Wednesday, September 1, 2021 12:30 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Yang, Philip
> <Philip.Yang at amd.com>; Sierra Guiza, Alejandro (Alex)
> <Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack
> disable
>
>
>
>
>
> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only]
>
>
>
> We were seeing process leaks on a couple of machines running
> certain tests that triggered vm faults on purpose.
>
> I think svm_range_restore_pages gets called unconditionally on vm
> fault handling (unless the retry interrupt payload bit is supposed
> to be clear with xnack off)?
>
>
>
> yes, with xnack off, sh_mem_config retry should be off, retry bit is
> supposed to be clear in fault interrupt vector, we should not try to
> recover vm fault, just report the vm fault back to application and
> evict user queues. Maybe it is another bug cause p->xnack_enabled and
> sh_mem_config retry mismatch under specific condition?
>
> Regards,
>
> Philip
>
> Either way, this patch prevents the process leaks we seeing and is
> also:
>
> Reviewed-by: Jonathan Kim <jonathan.kim at amd.com>
> <mailto:jonathan.kim at amd.com>
>
>
>
> Thanks,
>
>
>
> Jon
>
>
>
>
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org>
> <mailto:amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of
> *philip yang
> *Sent:* Wednesday, September 1, 2021 7:30 AM
> *To:* Sierra Guiza, Alejandro (Alex) <Alex.Sierra at amd.com>
> <mailto:Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when
> xnack disable
>
>
>
> [CAUTION: External Email]
>
>
>
> On 2021-08-31 10:41 p.m., Alex Sierra wrote:
>
> During svm restore pages interrupt handler, kfd_process ref count was
>
> never dropped when xnack was disabled. Therefore, the object was never
>
> released.
>
> Good catch, but if xnack is off, we should not get here to recover
> fault.
>
> The fix looks good to me.
>
> Reviewed-by: Philip Yang <philip.yang at amd.com>
> <mailto:philip.yang at amd.com>
>
>
>
> Signed-off-by: Alex Sierra <alex.sierra at amd.com> <mailto:alex.sierra at amd.com>
>
> ---
>
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> index 8f9b5b53dab5..110c46cd7fac 100644
>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>
> }
>
> if (!p->xnack_enabled) {
>
> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>
> - return -EFAULT;
>
> + r = -EFAULT;
>
> + goto out;
>
> }
>
> svms = &p->svms;
>
>
>
More information about the amd-gfx
mailing list