[PATCH] drm/amdkfd: drop process ref count when xnack disable

Felix Kuehling felix.kuehling at amd.com
Wed Sep 1 17:54:08 UTC 2021


Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan:
>
> [Public]
>
>
> [Public]
>
>
> I wouldn’t know if it was another bug elsewhere.
>
> From what I was seeing, the leak was coming from !p->xnack_enable on
> the svm_range_restore_pages call.
>
> If it helps, I saw this on Aldebaran where a shader does some bad
> memory access on purpose on a debugged ptraced child process.
>
On Aldebaran the XNACK mode can be changed per process. But the page
fault interrupts are retry faults (until they get turned into no-retry
faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
go into svm_range_restore_pages before they realize that the process in
question doesn't use XNACK.

The patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


> The vm fault prompt pops up in dmesgs and a stale KFD process appends
> per run without this fix.
>
> I’m just assuming at this point that the IV retry bit is set but I
> never confirmed that.
>
>  
>
> Thanks,
>
>  
>
> Jon
>
> *From:* Yang, Philip <Philip.Yang at amd.com>
> *Sent:* Wednesday, September 1, 2021 12:30 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Yang, Philip
> <Philip.Yang at amd.com>; Sierra Guiza, Alejandro (Alex)
> <Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack
> disable
>
>  
>
>  
>
> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
>
>     [AMD Official Use Only]
>
>      
>
>     We were seeing process leaks on a couple of machines running
>     certain tests that triggered vm faults on purpose.
>
>     I think svm_range_restore_pages gets called unconditionally on vm
>     fault handling (unless the retry interrupt payload bit is supposed
>     to be clear with xnack off)?
>
>      
>
> yes, with xnack off, sh_mem_config retry should be off, retry bit is
> supposed to be clear in fault interrupt vector, we should not try to
> recover vm fault, just report the vm fault back to application and
> evict user queues. Maybe it is another bug cause p->xnack_enabled and
> sh_mem_config retry mismatch under specific condition?
>
> Regards,
>
> Philip
>
>     Either way, this patch prevents the process leaks we seeing and is
>     also:
>
>     Reviewed-by: Jonathan Kim <jonathan.kim at amd.com>
>     <mailto:jonathan.kim at amd.com>
>
>      
>
>     Thanks,
>
>      
>
>     Jon
>
>      
>
>      
>
>     *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org>
>     <mailto:amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of
>     *philip yang
>     *Sent:* Wednesday, September 1, 2021 7:30 AM
>     *To:* Sierra Guiza, Alejandro (Alex) <Alex.Sierra at amd.com>
>     <mailto:Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>
>     *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when
>     xnack disable
>
>      
>
>     [CAUTION: External Email]
>
>      
>
>     On 2021-08-31 10:41 p.m., Alex Sierra wrote:
>
>         During svm restore pages interrupt handler, kfd_process ref count was
>
>         never dropped when xnack was disabled. Therefore, the object was never
>
>         released.
>
>     Good catch, but if xnack is off, we should not get here to recover
>     fault.
>
>     The fix looks good to me.
>
>     Reviewed-by: Philip Yang <philip.yang at amd.com>
>     <mailto:philip.yang at amd.com>
>
>          
>
>         Signed-off-by: Alex Sierra <alex.sierra at amd.com> <mailto:alex.sierra at amd.com>
>
>         ---
>
>          drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>
>          1 file changed, 2 insertions(+), 1 deletion(-)
>
>          
>
>         diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
>         index 8f9b5b53dab5..110c46cd7fac 100644
>
>         --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
>         +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
>         @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>
>                  }
>
>                  if (!p->xnack_enabled) {
>
>                         pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>
>         -               return -EFAULT;
>
>         +               r = -EFAULT;
>
>         +               goto out;
>
>                  }
>
>                  svms = &p->svms;
>
>          
>


More information about the amd-gfx mailing list