[PATCH] drm/amdkfd: drop process ref count when xnack disable

Felix Kuehling felix.kuehling at amd.com
Wed Sep 1 19:43:20 UTC 2021


If it's not too late, please add

Fixes: 2383f56bbe4a ("drm/amdkfd: page table restore through svm API")

Thanks,
  Felix


Am 2021-09-01 um 1:54 p.m. schrieb Felix Kuehling:
> Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan:
>> [Public]
>>
>>
>> [Public]
>>
>>
>> I wouldn’t know if it was another bug elsewhere.
>>
>> From what I was seeing, the leak was coming from !p->xnack_enable on
>> the svm_range_restore_pages call.
>>
>> If it helps, I saw this on Aldebaran where a shader does some bad
>> memory access on purpose on a debugged ptraced child process.
>>
> On Aldebaran the XNACK mode can be changed per process. But the page
> fault interrupts are retry faults (until they get turned into no-retry
> faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
> go into svm_range_restore_pages before they realize that the process in
> question doesn't use XNACK.
>
> The patch is
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>
>
>> The vm fault prompt pops up in dmesgs and a stale KFD process appends
>> per run without this fix.
>>
>> I’m just assuming at this point that the IV retry bit is set but I
>> never confirmed that.
>>
>>  
>>
>> Thanks,
>>
>>  
>>
>> Jon
>>
>> *From:* Yang, Philip <Philip.Yang at amd.com>
>> *Sent:* Wednesday, September 1, 2021 12:30 PM
>> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Yang, Philip
>> <Philip.Yang at amd.com>; Sierra Guiza, Alejandro (Alex)
>> <Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
>> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack
>> disable
>>
>>  
>>
>>  
>>
>> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
>>
>>     [AMD Official Use Only]
>>
>>      
>>
>>     We were seeing process leaks on a couple of machines running
>>     certain tests that triggered vm faults on purpose.
>>
>>     I think svm_range_restore_pages gets called unconditionally on vm
>>     fault handling (unless the retry interrupt payload bit is supposed
>>     to be clear with xnack off)?
>>
>>      
>>
>> yes, with xnack off, sh_mem_config retry should be off, retry bit is
>> supposed to be clear in fault interrupt vector, we should not try to
>> recover vm fault, just report the vm fault back to application and
>> evict user queues. Maybe it is another bug cause p->xnack_enabled and
>> sh_mem_config retry mismatch under specific condition?
>>
>> Regards,
>>
>> Philip
>>
>>     Either way, this patch prevents the process leaks we seeing and is
>>     also:
>>
>>     Reviewed-by: Jonathan Kim <jonathan.kim at amd.com>
>>     <mailto:jonathan.kim at amd.com>
>>
>>      
>>
>>     Thanks,
>>
>>      
>>
>>     Jon
>>
>>      
>>
>>      
>>
>>     *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org>
>>     <mailto:amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of
>>     *philip yang
>>     *Sent:* Wednesday, September 1, 2021 7:30 AM
>>     *To:* Sierra Guiza, Alejandro (Alex) <Alex.Sierra at amd.com>
>>     <mailto:Alex.Sierra at amd.com>; amd-gfx at lists.freedesktop.org
>>     <mailto:amd-gfx at lists.freedesktop.org>
>>     *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when
>>     xnack disable
>>
>>      
>>
>>     [CAUTION: External Email]
>>
>>      
>>
>>     On 2021-08-31 10:41 p.m., Alex Sierra wrote:
>>
>>         During svm restore pages interrupt handler, kfd_process ref count was
>>
>>         never dropped when xnack was disabled. Therefore, the object was never
>>
>>         released.
>>
>>     Good catch, but if xnack is off, we should not get here to recover
>>     fault.
>>
>>     The fix looks good to me.
>>
>>     Reviewed-by: Philip Yang <philip.yang at amd.com>
>>     <mailto:philip.yang at amd.com>
>>
>>          
>>
>>         Signed-off-by: Alex Sierra <alex.sierra at amd.com> <mailto:alex.sierra at amd.com>
>>
>>         ---
>>
>>          drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>>
>>          1 file changed, 2 insertions(+), 1 deletion(-)
>>
>>          
>>
>>         diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>>         index 8f9b5b53dab5..110c46cd7fac 100644
>>
>>         --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>>         +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>>         @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>>
>>                  }
>>
>>                  if (!p->xnack_enabled) {
>>
>>                         pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>>
>>         -               return -EFAULT;
>>
>>         +               r = -EFAULT;
>>
>>         +               goto out;
>>
>>                  }
>>
>>                  svms = &p->svms;
>>
>>          
>>


More information about the amd-gfx mailing list