[PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

Zhang, Jack (Jian) Jack.Zhang1 at amd.com
Fri Apr 3 05:06:46 UTC 2020


Thanks Monk,

I just updated the patch, and it passed 1000 rounds of the TDR test.

I have sent out a review email.

Regards,
Jack
-----Original Message-----
From: Liu, Monk <Monk.Liu at amd.com> 
Sent: Friday, April 3, 2020 11:38 AM
To: Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>; amd-gfx at lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

Thanks Felix

Hi Jack

I think the changes below can resolve your problem. We already have this on our customer branch; it fixes the memory leak and also fixes my previous bug.
Can you apply this change to gfx_v10/v9? Thanks!

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
index 29749502..532258445 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
@@ -543,6 +543,8 @@ static int kgd_hqd_destroy(struct kgd_dev *kgd, void *mqd,
        uint32_t temp;
        struct v10_compute_mqd *m = get_mqd(mqd);

+       if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset)
+               return 0;
 #if 0
        unsigned long flags;
        int retry;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 35b32ad..f6479e1 100755
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3653,6 +3653,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
        if (r)
                return r;

+       amdgpu_amdkfd_pre_reset(adev);
+
        /* Resume IP prior to SMC */
        r = amdgpu_device_ip_reinit_early_sriov(adev);
        if (r)
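
As a rough illustration of the guard added to kgd_hqd_destroy above, here is a minimal, self-contained C sketch. It is not the amdgpu code: `toy_device`, `toy_hqd_destroy` and `toy_count_hw_accesses` are hypothetical names, and the counter only stands in for the hardware accesses the real function performs. The only thing taken from the diff is the condition itself: return early when the device is an SR-IOV VF and a GPU reset is in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device state that the guard inspects. */
struct toy_device {
    bool is_sriov_vf;   /* models amdgpu_sriov_vf(adev) */
    bool in_gpu_reset;  /* models adev->in_gpu_reset */
    int  hw_accesses;   /* counts simulated hardware accesses */
};

/* Models the guarded destroy path: skip hardware access during a VF reset. */
static int toy_hqd_destroy(struct toy_device *dev)
{
    assert(dev != NULL);

    if (dev->is_sriov_vf && dev->in_gpu_reset)
        return 0;       /* nothing to do, the reset handles the queues */

    dev->hw_accesses++; /* the normal path would touch the hardware here */
    return 0;
}

/* Calls the destroy path `calls` times and reports how often hardware was touched. */
static int toy_count_hw_accesses(bool is_vf, bool in_reset, int calls)
{
    struct toy_device dev = {
        .is_sriov_vf = is_vf,
        .in_gpu_reset = in_reset,
        .hw_accesses = 0,
    };

    for (int i = 0; i < calls; i++)
        toy_hqd_destroy(&dev);

    return dev.hw_accesses;
}
```

In this model, only the combination "VF and in reset" suppresses the hardware access; a VF outside of reset, or a bare-metal device in reset, still takes the normal path.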

_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com>
Sent: Friday, April 3, 2020 1:26 AM
To: Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>; amd-gfx at lists.freedesktop.org; Liu, Monk <Monk.Liu at amd.com>
Subject: Re: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

[+Monk]

This looks reasonable to me. However, you're effectively reverting this commit by Monk:

a03eb637d2a5 drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

In hindsight, Monk's commit was broken. Removing the call to pre_reset has other consequences, such as breaking notifications about reset to user mode, and probably invalidating some assumptions in kfd_post_reset.
Can you coordinate with Monk to work out why his change was needed, and whether you'll need a different solution for the problem he was trying to address?
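
The pairing concern can be sketched with a small, self-contained C model. Everything in it is an assumption made for illustration: `toy_kfd`, `toy_pre_reset`, `toy_post_reset` and the allocation counter are hypothetical and do not mirror the real KFD data structures. It only captures the idea that pre_reset notifies user mode and releases what was allocated, while post_reset allocates again, so skipping pre_reset leaves one more allocation alive per reset round.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the KFD state touched around a reset. */
struct toy_kfd {
    int  live_allocs;    /* buffers currently held (models GTT sub-allocations) */
    bool user_notified;  /* whether user mode was told about this reset */
};

/* Models pre_reset: notify user mode, then release what was allocated. */
static void toy_pre_reset(struct toy_kfd *kfd)
{
    kfd->user_notified = true;
    kfd->live_allocs = 0;
}

/* Models post_reset: bring the stack back up, allocating afresh. */
static void toy_post_reset(struct toy_kfd *kfd)
{
    kfd->live_allocs += 1;
}

/* One reset round; call_pre_reset selects the paired or the unpaired path. */
static void toy_reset_round(struct toy_kfd *kfd, bool call_pre_reset)
{
    kfd->user_notified = false;
    if (call_pre_reset)
        toy_pre_reset(kfd);
    toy_post_reset(kfd);
}

/* Runs `rounds` resets and returns the number of allocations left alive. */
static int toy_run_rounds(int rounds, bool call_pre_reset)
{
    struct toy_kfd kfd = { .live_allocs = 1, .user_notified = false };

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++)
        toy_reset_round(&kfd, call_pre_reset);

    return kfd.live_allocs;
}
```

With the paired path the count stays constant across rounds; with pre_reset skipped it grows by one per round, and user mode is never notified in this model.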

In the meantime, this patch is

Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>


On 2020-04-02 at 3:20 a.m., Jack Zhang wrote:

> kfd_pre_reset frees the mem_objs allocated by kfd_gtt_sa_allocate.
>
> Without this change, the SR-IOV TDR code path never frees that
> memory, which results in a memory leak.
>
> Signed-off-by: Jack Zhang <Jack.Zhang1 at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8faaa17..832daf7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  {
>  	int r;
>  
> +	amdgpu_amdkfd_pre_reset(adev);
> +
>  	if (from_hypervisor)
>  		r = amdgpu_virt_request_full_gpu(adev, true);
>  	else

