[PATCH 13/18] drm/amdgpu:fix driver unloading bug

Tue Sep 19 11:37:43 UTC 2017

> Well the question is why does the CPC still need the MQD commands and how can we prevent that?

You are right, I'll see what I can do.

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] 
Sent: September 19, 2017 16:27
To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen at amd.com>
Subject: Re: [PATCH 13/18] drm/amdgpu:fix driver unloading bug

On 19.09.2017 at 06:14, Liu, Monk wrote:
> Christian,
>
>> That sounds at least a bit better. But my question is why doesn't this work like it does on Tonga, i.e. correctly clean things up?
> Tonga also suffers from this issue; we just fixed it in the branch for the CSP customer, and the staging code usually lags behind our private branch ...
>
>> Yeah, but keeping the GART mapping alive is complete nonsense. When the driver unloads all memory should be returned to the OS.
>> So we either keep a GART mapping to pages which are about to be reused and overwritten, or we leak memory on driver shutdown.
>> Neither option sounds very good,
> Unbinding the GART mapping makes the CPC hang if it runs MQD commands, and
> the CPC must run MQD commands because RLCV always requires the CPC to do that
> when RLCV is executing SAVE_VF commands.
>
> Do you have a way to break this cycle?

Well the question is why does the CPC still need the MQD commands and how can we prevent that?

The point is that if we need to keep the GART alive to avoid a crash after driver unload, we also need to keep alive the pages the GART points to.

This means that the pages are either overwritten or we get massive complaints from the MM that we are leaking pages here.
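
To make the trade-off concrete, here is a minimal userspace sketch in C. It is not amdgpu code: `toy_system`, `toy_alloc_page`, `toy_unload` and the three policies are hypothetical names made up for illustration, and the "GART" is a single table entry pointing into a tiny page pool. It only models the three outcomes discussed in this thread: unbinding makes the engine fault, keeping the binding while freeing the page lets the engine read whatever the next owner wrote, and keeping both alive leaks the page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_PAGES     4
#define PAGE_WORDS    4
#define GART_UNMAPPED (-1)
#define MQD_MAGIC     0x00C0FFEEu

/* Hypothetical stand-in for system memory plus a one-entry GART. */
struct toy_system {
	uint32_t pages[NUM_PAGES][PAGE_WORDS];
	int in_use[NUM_PAGES]; /* 1 while some owner holds the page */
	int gart_page;         /* page the GART entry points to, or GART_UNMAPPED */
};

enum unload_policy {
	UNBIND_AND_FREE,       /* normal teardown: unbind, return the page */
	KEEP_BINDING_AND_FREE, /* mapping stays, page goes back to the OS */
	KEEP_BINDING_AND_LEAK  /* mapping stays, page is never returned */
};

struct unload_result {
	int engine_faulted;        /* engine read through an unmapped entry */
	int engine_saw_stale_data; /* engine read something other than the MQD */
	int pages_leaked;          /* driver pages still held after unload */
};

/* Return the first free page index, or -1 if the pool is exhausted. */
static int toy_alloc_page(struct toy_system *s)
{
	for (int i = 0; i < NUM_PAGES; i++) {
		if (!s->in_use[i]) {
			s->in_use[i] = 1;
			return i;
		}
	}
	return -1;
}

static struct unload_result toy_unload(enum unload_policy policy)
{
	struct toy_system s;
	struct unload_result res = {0, 0, 0};
	int mqd, other;

	memset(&s, 0, sizeof(s));
	s.gart_page = GART_UNMAPPED;

	/* Driver load: allocate the MQD page and map it through the GART. */
	mqd = toy_alloc_page(&s);
	assert(mqd >= 0);
	s.pages[mqd][0] = MQD_MAGIC;
	s.gart_page = mqd;

	/* Driver unload, according to the chosen policy. */
	if (policy == UNBIND_AND_FREE)
		s.gart_page = GART_UNMAPPED;
	if (policy != KEEP_BINDING_AND_LEAK)
		s.in_use[mqd] = 0;
	res.pages_leaked = s.in_use[mqd];

	/* The OS hands the next free page to someone else, who overwrites it. */
	other = toy_alloc_page(&s);
	assert(other >= 0);
	memset(s.pages[other], 0xAB, sizeof(s.pages[other]));

	/* The engine still tries to fetch the MQD through the GART. */
	if (s.gart_page == GART_UNMAPPED)
		res.engine_faulted = 1;
	else if (s.pages[s.gart_page][0] != MQD_MAGIC)
		res.engine_saw_stale_data = 1;

	return res;
}
```

Calling `toy_unload()` with each policy in turn shows that no policy leaves all three fields of `unload_result` at zero, which is the dilemma in a nutshell.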

If it's not possible to turn off the CPC on driver unload, the only alternative I can see is to reprogram it so that the MQD commands come from VRAM instead of GART.

Regards,
Christian.

>
>
> BR Monk
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: September 18, 2017 19:54
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen at amd.com>
> Subject: Re: [PATCH 13/18] drm/amdgpu:fix driver unloading bug
>
> On 18.09.2017 at 12:12, Liu, Monk wrote:
>> Christian,
>>
>> Let's discuss this patch and the one that follows it, which skips the KIQ MQD free to avoid the SAVE_FAIL issue.
>>
>>
>> For the patch that skips KIQ MQD deallocation, I think I will drop it and use a new approach:
>> We allocate the KIQ MQD in the VRAM domain, and this BO can be safely freed after the driver is unloaded, because after unload no one will *usually* change the data in this BO.
>> (The exception being e.g. some root app that maps visible VRAM and alters the values in it.)
> That sounds at least a bit better. But my question is why doesn't this work like it does on Tonga, i.e. correctly clean things up?
>
>> For this patch, which skips unbinding the GART mapping to keep the KIQ MQD always valid:
>> The hypervisor side always has a couple of hw components working, and they rely on the GMC being kept alive, so this is very different from BARE-METAL. That is to say, this is the only way we can do it.
> Yeah, but keeping the GART mapping alive is complete nonsense. When the driver unloads all memory should be returned to the OS.
>
> So we either keep a GART mapping to pages which are about to be reused and overwritten, or we leak memory on driver shutdown.
>
> Neither option sounds very good,
> Christian.
>
>> Besides, we'll have more patches in the future for L1 secure mode, which
>> forbids VF access to GMC registers, so under L1 secure mode the driver
>> will always skip GMC programming under SRIOV in both init and fini,
>> but that will come later.
>>
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>> Sent: September 18, 2017 17:28
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Chen, Horace <Horace.Chen at amd.com>
>> Subject: Re: [PATCH 13/18] drm/amdgpu:fix driver unloading bug
>>
>> On 18.09.2017 at 08:11, Monk Liu wrote:
>>> [SWDEV-126631] - fix hypervisor save_vf failure that occurred after the
>>> driver was removed:
>>> 1. Because the KIQ and KCQ were not unmapped, save_vf will fail if the driver freed the mqd of KIQ and KCQ.
>>> 2. KIQ can't be unmapped since RLCV always needs it, so the bo_free on KIQ should be skipped.
>>> 3. KCQ can be unmapped, and should be unmapped during hw_fini.
>>> 4. RLCV still needs to access other mc addresses from some hw even after the driver is unloaded,
>>>       so we should not unbind gart for VF.
>>>
>>> Change-Id: I320487a9a848f41484c5f8cc11be34aca807b424
>>> Signed-off-by: Horace Chen <horace.chen at amd.com>
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> I absolutely can't judge if this is correct or not, but keeping the GART and KIQ alive after the driver is unloaded sounds really fishy to me.
>>
>> Isn't there any other clean way of handling this?
>>
>> Christian.
>>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c |  3 +-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  |  5 +++
>>>     drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    | 60 +++++++++++++++++++++++++++++++-
>>>     3 files changed, 66 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> index f437008..2fee071 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> @@ -394,7 +394,8 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
>>>      */
>>>     void amdgpu_gart_fini(struct amdgpu_device *adev)
>>>     {
>>> -	if (adev->gart.ready) {
>>> +	/* gart is still used by other hw under SRIOV, don't unbind it */
>>> +	if (adev->gart.ready && !amdgpu_sriov_vf(adev)) {
>>>     		/* unbind pages */
>>>     		amdgpu_gart_unbind(adev, 0, adev->gart.num_cpu_pages);
>>>     	}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index 4f6c68f..bf6656f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -309,6 +309,11 @@ void amdgpu_gfx_compute_mqd_sw_fini(struct amdgpu_device *adev)
>>>     				      &ring->mqd_ptr);
>>>     	}
>>>     
>>> +	/* don't deallocate KIQ mqd because the bo is still used by RLCV even
>>> +	after the guest VM is shut down */
>>> +	if (amdgpu_sriov_vf(adev))
>>> +		return;
>>> +
>>>     	ring = &adev->gfx.kiq.ring;
>>>     	kfree(adev->gfx.mec.mqd_backup[AMDGPU_MAX_COMPUTE_RINGS]);
>>>     	amdgpu_bo_free_kernel(&ring->mqd_obj,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> index 44960b3..a577bbc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> @@ -2892,14 +2892,72 @@ static int gfx_v9_0_hw_init(void *handle)
>>>     	return r;
>>>     }
>>>     
>>> +static int gfx_v9_0_kcq_disable(struct amdgpu_ring *kiq_ring,struct amdgpu_ring *ring)
>>> +{
>>> +	struct amdgpu_device *adev = kiq_ring->adev;
>>> +	uint32_t scratch, tmp = 0;
>>> +	int r, i;
>>> +
>>> +	r = amdgpu_gfx_scratch_get(adev, &scratch);
>>> +	if (r) {
>>> +		DRM_ERROR("Failed to get scratch reg (%d).\n", r);
>>> +		return r;
>>> +	}
>>> +	WREG32(scratch, 0xCAFEDEAD);
>>> +
>>> +	r = amdgpu_ring_alloc(kiq_ring, 10);
>>> +	if (r) {
>>> +		DRM_ERROR("Failed to lock KIQ (%d).\n", r);
>>> +		amdgpu_gfx_scratch_free(adev, scratch);
>>> +		return r;
>>> +	}
>>> +
>>> +	/* unmap queues */
>>> +	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_UNMAP_QUEUES, 4));
>>> +	amdgpu_ring_write(kiq_ring, /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
>>> +						PACKET3_UNMAP_QUEUES_ACTION(1) | /* RESET_QUEUES */
>>> +						PACKET3_UNMAP_QUEUES_QUEUE_SEL(0) |
>>> +						PACKET3_UNMAP_QUEUES_ENGINE_SEL(0) |
>>> +						PACKET3_UNMAP_QUEUES_NUM_QUEUES(1));
>>> +	amdgpu_ring_write(kiq_ring, PACKET3_UNMAP_QUEUES_DOORBELL_OFFSET0(ring->doorbell_index));
>>> +	amdgpu_ring_write(kiq_ring, 0);
>>> +	amdgpu_ring_write(kiq_ring, 0);
>>> +	amdgpu_ring_write(kiq_ring, 0);
>>> +	/* write to scratch for completion */
>>> +	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_UCONFIG_REG, 1));
>>> +	amdgpu_ring_write(kiq_ring, (scratch - PACKET3_SET_UCONFIG_REG_START));
>>> +	amdgpu_ring_write(kiq_ring, 0xDEADBEEF);
>>> +	amdgpu_ring_commit(kiq_ring);
>>> +
>>> +	for (i = 0; i < adev->usec_timeout; i++) {
>>> +		tmp = RREG32(scratch);
>>> +		if (tmp == 0xDEADBEEF)
>>> +			break;
>>> +		DRM_UDELAY(1);
>>> +	}
>>> +	if (i >= adev->usec_timeout) {
>>> +		DRM_ERROR("KCQ disable failed (scratch(0x%04X)=0x%08X)\n", scratch, tmp);
>>> +		r = -EINVAL;
>>> +	}
>>> +	amdgpu_gfx_scratch_free(adev, scratch);
>>> +	return r;
>>> +}
>>> +
>>> +
>>>     static int gfx_v9_0_hw_fini(void *handle)
>>>     {
>>>     	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>> +	int i, r;
>>>     
>>>     	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>>>     	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
>>>     	if (amdgpu_sriov_vf(adev)) {
>>> -		pr_debug("For SRIOV client, shouldn't do anything.\n");
>>> +		/* disable KCQ so the CPC does not touch memory that is no longer valid */
>>> +		for (i = 0; i < adev->gfx.num_compute_rings; i++) {
>>> +			r = gfx_v9_0_kcq_disable(&adev->gfx.kiq.ring, &adev->gfx.compute_ring[i]);
>>> +			if (r)
>>> +				return r;
>>> +		}
>>>     		return 0;
>>>     	}
>>>     	gfx_v9_0_cp_enable(adev, false);
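
The completion check in the quoted gfx_v9_0_kcq_disable() follows a common pattern: seed a scratch register with a sentinel, queue a packet that overwrites it, then poll with a bounded timeout. The following is a minimal userspace sketch of that pattern only, not driver code. `toy_engine`, `toy_read_scratch` and `toy_wait_for_completion` are hypothetical names; the two magic values and the -EINVAL result on timeout are taken from the patch, and the per-iteration delay is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SCRATCH_SENTINEL 0xCAFEDEADu /* value seeded before submission */
#define SCRATCH_DONE     0xDEADBEEFu /* value the engine writes when done */

/* Hypothetical engine: finishes after a fixed number of polls, or never. */
struct toy_engine {
	uint32_t scratch;
	int polls_until_done; /* negative means the engine never completes */
};

/* Simulated register read; the engine "completes" once the countdown hits 0. */
static uint32_t toy_read_scratch(struct toy_engine *e)
{
	if (e->polls_until_done == 0)
		e->scratch = SCRATCH_DONE;
	else if (e->polls_until_done > 0)
		e->polls_until_done--;
	return e->scratch;
}

/* Seed the sentinel, then poll up to timeout_polls times for completion. */
static int toy_wait_for_completion(struct toy_engine *e, int timeout_polls)
{
	int i;

	assert(timeout_polls > 0);
	e->scratch = SCRATCH_SENTINEL;
	/* A real driver would write and commit the ring packets here. */

	for (i = 0; i < timeout_polls; i++) {
		if (toy_read_scratch(e) == SCRATCH_DONE)
			return 0;
		/* A real driver would delay briefly between reads. */
	}
	return -EINVAL; /* same error code the patch uses on timeout */
}
```

Seeding the register with a value different from the completion marker matters: without it, a marker left over from an earlier submission could be mistaken for completion of the current one.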
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx