Re: Re: Re: Re: [PATCH 1/2] drm/amdgpu: invalidate mmhub semphore workaround in amdgpu_virt

Christian König ckoenig.leichtzumerken at gmail.com
Wed Nov 20 14:38:48 UTC 2019


Hi Monk,

the KIQ is used to invalidate both the GFXHUB as well as the MMHUB on Vega.
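
To illustrate why a check on the ring's own hub cannot tell which hub a KIQ submission is invalidating, here is a minimal sketch. All types and names are hypothetical stand-ins rather than the driver's, and the sketch assumes the KIQ ring is statically tied to the GFX hub:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the driver's hub ids and ring types. */
enum hub_id { HUB_GFX, HUB_MM0, HUB_MM1 };

struct ring_funcs { enum hub_id vmhub; };
struct ring { const struct ring_funcs *funcs; };

/* Assumption for this sketch: the KIQ ring is statically tied to the GFX hub. */
static const struct ring_funcs kiq_funcs = { .vmhub = HUB_GFX };
static const struct ring kiq_ring = { .funcs = &kiq_funcs };

/* Check keyed on the ring's own hub, as in the patch under discussion. */
static bool sem_needed_by_ring(const struct ring *ring)
{
	assert(ring && ring->funcs);
	return ring->funcs->vmhub == HUB_MM0 || ring->funcs->vmhub == HUB_MM1;
}

/* Check keyed on the hub the caller is actually invalidating. */
static bool sem_needed_by_target(enum hub_id target)
{
	return target == HUB_MM0 || target == HUB_MM1;
}
```

Under that assumption, sem_needed_by_ring(&kiq_ring) gives the same answer for every submission, while sem_needed_by_target() follows the hub passed in by the caller.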

> Besides, amdgpu_virt_kiq_reg_write_reg_wait() is not a helper function dedicated to VM invalidation, so I don't think
> you should put the semaphore read/write in this routine. Instead, you can put the semaphore read/write outside of this routine
> and only put it around the VM invalidate logic.
Yes, agreed. But since we now know that we won't need it, we can just
drop this patch altogether.
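
For reference only, a minimal sketch of the structure Monk suggests: the write-then-wait helper stays generic and the semaphore brackets only the VM invalidation at the call site. The register model, struct and function names are hypothetical; the only parts taken from the patch are that a read returning 1 means the semaphore was acquired and that writing 0 releases it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one invalidation engine's registers. */
struct inv_engine {
	uint32_t sem;	/* 1 while the semaphore is held */
	uint32_t req;	/* last invalidation request written */
	uint32_t ack;	/* one bit per VMID, set when that flush completed */
};

/* Model of a semaphore read: returns 1 only if this read acquired it. */
static uint32_t sem_read(struct inv_engine *e)
{
	if (e->sem)
		return 0;
	e->sem = 1;
	return 1;
}

/* Generic write-then-wait helper; it knows nothing about semaphores. */
static bool reg_write_reg_wait(struct inv_engine *e, uint32_t req, uint32_t mask)
{
	e->req = req;
	e->ack |= mask;	/* the model completes the flush immediately */
	return (e->ack & mask) == mask;
}

/* Call site: the semaphore brackets only the VM invalidation. */
static bool flush_tlb(struct inv_engine *e, bool is_mmhub,
		      uint32_t req, unsigned int vmid)
{
	bool done;
	int tries;

	if (is_mmhub) {
		/* a read return value of 1 means the semaphore was acquired */
		for (tries = 0; tries < 100 && sem_read(e) != 1; tries++)
			;
		if (tries == 100)
			return false;	/* never acquired; do not invalidate */
	}

	done = reg_write_reg_wait(e, req, 1u << vmid);

	if (is_mmhub) {
		assert(e->sem == 1);
		e->sem = 0;	/* writing 0 releases the semaphore */
	}
	return done;
}
```

The retry bound of 100 is arbitrary and exists only so the single-threaded model cannot spin forever.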

Regards,
Christian.

Am 20.11.19 um 15:30 schrieb Liu, Monk:
> Thanks for sharing this JIRA.
>
> Now I get the picture of this issue from you and Christian.
>
> So grabbing the semaphore prevents the RTL from powering off the MMHUB, I see.
>
> In practice, SRIOV won't enable PG at all (even our GIM driver won't enable PG; maybe we will enable it in the future).
>
> I don't have many concerns about your patches,
>
> but I have comments on your patch 1:
>
> void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
>   					uint32_t reg0, uint32_t reg1,
> -					uint32_t ref, uint32_t mask)
> +					uint32_t ref, uint32_t mask,
> +					uint32_t sem)
>   {
>   	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>   	struct amdgpu_ring *ring = &kiq->ring;
> @@ -144,9 +145,30 @@ void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
>   	uint32_t seq;
>   
>   	spin_lock_irqsave(&kiq->ring_lock, flags);
> -	amdgpu_ring_alloc(ring, 32);
> +	amdgpu_ring_alloc(ring, 60);
> +
> +	/*
> +	 * The GPUVM invalidate acknowledge state may be lost across a
> +	 * power-gating off cycle. Acquire the semaphore before invalidation
> +	 * and release it after invalidation to avoid entering the
> +	 * power-gated state; this works around the issue.
> +	 */
> +
> +	/* a read return value of 1 means the semaphore was acquired */
> +	if (ring->funcs->vmhub == AMDGPU_MMHUB_0 ||
> +	    ring->funcs->vmhub == AMDGPU_MMHUB_1)
> +		amdgpu_ring_emit_reg_wait(ring, sem, 0x1, 0x1);
>
>
> Note that in this routine the ring is always the KIQ, so the code below looks redundant:
>
> +	/* a read return value of 1 means the semaphore was acquired */
> +	if (ring->funcs->vmhub == AMDGPU_MMHUB_0 ||
> +	    ring->funcs->vmhub == AMDGPU_MMHUB_1)
> +		amdgpu_ring_emit_reg_wait(ring, sem, 0x1, 0x1);
>
> Besides, amdgpu_virt_kiq_reg_write_reg_wait() is not a helper function dedicated to VM invalidation, so I don't think
> you should put the semaphore read/write in this routine. Instead, you can put the semaphore read/write outside of this routine
> and only put it around the VM invalidate logic.
>
> Thanks
>
> -----Original Message-----
> From: Zhu, Changfeng <Changfeng.Zhu at amd.com>
> Sent: November 20, 2019 22:17
> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Huang, Shimmer <Xinmei.Huang at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: RE: Re: Re: [PATCH 1/2] drm/amdgpu: invalidate mmhub semphore workaround in amdgpu_virt
>
>>>> Did Changfeng already hit this issue under SRIOV ???
> I hit this problem on navi14 in gmc_v10_0_emit_flush_gpu_tlb.
> Zhou, Tao has also seen the problem.
>
> This is the ticket:
> http://ontrack-internal.amd.com/browse/SWDEV-201459
>
> With the semaphore patch applied, the problem is fixed.
>
> If SRIOV has concerns about this, we should not add the semaphore under SRIOV.
>
> However, we should apply the semaphore to gmc_v9_0_flush_gpu_tlb/ gmc_v9_0_emit_flush_gpu_tlb/ gmc_v10_0_flush_gpu_tlb/ gmc_v10_0_emit_flush_gpu_tlb.
>
> Otherwise, how can we handle the ticket above?
>
> BR,
> Changfeng.
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Wednesday, November 20, 2019 10:00 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Zhu, Changfeng <Changfeng.Zhu at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Huang, Shimmer <Xinmei.Huang at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: Re: Re: [PATCH 1/2] drm/amdgpu: invalidate mmhub semphore workaround in amdgpu_virt
>
>> Did Changfeng already hit this issue under SRIOV ?
> I don't think so, but Changfeng needs to answer this.
>
> The question is: does the extra semaphore acquire have any negative effect on SRIOV?
>
> I would like to avoid having even more SRIOV specific handling in here which we can't really test on bare metal.
>
> Christian.
>
> Am 20.11.19 um 14:54 schrieb Liu, Monk:
>> Hah, but in the SRIOV case our guest KMD driver is not allowed to do such
>> things (and even if there were a bug where the KMD tried to power gate, the
>> SMU firmware would not actually do it, since we have a PSP L1 policy
>> to prevent such dangerous operations).
>>
>> Did Changfeng already hit this issue under SRIOV ???
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: November 20, 2019 21:21
>> To: Liu, Monk <Monk.Liu at amd.com>; Zhu, Changfeng
>> <Changfeng.Zhu at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhou1, Tao
>> <Tao.Zhou1 at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Huang, Shimmer
>> <Xinmei.Huang at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: Re: [PATCH 1/2] drm/amdgpu: invalidate mmhub semphore
>> workaround in amdgpu_virt
>>
>> Hi Monk,
>>
>> this is a fix for power gating the MMHUB.
>>
>> The basic problem is that the MMHUB can power gate while an invalidation is in progress, which loses all bits in the ACK register and so deadlocks the engine waiting for the invalidation to finish.
>>
>> This bug is hit immediately when we enable power gating of the MMHUB.
>>
>> Regards,
>> Christian.
>>
>> Am 20.11.19 um 14:18 schrieb Liu, Monk:
>>> Hi Changfeng
>>>
>>> First of all, there is no power-gating off cycle involved in AMDGPU
>>> SRIOV, since we don't allow the VF/VM to do such things, so I find it
>>> strange that you posted something like this, especially for the VEGA10
>>> series, which doesn't seem to have any issue in the gpu_flush part.
>>>
>>> Here are my questions for you:
>>> 1) Can you point me to the issue you have been experiencing, and how to
>>> reproduce the bug?
>>> 2) If you did hit an issue, did you verify that your patch fixes it?
>>>
>>> besides
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Changfeng.Zhu
>>> Sent: November 20, 2019 17:14
>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Xiao, Jack
>>> <Jack.Xiao at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Huang, Ray
>>> <Ray.Huang at amd.com>; Huang, Shimmer <Xinmei.Huang at amd.com>;
>>> amd-gfx at lists.freedesktop.org
>>> Cc: Zhu, Changfeng <Changfeng.Zhu at amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu: invalidate mmhub semphore workaround in
>>> amdgpu_virt
>>>
>>> From: changzhu <Changfeng.Zhu at amd.com>
>>>
>>> The GPUVM invalidate acknowledge state may be lost across a power-gating off cycle. To avoid this issue in virt invalidation, add a semaphore acquire before invalidation and a semaphore release after invalidation.
>>>
>>> Change-Id: Ie98304e475166b53eed033462d76423b6b0fc25b
>>> Signed-off-by: changzhu <Changfeng.Zhu at amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 26 ++++++++++++++++++++++--
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |  3 ++-
>>>     drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    |  3 ++-
>>>     3 files changed, 28 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>> index f04eb1a64271..70ffaf91cd12 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>> @@ -135,7 +135,8 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>     
>>>     void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
>>>     					uint32_t reg0, uint32_t reg1,
>>> -					uint32_t ref, uint32_t mask)
>>> +					uint32_t ref, uint32_t mask,
>>> +					uint32_t sem)
>>>     {
>>>     	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>>>     	struct amdgpu_ring *ring = &kiq->ring;
>>> @@ -144,9 +145,30 @@ void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
>>>     	uint32_t seq;
>>>     
>>>     	spin_lock_irqsave(&kiq->ring_lock, flags);
>>> -	amdgpu_ring_alloc(ring, 32);
>>> +	amdgpu_ring_alloc(ring, 60);
>>> +
>>> +	/*
>>> +	 * The GPUVM invalidate acknowledge state may be lost across a
>>> +	 * power-gating off cycle. Acquire the semaphore before invalidation
>>> +	 * and release it after invalidation to avoid entering the
>>> +	 * power-gated state; this works around the issue.
>>> +	 */
>>> +
>>> +	/* a read return value of 1 means the semaphore was acquired */
>>> +	if (ring->funcs->vmhub == AMDGPU_MMHUB_0 ||
>>> +	    ring->funcs->vmhub == AMDGPU_MMHUB_1)
>>> +		amdgpu_ring_emit_reg_wait(ring, sem, 0x1, 0x1);
>>> +
>>>     	amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
>>>     					    ref, mask);
>>> +	/*
>>> +	 * add semaphore release after invalidation,
>>> +	 * write with 0 means semaphore release
>>> +	 */
>>> +	if (ring->funcs->vmhub == AMDGPU_MMHUB_0 ||
>>> +	    ring->funcs->vmhub == AMDGPU_MMHUB_1)
>>> +		amdgpu_ring_emit_wreg(ring, sem, 0);
>>> +
>>>     	amdgpu_fence_emit_polling(ring, &seq);
>>>     	amdgpu_ring_commit(ring);
>>>     	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>> index b0b2bdc750df..bda6a2f37dc0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>> @@ -295,7 +295,8 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg);
>>>     void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
>>>     void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
>>>     					uint32_t reg0, uint32_t rreg1,
>>> -					uint32_t ref, uint32_t mask);
>>> +					uint32_t ref, uint32_t mask,
>>> +					uint32_t sem);
>>>     int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
>>>     int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
>>>     int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index f25cd97ba5f2..1ae59af7836a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -448,9 +448,10 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>>>     			!adev->in_gpu_reset) {
>>>     		uint32_t req = hub->vm_inv_eng0_req + eng;
>>>     		uint32_t ack = hub->vm_inv_eng0_ack + eng;
>>> +		uint32_t sem = hub->vm_inv_eng0_sem + eng;
>>>     
>>>     		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, tmp,
>>> -				1 << vmid);
>>> +						   1 << vmid, sem);
>>>     		return;
>>>     	}
>>>     
>>> --
>>> 2.17.1
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx