[PATCH 3/3] drm/amdgpu: Improve Vega20 XGMI TLB flush workaround

Mon Jan 20 18:28:29 UTC 2020

On 2020-01-20 12:58 p.m., Felix Kuehling wrote:
> On 2020-01-20 12:47 p.m., shaoyunl wrote:
>> comments in line .
>>
>> On 2020-01-17 8:37 p.m., Felix Kuehling wrote:
>>> Using a heavy-weight TLB flush once is not sufficient. Concurrent
>>> memory accesses in the same TLB cache line can re-populate TLB entries
>>> from stale texture cache (TC) entries while the heavy-weight TLB
>>> flush is in progress. To fix this race condition, perform another TLB
>>> flush after the heavy-weight one, when TC is known to be clean.
>>>
>>> Move the workaround into the low-level TLB flushing functions. This way
>>> they apply to amdgpu as well, and KIQ-based TLB flush only needs to
>>> synchronize once.
>>>
>>> CC: shaoyun.liu at amd.com
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  6 +-
>>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 68 
>>> +++++++++++++++++-----
>>>   2 files changed, 53 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 8609287620ea..5325f6b455f6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -647,13 +647,9 @@ int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct 
>>> kgd_dev *kgd, uint16_t vmid)
>>>   int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct kgd_dev *kgd, 
>>> uint16_t pasid)
>>>   {
>>>       struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
>>> -    uint32_t flush_type = 0;
>>> +    const uint32_t flush_type = 0;
>>>       bool all_hub = false;
>>>   -    if (adev->gmc.xgmi.num_physical_nodes &&
>>> -        adev->asic_type == CHIP_VEGA20)
>>> -        flush_type = 2;
>>> -
>>>       if (adev->family == AMDGPU_FAMILY_AI)
>>>           all_hub = true;
>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index 90216abf14a4..e2a5e852bdb0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -476,13 +476,26 @@ static void gmc_v9_0_flush_gpu_tlb(struct 
>>> amdgpu_device *adev, uint32_t vmid,
>>>   {
>>>       bool use_semaphore = gmc_v9_0_use_invalidate_semaphore(adev, 
>>> vmhub);
>>>       const unsigned eng = 17;
>>> -    u32 j, inv_req, tmp;
>>> +    u32 j, inv_req, inv_req2, tmp;
>>>       struct amdgpu_vmhub *hub;
>>>         BUG_ON(vmhub >= adev->num_vmhubs);
>>>         hub = &adev->vmhub[vmhub];
>>> -    inv_req = gmc_v9_0_get_invalidate_req(vmid, flush_type);
>>> +    if (adev->gmc.xgmi.num_physical_nodes &&
>>> +        adev->asic_type == CHIP_VEGA20) {
>>> +        /* Vega20+XGMI caches PTEs in TC and TLB. Add a
>>> +         * heavy-weight TLB flush (type 2), which flushes
>>> +         * both. Due to a race condition with concurrent
>>> +         * memory accesses using the same TLB cache line, we
>>> +         * still need a second TLB flush after this.
>>> +         */
>>> +        inv_req = gmc_v9_0_get_invalidate_req(vmid, 2);
>>> +        inv_req2 = gmc_v9_0_get_invalidate_req(vmid, flush_type);
>>
>> [shaoyunl]  For the send invalidation in this situation ,can we use 
>> 0  for the flush type directly ? I think no matter what's the input 
>> flush_type for this function , heavy-weight  + legacy invalidation 
>> should be enough for all of them .
>
> I'm not sure that's true. In the case of the race condition, there was 
> some concurrent memory access during the first heavy-weight 
> invalidation. If that is now flushed in the second invalidation, and a 
> heavy-weight invalidation was requested, we should also flush any TC 
> cache lines associated with that access. So hard-coding flush_type 0 
> here is probably not safe for all cases.
>
> Regards,
>   Felix
>
[shaoyunl]   Originally we use the  heavy-weight invalidation for XGMI 
here is due to the HW issue which always use NC even for remote GPU 
memory access (this lead walker to load the TLB directly from TC with 
stale value) . The heavy-weight  will set invalidate bit for both TLB 
and  TC so this will make the walker to load from main memory . Your 
change is based on the assumption that after first heavy-weight 
invalidation , the TC already load with  correct contents which seems  
should be true , so in this situation I think the light-weight or even 
legacy invalidation will be  enough since they will load from TC to TLB 
directly .

Regards

shaoyun.liu


>
>>
>>> +    } else {
>>> +        inv_req = gmc_v9_0_get_invalidate_req(vmid, flush_type);
>>> +        inv_req2 = 0;
>>> +    }
>>>         /* This is necessary for a HW workaround under SRIOV as well
>>>        * as GFXOFF under bare metal
>>> @@ -521,21 +534,27 @@ static void gmc_v9_0_flush_gpu_tlb(struct 
>>> amdgpu_device *adev, uint32_t vmid,
>>>               DRM_ERROR("Timeout waiting for sem acquire in VM 
>>> flush!\n");
>>>       }
>>>   -    WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, inv_req);
>>> +    do {
>>> +        WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, inv_req);
>>>   -    /*
>>> -     * Issue a dummy read to wait for the ACK register to be cleared
>>> -     * to avoid a false ACK due to the new fast GRBM interface.
>>> -     */
>>> -    if (vmhub == AMDGPU_GFXHUB_0)
>>> -        RREG32_NO_KIQ(hub->vm_inv_eng0_req + eng);
>>> +        /*
>>> +         * Issue a dummy read to wait for the ACK register to
>>> +         * be cleared to avoid a false ACK due to the new fast
>>> +         * GRBM interface.
>>> +         */
>>> +        if (vmhub == AMDGPU_GFXHUB_0)
>>> +            RREG32_NO_KIQ(hub->vm_inv_eng0_req + eng);
>>>   -    for (j = 0; j < adev->usec_timeout; j++) {
>>> -        tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack + eng);
>>> -        if (tmp & (1 << vmid))
>>> -            break;
>>> -        udelay(1);
>>> -    }
>>> +        for (j = 0; j < adev->usec_timeout; j++) {
>>> +            tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack + eng);
>>> +            if (tmp & (1 << vmid))
>>> +                break;
>>> +            udelay(1);
>>> +        }
>>> +
>>> +        inv_req = inv_req2;
>>> +        inv_req2 = 0;
>>> +    } while (inv_req);
>>>         /* TODO: It needs to continue working on debugging with 
>>> semaphore for GFXHUB as well. */
>>>       if (use_semaphore)
>>> @@ -577,9 +596,26 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
>>> amdgpu_device *adev,
>>>           return -EIO;
>>>         if (ring->sched.ready) {
>>> +        /* Vega20+XGMI caches PTEs in TC and TLB. Add a
>>> +         * heavy-weight TLB flush (type 2), which flushes
>>> +         * both. Due to a race condition with concurrent
>>> +         * memory accesses using the same TLB cache line, we
>>> +         * still need a second TLB flush after this.
>>> +         */
>>> +        bool vega20_xgmi_wa = (adev->gmc.xgmi.num_physical_nodes &&
>>> +                       adev->asic_type == CHIP_VEGA20);
>>> +        /* 2 dwords flush + 8 dwords fence */
>>> +        unsigned int ndw = kiq->pmf->invalidate_tlbs_size + 8;
>>> +
>>> +        if (vega20_xgmi_wa)
>>> +            ndw += kiq->pmf->invalidate_tlbs_size;
>>> +
>>>           spin_lock(&adev->gfx.kiq.ring_lock);
>>>           /* 2 dwords flush + 8 dwords fence */
>>> -        amdgpu_ring_alloc(ring, kiq->pmf->invalidate_tlbs_size + 8);
>>> +        amdgpu_ring_alloc(ring, ndw);
>>> +        if (vega20_xgmi_wa)
>>> +            kiq->pmf->kiq_invalidate_tlbs(ring,
>>> +                              pasid, 2, all_hub);
>>>           kiq->pmf->kiq_invalidate_tlbs(ring,
>>>                       pasid, flush_type, all_hub);
>>>           amdgpu_fence_emit_polling(ring, &seq);