[PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later

Lang Yu Lang.Yu at amd.com
Thu Sep 14 05:58:53 UTC 2023


On 09/13/ , Felix Kuehling wrote:
> On 2023-09-13 6:23, Lang Yu wrote:
> > On 09/12/ , Felix Kuehling wrote:
> > > On 2023-09-11 22:52, Lang Yu wrote:
> > > > On 09/11/ , Harish Kasiviswanathan wrote:
> > > > > Heavy-weight TLB flush is required after unmap on all GPUs for
> > > > > correctness and security.
> > > > > 
> > > > > Signed-off-by: Harish Kasiviswanathan<Harish.Kasiviswanathan at amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
> > > > >    1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > index b315311dfe2a..b9950074aee0 100644
> > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
> > > > >    static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > > > >    {
> > > > > -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> > > > > -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > > > > +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
> > > > >    	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
> > > > >    	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > > > >    }
> > > > 1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
> > > > as described in commmit message, why we have this whitelist
> > > > instead of a blacklist?
> > > That was a bug that this patch is fixing. There were specific GPUs and
> > > firmware versions where the TLB flush after unmap was causing intermittent
> > > problems in specific tests. This should have always been a blacklist.
> > > 
> > > 
> > > > 2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
> > > > in svm_range_unmap_from_gpus(). Why not add this whitelist there?
> > > There was a patch that used kfd_flush_tlb_after_unmap in the SVM code. But
> > > you reverted that patch, probably because it caused more problems than it
> > > solved. SVM really must flush TLBs the way it does because it is so tightly
> > > integrated with Linux's virtual memory management and because with XNACK,
> > > memory can be unmapped while GPU work is in progress without preemting
> > > queues (implicitly flushing TLBs and caches):
> > > 
> > > commit 515d7cebc2e2d2b4f0a276d26f3b790a83cdfe06
> > > Author: Lang Yu<Lang.Yu at amd.com>
> > > Date:   Wed Apr 20 10:24:31 2022 +0800
> > > 
> > >      Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too"
> > >      This reverts commit 36bf93216ecbe399c40c5e0486f0f0e3a4afa69e.
> > >      It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
> > >      at the moment.
> > >      ./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
> > >      Signed-off-by: Lang Yu<Lang.Yu at amd.com>
> > >      Reviewed-by: Philip Yang<Philip.Yang at amd.com>
> > >      Signed-off-by: Alex Deucher<alexander.deucher at amd.com>
> > > 
> > > Regards,
> > >    Felix
> > Yes, that's because kfd_flush_tlb_after_unmap() return false for Vega10(gfx901).
> > kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is called unconditionally in SVM
> > for ASICs > IP_VERSION(9, 0, 0) and works well.
> > 
> > So why not relax the condition to KFD_GC_VERSION(dev) > IP_VERSION(9, 0, 0) ?
> 
> That would reintroduce the same problem that this workaround was meant to
> fix. I don't remember all the details of this, as it was years ago. I
> believe it was an intermittent hang or VM fault that was somewhat difficult
> to reproduce and investigate. Maybe Eric remembers the details as he was
> working on this bug back then. However, it was a real issue, and we got an
> SDMA firmware fix for it on GFX IP 9.4.1 as you can see from the FW version
> check in kfd_flush_tlb_after_unmap.
> 
> I would not recommend reverting this workaround at the risk of reintroducing
> a known intermittent bug that affects stability.

Got it. Thank you!

Regards,
Lang

> Regards,
>   Felix
> 
> 
> 
> > 
> > Regards,
> > Lang
> > 
> > > > Regards,
> > > > Lang
> > > > 
> > > > > -- 
> > > > > 2.34.1
> > > > > 


More information about the amd-gfx mailing list