[PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.
Christian König
christian.koenig at amd.com
Thu Oct 31 10:03:49 UTC 2024
Hi Chong,
On 31.10.24 at 10:54, Li, Chong(Alan) wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> Hi, Christian.
>
> Sharing the progress on the page fault issue in the rocblas benchmark.
>
finally some progress here. Thanks for the update.
> I found that when multiple threads read the register “regIH_VMID_0_LUT” to
> get the pasid,
>
> the register randomly returns a wrong pasid value: sometimes 0,
> sometimes 32768 (the real value is 32770).
>
> After detecting the invalid pasid, the code will “continue” and not flush the
> gpu tlb.
>
That is really disturbing; concurrent register access must work
correctly.
Not only the TLB flush but many other operations depend on things like
that as well.
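
A minimal sketch of the failure mode described above, assuming a simplified software model rather than the driver's actual code: a flush-by-pasid routine walks the VMIDs, reads each mapping, and skips ("continue") every entry whose pasid does not match. The names `lut`, `read_lut`, `flush_by_pasid` and `demo_flushes`, the table size of 16, and the use of VMID 8 are all hypothetical; only the values 0, 32768 and 32770 come from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_VMIDS 16	/* assumed table size for this sketch */

/* Hypothetical software model of the VMID -> PASID lookup table. */
static uint16_t lut[NUM_VMIDS];

/* Stand-in for the register read; 'glitch' models a bogus value coming back. */
static uint16_t read_lut(unsigned int vmid, bool glitch, uint16_t bogus)
{
	return glitch ? bogus : lut[vmid];
}

/*
 * Walk all VMIDs and "flush" the ones mapped to 'pasid'.
 * Returns how many flushes were issued.
 */
static int flush_by_pasid(uint16_t pasid, bool glitch, uint16_t bogus)
{
	int flushed = 0;

	for (unsigned int vmid = 1; vmid < NUM_VMIDS; vmid++) {
		uint16_t queried = read_lut(vmid, glitch, bogus);

		if (queried != pasid)
			continue;	/* mismatch: this VMID is skipped */
		flushed++;		/* a real driver would flush the TLB here */
	}
	return flushed;
}

/* Map one VMID to the pasid from the report and return the flush count. */
int demo_flushes(bool glitch, uint16_t bogus)
{
	for (unsigned int i = 0; i < NUM_VMIDS; i++)
		lut[i] = 0;
	lut[8] = 32770;		/* VMID 8 is an arbitrary choice */
	return flush_by_pasid(32770, glitch, bogus);
}
```

With a correct read, `demo_flushes(false, 0)` issues one flush; with a bogus read of 0 or 32768 it issues none, which would leave stale TLB entries behind without any error being reported.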
> That’s why the page fault occurs.
>
> After adding the lock, the register no longer returns invalid values, and the
> rocblas benchmark passes.
>
> You have submitted a patch "implement TLB flush fence"; in this patch you
> create a kernel thread to flush the gpu tlb.
>
> And in the main thread the function “svm_range_map_to_gpus” calls the
> function “kfd_flush_tlb” and then flushes the gpu tlb as well.
>
> This means that both threads call the function
> “gmc_v11_0_flush_gpu_tlb_pasid”.
>
> So after your patch was merged, the page fault issue occurs.
>
> My first patch changed the gpu tlb flush to sync mode,
>
> which means that a single thread flushes the gpu tlb twice, so my first patch passed
> the rocblas benchmark.
>
I will have to reject such patches; you need to find the underlying
problem and not mitigate the symptoms.
> I have already sent an email to the firmware team to ask why the register
> returns a wrong value.
>
> But if the firmware team is not able to solve this issue, or needs a long
> time to solve it,
>
> I will submit a patch like the one below as a workaround.
>
Well that basically means a complete stop for any deliverable.
The driver stack simply won't work correctly when register reads return
random values like that.
Regards,
Christian.
> Thanks,
>
> Chong.
>
> *From:*Li, Chong(Alan)
> *Sent:* Friday, October 25, 2024 2:46 PM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Andjelkovic, Dejan
> <Dejan.Andjelkovic at amd.com>
> *Cc:* cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris)
> <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine)
> <Tiantian.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* RE: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode
> to sync mode.
>
> Hi, Christian.
>
> The log file is too large to paste into the email.
>
> I copied the log file to the directory “\\ark\incoming\chong\log
> <file://ark/incoming/chong/log>”; the log file name is “kern.log”.
>
> Can you access this directory?
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Thursday, October 24, 2024 7:22 PM
> *To:* Li, Chong(Alan) <Chong.Li at amd.com>; Andjelkovic, Dejan
> <Dejan.Andjelkovic at amd.com>
> *Cc:* cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris)
> <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine)
> <Tiantian.Zhang at amd.com>; Raina, Yera <Yera.Raina at amd.com>
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode
> to sync mode.
>
> Do you have the full log as a text file? As an image it's pretty much useless.
>
> Regards,
> Christian.
>
> On 24.10.24 at 09:41, Li, Chong(Alan) wrote:
>
> Hi, Christian.
>
> Looking at the dmesg log, we can see that
>
> the page fault still happens after the ptes for address “7ef90be00”
> have already been updated.
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, October 23, 2024 5:26 PM
> *To:* Li, Chong(Alan) <Chong.Li at amd.com>; Andjelkovic, Dejan
> <Dejan.Andjelkovic at amd.com>
> *Cc:* cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris)
> <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine)
> <Tiantian.Zhang at amd.com>; Raina, Yera <Yera.Raina at amd.com>
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb
> mode to sync mode.
>
> Hi Chong,
>
> oh that could indeed be.
>
> I suggest adding a trace point for the page fault so that we can
> guarantee that we use the same time basis for both events.
>
> That should make it trivial to compare them.
>
> Regards,
> Christian.
>
> On 23.10.24 at 10:17, Li, Chong(Alan) wrote:
>
> Hi, Christian.
>
> *I added a log in the kernel, and it proves that the timestamp in the tracing log
> lags behind the dmesg log,*
>
> *so we can’t conclude that the issue is in rocm.*
>
> ------------------------ the information I sync with
> Andjelkovic, Dejan ----------------------------------------
>
> dmesg shows that the page fault happens at address
> “0x000072e5f4401000” at time “6587.772178”,
>
> the tracing log shows that the function “amdgpu_vm_update_ptes” was
> called at time “6587.790869”,
>
> ------------------------ the information I sync with
> Andjelkovic, Dejan ----------------------------------------
>
> From the log timestamps, you concluded that “The test
> tries to access memory before it is probably mapped and that
> is provable by looking into the tracelogs.”.
>
> But after reviewing the code, I see that the function
> “amdgpu_vm_ptes_update” is called within the function
> “svm_range_set_attr”,
>
> so by the time the line “[ 6587.772136]
> amdgpu: pasid 0x8002 svms 0x000000008b03ff39 [0x72e5f4400
> 0x72e5fc3ff] done, r=0” is printed in the dmesg above,
>
> the function “svm_range_set_attr” is about to return and
> “amdgpu_vm_ptes_update” has already been called, so the timestamp is
> not plausible.
>
> I think the timestamp in the tracing log may have some delay, and
> I added a log line in the kernel to verify my guess.
>
> The result is below:
>
> the tracing log shows the address “ffffffc00” at time “227.298607”,
>
> the dmesg log prints the address “ffffffc00” at time “226.756137”.
>
> tracing log:
>
> dmesg log:
>
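
A small worked calculation of the timestamps quoted above, as a sketch only: it assumes the single measured offset between the tracing log and dmesg is representative of the skew at the time of the fault, which the thread does not establish. The function names are hypothetical; the numbers are the ones quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Convert "seconds.microseconds" log stamps to integer microseconds. */
static int64_t to_us(int64_t sec, int64_t usec)
{
	return sec * 1000000 + usec;
}

/* Gap between the page fault (dmesg) and the PTE update (tracing log). */
int64_t fault_to_update_gap_us(void)
{
	return to_us(6587, 790869) - to_us(6587, 772178);
}

/* Lag of the tracing log stamp behind dmesg for the same test address. */
int64_t trace_lag_us(void)
{
	return to_us(227, 298607) - to_us(226, 756137);
}

/* The ordering is only trustworthy if the gap exceeds the clock skew. */
int ordering_is_conclusive(void)
{
	return fault_to_update_gap_us() > trace_lag_us();
}
```

The apparent gap is 18691 microseconds, while the measured lag is 542470 microseconds, so under this assumption the two logs alone cannot order the fault and the PTE update.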
> Thanks,
>
> Chong.
>
> *From:*Li, Chong(Alan)
> *Sent:* Monday, October 21, 2024 6:38 PM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Raina, Yera
> <Yera.Raina at amd.com>; Andjelkovic, Dejan
> <Dejan.Andjelkovic at amd.com>
> *Cc:* cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris)
> <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine)
> <Tiantian.Zhang at amd.com>
> *Subject:* RE: [PATCH] drm/amd/amdgpu: change the flush gpu
> tlb mode to sync mode.
>
> Hi, Christian.
>
> Thanks for your reply,
>
> And do you have any advice about this issue?
>
> Hi, Raina, Yera.
> Shall I assign this ticket SWDEV-459983
> <https://ontrack-internal.amd.com/browse/SWDEV-459983> to the rocm
> team?
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Monday, October 21, 2024 6:08 PM
> *To:* Li, Chong(Alan) <Chong.Li at amd.com>; Raina, Yera
> <Yera.Raina at amd.com>
> *Cc:* cao, lin <lin.cao at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu
> tlb mode to sync mode.
>
> Hi Chong,
>
> Andjelkovic just shared a bunch of traces from rocm on teams
> with me which I analyzed.
>
> When you know what you are looking for it's actually pretty obvious
> what's going on. Just look at the timestamp of the fault and
> compare that with the timestamp of the operation mapping
> something at the given address.
>
> When mapping an address happens only after accessing that
> address, then there is clearly something wrong in the code
> which coordinates this, and that is the ROCm stress test tool
> in this case.
>
> Regards,
> Christian.
>
> On 21.10.24 at 11:02, Li, Chong(Alan) wrote:
>
> Hi, Christian, Raina, Yera.
>
> If this issue is in rocm, I need to assign my ticket
> SWDEV-459983
> <https://ontrack-internal.amd.com/browse/SWDEV-459983> to the
> rocm team.
>
> Is there anything to share with the rocm pm,
>
> such as the email, the chat history, or the ticket you discussed
> with Andjelkovic?
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Monday, October 21, 2024 4:00 PM
> *To:* Li, Chong(Alan) <Chong.Li at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* cao, lin <lin.cao at amd.com>
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush
> gpu tlb mode to sync mode.
>
> On 21.10.24 at 07:56, Chong Li wrote:
>
>
> change the gpu tlb flush mode to sync mode to
> solve the issue in the rocm stress test.
>
>
> And again complete NAK to this.
>
> I've already proven together with Andjelkovic that the
> problem is that the rocm stress test is broken.
>
> The test tries to access memory before it is probably
> mapped and that is provable by looking into the tracelogs.
>
> Regards,
> Christian.
>
> Signed-off-by: Chong Li <chongli2 at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> index 51cddfa3f1e8..4d9ff7b31618 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> @@ -98,7 +98,6 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>  	f->adev = adev;
>  	f->dependency = *fence;
>  	f->pasid = vm->pasid;
> -	INIT_WORK(&f->work, amdgpu_tlb_fence_work);
>  	spin_lock_init(&f->lock);
>  
>  	dma_fence_init(&f->base, &amdgpu_tlb_fence_ops, &f->lock,
> @@ -106,7 +105,8 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>  
>  	/* TODO: We probably need a separate wq here */
>  	dma_fence_get(&f->base);
> -	schedule_work(&f->work);
>  
>  	*fence = &f->base;
> +
> +	amdgpu_tlb_fence_work(&f->work);
>  }
>