[PATCH] drm/amdgpu: only WARN freeing buffers when DMA is unavailable
Christian König
christian.koenig at amd.com
Fri Feb 10 11:56:18 UTC 2023
Am 10.02.23 um 12:30 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> >> The driver are resumed before the core Linux memory management is
> ready to serve allocations. E.g. swap for example isn't turned on yet.
>
> >> This means that this stuff only worked because we were able to
> allocate memory from the pool which isn't guaranteed in any way.
>
> Memory allocation failure can happen at any time, every programmer
> should correctly handle it.
>
We are not talking about memory allocation failure, we are talking about
the kernel calling panic() because it can't properly resume.
Regards,
Christian.
> If memory allocation failure is not critical error and can gracefully
> continue to run, it should be acceptable.
>
> The memory allocation failure during mes self test should be the
> acceptable one. It will not make system hang up and
>
> driver can gracefully continue to run.
>
> Regards,
>
> Jack
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Friday, February 10, 2023 6:25 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>; Quan, Evan <Evan.Quan at amd.com>;
> Christian König <ckoenig.leichtzumerken at gmail.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when DMA
> is unavailable
>
> Hi Jack,
>
> Am 10.02.23 um 10:51 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> Hi Christian,
>
> >> Allocating buffers temporary for stuff like that is illegal
> during resume.
>
> Can you **deeply** explain why it is illegal during ip late_init
> stage which is a part stage of resume?
>
>
> Well no, I don't have the time to explain this to everybody individually.
>
> [Jack] …
>
> In my understanding, after gmc ready, driver can allocate/free
> kernel bo, and after SDMA ready,
>
> the eviction should be ready. What else prevent driver doing that
> during resume?
>
>
> The driver are resumed before the core Linux memory management is
> ready to serve allocations. E.g. swap for example isn't turned on yet.
>
> This means that this stuff only worked because we were able to
> allocate memory from the pool which isn't guaranteed in any way.
>
> >> I strongly suggest to just remove the MES test. It's abusing
> the kernel ring interface in a way we didn't want anyway and is
> currently replaced by Shahanks work.
>
> The kernel mes unit test is very meaningful and important to catch
> MES firmware issue at first time before
>
> issue went spread to other components like kfd/umd to avoid the
> problem complicated, Otherwise, the issue
>
> would become hard to catch and debug.
>
> Secondly, for mes unit test is self-containing and no dependency,
> it is a part of milestone to qualify MES ready,
>
> indicating that it can deliver to other component especially
> during brinup. It is likely ring test and ib test indicating
>
> gfx is ready to go. After totally transitioning to gfx user queue,
> mes unit test may be the only one unit test which
>
> can indicate gfx is ready at the very early stage of bringup when
> UMD is not ready.
>
>
> Alex and I are the maintainers of the driver who are deciding stuff
> like that and at least I don't really buy that argument. The ring, IB
> and benchmark tests are in the kernel module because they are simple.
>
> If we have a complicated unit test like simulating creating an MES
> user queue to test the firmware functionality than that's really
> overkill. Especially when you need to allocate memory for it.
>
> We previously had people requesting to add shader code and other
> complicated testing and rejected that as well because it just bloat up
> the kernel driver unnecessarily.
>
> If we can modify the MES test to not abuse the amdgpu_ring structure
> only work with memory from the SA for example we could keep this, but
> not really in the current state.
>
> Regards,
> Christian.
>
> Regards,
>
> Jack
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>
> *Sent:* Friday, February 10, 2023 4:08 PM
> *To:* Quan, Evan <Evan.Quan at amd.com> <mailto:Evan.Quan at amd.com>;
> Christian König <ckoenig.leichtzumerken at gmail.com>
> <mailto:ckoenig.leichtzumerken at gmail.com>; Xiao, Jack
> <Jack.Xiao at amd.com> <mailto:Jack.Xiao at amd.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when
> DMA is unavailable
>
> Hi Evan,
>
> yeah, exactly that's what this warning should prevent. Allocating
> buffers temporary for stuff like that is illegal during resume.
>
> I strongly suggest to just remove the MES test. It's abusing the
> kernel ring interface in a way we didn't want anyway and is
> currently replaced by Shahanks work.
>
> Regards,
> Christian.
>
> Am 10.02.23 um 05:12 schrieb Quan, Evan:
>
> [AMD Official Use Only - General]
>
> Hi Jack,
>
> Are you trying to fix the call trace popped up on resuming below?
>
> It seems mes created some bo for its self test and freed it up
> later at the final stage of the resuming process.
>
> All these happened before the in_suspend flag cleared. And
> that triggered the call trace.
>
> Is my understanding correct?
>
> [74084.799260] WARNING: CPU: 2 PID: 2891 at
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:425
> amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]
>
> [74084.811019] Modules linked in: nls_iso8859_1 amdgpu(OE)
> iommu_v2 gpu_sched drm_buddy drm_ttm_helper ttm
> drm_display_helper drm_kms_helper i2c_algo_bit fb_sys_fops
> syscopyarea sysfillrect sysimgblt snd_sm
>
> [74084.811042] ip_tables x_tables autofs4 hid_logitech_hidpp
> hid_logitech_dj hid_generic e1000e usbhid ptp uas hid video
> i2c_i801 ahci pps_core crc32_pclmul i2c_smbus usb_storage
> libahci wmi
>
> [74084.914519] CPU: 2 PID: 2891 Comm: kworker/u16:38 Tainted:
> G W IOE 6.0.0-custom #1
>
> [74084.923146] Hardware name: ASUS System Product Name/PRIME
> Z390-A, BIOS 2004 11/02/2021
>
> [74084.931074] Workqueue: events_unbound async_run_entry_fn
>
> [74084.936393] RIP: 0010:amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]
>
> [74084.942422] Code: 00 4d 85 ed 74 08 49 c7 45 00 00 00 00 00
> 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 5b 41 5c 41 5d 41 5e 41
> 5f 5d c3 cc cc cc cc <0f> 0b e9 39 ff ff ff 3d 00 fe ff ff 0f
> 85 75 96 47 00 ebf
>
> [74084.961199] RSP: 0000:ffffbed6812ebb90 EFLAGS: 00010202
>
> [74084.966435] RAX: 0000000000000000 RBX: ffffbed6812ebc50
> RCX: 0000000000000000
>
> [74084.973578] RDX: ffffbed6812ebc70 RSI: ffffbed6812ebc60
> RDI: ffffbed6812ebc50
>
> [74084.980725] RBP: ffffbed6812ebbb8 R08: 0000000000000000
> R09: 00000000000001ff
>
> [74084.987869] R10: ffffbed6812ebb40 R11: 0000000000000000
> R12: ffffbed6812ebc70
>
> [74084.995015] R13: ffffbed6812ebc60 R14: ffff963a2945cc00
> R15: ffff9639c7da5630
>
> [74085.002160] FS: 0000000000000000(0000)
> GS:ffff963d1dc80000(0000) knlGS:0000000000000000
>
> [74085.010262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>
> [74085.016016] CR2: 0000000000000000 CR3: 0000000377c0a001
> CR4: 00000000003706e0
>
> [74085.023164] DR0: 0000000000000000 DR1: 0000000000000000
> DR2: 0000000000000000
>
> [74085.030307] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> DR7: 0000000000000400
>
> [74085.037453] Call Trace:
>
> [74085.039911] <TASK>
>
> [74085.042023] amdgpu_mes_self_test+0x385/0x460 [amdgpu]
>
> [74085.047293] mes_v11_0_late_init+0x44/0x50 [amdgpu]
>
> [74085.052291] amdgpu_device_ip_late_init+0x50/0x270 [amdgpu]
>
> [74085.058032] amdgpu_device_resume+0xb0/0x2d0 [amdgpu]
>
> [74085.063187] amdgpu_pmops_resume+0x37/0x70 [amdgpu]
>
> [74085.068162] pci_pm_resume+0x68/0x100
>
> [74085.071836] ? pci_legacy_resume+0x80/0x80
>
> [74085.075943] dpm_run_callback+0x4c/0x160
>
> [74085.079873] device_resume+0xad/0x210
>
> [74085.083546] async_resume+0x1e/0x40
>
> [74085.087046] async_run_entry_fn+0x30/0x120
>
> [74085.091152] process_one_work+0x21a/0x3f0
>
> [74085.095173] worker_thread+0x50/0x3e0
>
> [74085.098845] ? process_one_work+0x3f0/0x3f0
>
> [74085.103039] kthread+0xfa/0x130
>
> [74085.106189] ? kthread_complete_and_exit+0x20/0x20
>
> [74085.110993] ret_from_fork+0x1f/0x30
>
> [74085.114576] </TASK>
>
> [74085.116773] ---[ end trace 0000000000000000 ]---
>
> BR
>
> Evan
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org>
> <mailto:amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of
> *Christian König
> *Sent:* Monday, February 6, 2023 5:00 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>
> <mailto:Jack.Xiao at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers
> when DMA is unavailable
>
> Am 06.02.23 um 09:28 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> >> >> It's simply not allowed to free up
> resources during suspend since those can't be acquired
> again during resume.
>
> >> The in_suspend flag is set at the beginning of suspend
> and unset at the end of resume. It can’t filter the case
> you mentioned.
>
>
> Why not? This is exactly what it should do.
>
> [Jack] If freeing up resources during resume, it should
> not hit the issue you described. But only checking
> in_suspend flag would take these cases as warning.
>
>
> No, once more: Freeing up or allocating resources between
> suspend and resume is illegal!
>
> If you free up a resource during resume you should absolutely
> hit that, this is intentional!
>
> Regards,
> Christian.
>
>
> Regards,
>
> Jack
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>
> *Sent:* Monday, February 6, 2023 4:06 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>
> <mailto:Jack.Xiao at amd.com>; Christian König
> <ckoenig.leichtzumerken at gmail.com>
> <mailto:ckoenig.leichtzumerken at gmail.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
> buffers when DMA is unavailable
>
> Am 06.02.23 um 08:23 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> >> Nope, that is not related to any hw state.
>
> can use other flag.
>
> >> It's simply not allowed to free up resources during
> suspend since those can't be acquired again during resume.
>
> The in_suspend flag is set at the beginning of suspend
> and unset at the end of resume. It can’t filter the
> case you mentioned.
>
>
> Why not? This is exactly what it should do.
>
>
> Do you know the root cause of these cases hitting the
> issue? So that we can get an exact point to warn the
> freeing up behavior.
>
>
> Well the root cause are programming errors. See between
> suspending and resuming you should not allocate nor free
> memory.
>
> Otherwise we can run into trouble. And this check here is
> one part of that, we should probably add another warning
> during allocation of memory. But this here is certainly
> correct.
>
> Regards,
> Christian.
>
>
> Thanks,
>
> Jack
>
> *From:* Christian König
> <ckoenig.leichtzumerken at gmail.com>
> <mailto:ckoenig.leichtzumerken at gmail.com>
> *Sent:* Friday, February 3, 2023 9:20 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>
> <mailto:Jack.Xiao at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> <mailto:Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
> buffers when DMA is unavailable
>
> Nope, that is not related to any hw state.
>
> It's simply not allowed to free up resources during
> suspend since those can't be acquired again during resume.
>
> We had a couple of cases now where this was wrong. If
> you get a warning from that please fix the code which
> tried to free something during suspend instead.
>
> Regards,
> Christian.
>
> Am 03.02.23 um 07:04 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> >> It's simply illegal to free up memory during
> suspend.
>
> Why? In my understanding, the limit was caused by
> DMA shutdown.
>
> Regards,
>
> Jack
>
> *From:* Koenig, Christian
> <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>
> *Sent:* Thursday, February 2, 2023 7:43 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>
> <mailto:Jack.Xiao at amd.com>;
> amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> <mailto:Alexander.Deucher at amd.com>
> *Subject:* AW: [PATCH] drm/amdgpu: only WARN
> freeing buffers when DMA is unavailable
>
> Big NAK to this! This warning is not related in
> any way to the hw state.
>
> It's simply illegal to free up memory during suspend.
>
> Regards,
>
> Christian.
>
> ------------------------------------------------------------------------
>
> *Von:*Xiao, Jack <Jack.Xiao at amd.com>
> *Gesendet:* Donnerstag, 2. Februar 2023 10:54
> *An:*
> amd-gfx at lists.freedesktop.org<amd-gfx at lists.freedesktop.org>;
> Deucher, Alexander <Alexander.Deucher at amd.com>;
> Koenig, Christian <Christian.Koenig at amd.com>
> *Cc:* Xiao, Jack <Jack.Xiao at amd.com>
> *Betreff:* [PATCH] drm/amdgpu: only WARN freeing
> buffers when DMA is unavailable
>
> Reduce waringings, only warn when DMA is unavailable.
>
> Signed-off-by: Jack Xiao <Jack.Xiao at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 2d237f3d3a2e..e3e3764ea697 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -422,7 +422,8 @@ void
> amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64
> *gpu_addr,
> if (*bo == NULL)
> return;
>
> -
> WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend);
> +
> WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend
> &&
> +
> !amdgpu_ttm_adev((*bo)->tbo.bdev)->ip_blocks[AMD_IP_BLOCK_TYPE_SDMA].status.hw);
>
> if (likely(amdgpu_bo_reserve(*bo, true)
> == 0)) {
> if (cpu_addr)
> --
> 2.37.3
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230210/1981dafd/attachment-0001.htm>
More information about the amd-gfx
mailing list