[PATCH] drm/amdgpu: only WARN freeing buffers when DMA is unavailable

Fri Feb 10 11:56:18 UTC 2023

Am 10.02.23 um 12:30 schrieb Xiao, Jack:
>
> [AMD Official Use Only - General]
>
> >> The driver are resumed before the core Linux memory management is 
> ready to serve allocations. E.g. swap for example isn't turned on yet.
>
> >> This means that this stuff only worked because we were able to 
> allocate memory from the pool which isn't guaranteed in any way.
>
> Memory allocation failure can happen at any time, every programmer 
> should correctly handle it.
>

We are not talking about memory allocation failure, we are talking about 
the kernel calling panic() because it can't properly resume.

Regards,
Christian.

> If memory allocation failure is not critical error and can gracefully 
> continue to run, it should be acceptable.
>
> The memory allocation failure during mes self test should be the 
> acceptable one. It will not make system hang up and
>
> driver can gracefully continue to run.
>
> Regards,
>
> Jack
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Friday, February 10, 2023 6:25 PM
> *To:* Xiao, Jack <Jack.Xiao at amd.com>; Quan, Evan <Evan.Quan at amd.com>; 
> Christian König <ckoenig.leichtzumerken at gmail.com>; 
> amd-gfx at lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when DMA 
> is unavailable
>
> Hi Jack,
>
> Am 10.02.23 um 10:51 schrieb Xiao, Jack:
>
>     [AMD Official Use Only - General]
>
>     Hi Christian,
>
>     >> Allocating buffers temporary for stuff like that is illegal
>     during resume.
>
>     Can you **deeply** explain why it is illegal during ip late_init
>     stage which is a part stage of resume?
>
>
> Well no, I don't have the time to explain this to everybody individually.
>
> [Jack] …
>
>     In my understanding, after gmc ready, driver can allocate/free
>     kernel bo, and after SDMA ready,
>
>     the eviction should be ready. What else prevent driver doing that
>     during resume?
>
>
> The driver are resumed before the core Linux memory management is 
> ready to serve allocations. E.g. swap for example isn't turned on yet.
>
> This means that this stuff only worked because we were able to 
> allocate memory from the pool which isn't guaranteed in any way.
>
>     >> I strongly suggest to just remove the MES test. It's abusing
>     the kernel ring interface in a way we didn't want anyway and is
>     currently replaced by Shahanks work.
>
>     The kernel mes unit test is very meaningful and important to catch
>     MES firmware issue at first time before
>
>     issue went spread to other components like kfd/umd to avoid the
>     problem complicated, Otherwise, the issue
>
>     would become hard to catch and debug.
>
>     Secondly, for mes unit test is self-containing and no dependency,
>     it is a part of milestone to qualify MES ready,
>
>     indicating that it can deliver to other component especially
>     during brinup. It is likely ring test and ib test indicating
>
>     gfx is ready to go. After totally transitioning to gfx user queue,
>     mes unit test may be the only one unit test which
>
>     can indicate gfx is ready at the very early stage of bringup when
>     UMD is not ready.
>
>
> Alex and I are the maintainers of the driver who are deciding stuff 
> like that and at least I don't really buy that argument. The ring, IB 
> and benchmark tests are in the kernel module because they are simple.
>
> If we have a complicated unit test like simulating creating an MES 
> user queue to test the firmware functionality than that's really 
> overkill. Especially when you need to allocate memory for it.
>
> We previously had people requesting to add shader code and other 
> complicated testing and rejected that as well because it just bloat up 
> the kernel driver unnecessarily.
>
> If we can modify the MES test to not abuse the amdgpu_ring structure 
> only work with memory from the SA for example we could keep this, but 
> not really in the current state.
>
> Regards,
> Christian.
>
>     Regards,
>
>     Jack
>
>     *From:* Koenig, Christian <Christian.Koenig at amd.com>
>     <mailto:Christian.Koenig at amd.com>
>     *Sent:* Friday, February 10, 2023 4:08 PM
>     *To:* Quan, Evan <Evan.Quan at amd.com> <mailto:Evan.Quan at amd.com>;
>     Christian König <ckoenig.leichtzumerken at gmail.com>
>     <mailto:ckoenig.leichtzumerken at gmail.com>; Xiao, Jack
>     <Jack.Xiao at amd.com> <mailto:Jack.Xiao at amd.com>;
>     amd-gfx at lists.freedesktop.org; Deucher, Alexander
>     <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
>     *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when
>     DMA is unavailable
>
>     Hi Evan,
>
>     yeah, exactly that's what this warning should prevent. Allocating
>     buffers temporary for stuff like that is illegal during resume.
>
>     I strongly suggest to just remove the MES test. It's abusing the
>     kernel ring interface in a way we didn't want anyway and is
>     currently replaced by Shahanks work.
>
>     Regards,
>     Christian.
>
>     Am 10.02.23 um 05:12 schrieb Quan, Evan:
>
>         [AMD Official Use Only - General]
>
>         Hi Jack,
>
>         Are you trying to fix the call trace popped up on resuming below?
>
>         It seems mes created some bo for its self test and freed it up
>         later at the final stage of the resuming process.
>
>         All these happened before the in_suspend flag cleared. And
>         that triggered the call trace.
>
>         Is my understanding correct?
>
>         [74084.799260] WARNING: CPU: 2 PID: 2891 at
>         drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:425
>         amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]
>
>         [74084.811019] Modules linked in: nls_iso8859_1 amdgpu(OE)
>         iommu_v2 gpu_sched drm_buddy drm_ttm_helper ttm
>         drm_display_helper drm_kms_helper i2c_algo_bit fb_sys_fops
>         syscopyarea sysfillrect sysimgblt snd_sm
>
>         [74084.811042]  ip_tables x_tables autofs4 hid_logitech_hidpp
>         hid_logitech_dj hid_generic e1000e usbhid ptp uas hid video
>         i2c_i801 ahci pps_core crc32_pclmul i2c_smbus usb_storage
>         libahci wmi
>
>         [74084.914519] CPU: 2 PID: 2891 Comm: kworker/u16:38 Tainted:
>         G        W IOE      6.0.0-custom #1
>
>         [74084.923146] Hardware name: ASUS System Product Name/PRIME
>         Z390-A, BIOS 2004 11/02/2021
>
>         [74084.931074] Workqueue: events_unbound async_run_entry_fn
>
>         [74084.936393] RIP: 0010:amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]
>
>         [74084.942422] Code: 00 4d 85 ed 74 08 49 c7 45 00 00 00 00 00
>         4d 85 e4 74 08 49 c7 04 24 00 00 00 00 5b 41 5c 41 5d 41 5e 41
>         5f 5d c3 cc cc cc cc <0f> 0b e9 39 ff ff ff 3d 00 fe ff ff 0f
>         85 75 96 47 00 ebf
>
>         [74084.961199] RSP: 0000:ffffbed6812ebb90 EFLAGS: 00010202
>
>         [74084.966435] RAX: 0000000000000000 RBX: ffffbed6812ebc50
>         RCX: 0000000000000000
>
>         [74084.973578] RDX: ffffbed6812ebc70 RSI: ffffbed6812ebc60
>         RDI: ffffbed6812ebc50
>
>         [74084.980725] RBP: ffffbed6812ebbb8 R08: 0000000000000000
>         R09: 00000000000001ff
>
>         [74084.987869] R10: ffffbed6812ebb40 R11: 0000000000000000
>         R12: ffffbed6812ebc70
>
>         [74084.995015] R13: ffffbed6812ebc60 R14: ffff963a2945cc00
>         R15: ffff9639c7da5630
>
>         [74085.002160] FS: 0000000000000000(0000)
>         GS:ffff963d1dc80000(0000) knlGS:0000000000000000
>
>         [74085.010262] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>
>         [74085.016016] CR2: 0000000000000000 CR3: 0000000377c0a001
>         CR4: 00000000003706e0
>
>         [74085.023164] DR0: 0000000000000000 DR1: 0000000000000000
>         DR2: 0000000000000000
>
>         [74085.030307] DR3: 0000000000000000 DR6: 00000000fffe0ff0
>         DR7: 0000000000000400
>
>         [74085.037453] Call Trace:
>
>         [74085.039911]  <TASK>
>
>         [74085.042023] amdgpu_mes_self_test+0x385/0x460 [amdgpu]
>
>         [74085.047293] mes_v11_0_late_init+0x44/0x50 [amdgpu]
>
>         [74085.052291] amdgpu_device_ip_late_init+0x50/0x270 [amdgpu]
>
>         [74085.058032] amdgpu_device_resume+0xb0/0x2d0 [amdgpu]
>
>         [74085.063187] amdgpu_pmops_resume+0x37/0x70 [amdgpu]
>
>         [74085.068162] pci_pm_resume+0x68/0x100
>
>         [74085.071836]  ? pci_legacy_resume+0x80/0x80
>
>         [74085.075943] dpm_run_callback+0x4c/0x160
>
>         [74085.079873] device_resume+0xad/0x210
>
>         [74085.083546]  async_resume+0x1e/0x40
>
>         [74085.087046] async_run_entry_fn+0x30/0x120
>
>         [74085.091152] process_one_work+0x21a/0x3f0
>
>         [74085.095173] worker_thread+0x50/0x3e0
>
>         [74085.098845]  ? process_one_work+0x3f0/0x3f0
>
>         [74085.103039]  kthread+0xfa/0x130
>
>         [74085.106189]  ? kthread_complete_and_exit+0x20/0x20
>
>         [74085.110993]  ret_from_fork+0x1f/0x30
>
>         [74085.114576]  </TASK>
>
>         [74085.116773] ---[ end trace 0000000000000000 ]---
>
>         BR
>
>         Evan
>
>         *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org>
>         <mailto:amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of
>         *Christian König
>         *Sent:* Monday, February 6, 2023 5:00 PM
>         *To:* Xiao, Jack <Jack.Xiao at amd.com>
>         <mailto:Jack.Xiao at amd.com>; Koenig, Christian
>         <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>;
>         amd-gfx at lists.freedesktop.org; Deucher, Alexander
>         <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
>         *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers
>         when DMA is unavailable
>
>         Am 06.02.23 um 09:28 schrieb Xiao, Jack:
>
>             [AMD Official Use Only - General]
>
>                            >> >> It's simply not allowed to free up
>             resources during suspend since those can't be acquired
>             again during resume.
>
>             >> The in_suspend flag is set at the beginning of suspend
>             and unset at the end of resume. It can’t filter the case
>             you mentioned.
>
>
>                            Why not? This is exactly what it should do.
>
>             [Jack] If freeing up resources during resume, it should
>             not hit the issue you described. But only checking
>             in_suspend flag would take these cases as warning.
>
>
>         No, once more: Freeing up or allocating resources between
>         suspend and resume is illegal!
>
>         If you free up a resource during resume you should absolutely
>         hit that, this is intentional!
>
>         Regards,
>         Christian.
>
>
>             Regards,
>
>             Jack
>
>             *From:* Koenig, Christian <Christian.Koenig at amd.com>
>             <mailto:Christian.Koenig at amd.com>
>             *Sent:* Monday, February 6, 2023 4:06 PM
>             *To:* Xiao, Jack <Jack.Xiao at amd.com>
>             <mailto:Jack.Xiao at amd.com>; Christian König
>             <ckoenig.leichtzumerken at gmail.com>
>             <mailto:ckoenig.leichtzumerken at gmail.com>;
>             amd-gfx at lists.freedesktop.org; Deucher, Alexander
>             <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>
>             *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
>             buffers when DMA is unavailable
>
>             Am 06.02.23 um 08:23 schrieb Xiao, Jack:
>
>                 [AMD Official Use Only - General]
>
>                 >> Nope, that is not related to any hw state.
>
>                 can use other flag.
>
>                 >> It's simply not allowed to free up resources during
>                 suspend since those can't be acquired again during resume.
>
>                 The in_suspend flag is set at the beginning of suspend
>                 and unset at the end of resume. It can’t filter the
>                 case you mentioned.
>
>
>             Why not? This is exactly what it should do.
>
>
>                 Do you know the root cause of these cases hitting the
>                 issue? So that we can get an exact point to warn the
>                 freeing up behavior.
>
>
>             Well the root cause are programming errors. See between
>             suspending and resuming you should not allocate nor free
>             memory.
>
>             Otherwise we can run into trouble. And this check here is
>             one part of that, we should probably add another warning
>             during allocation of memory. But this here is certainly
>             correct.
>
>             Regards,
>             Christian.
>
>
>                 Thanks,
>
>                 Jack
>
>                 *From:* Christian König
>                 <ckoenig.leichtzumerken at gmail.com>
>                 <mailto:ckoenig.leichtzumerken at gmail.com>
>                 *Sent:* Friday, February 3, 2023 9:20 PM
>                 *To:* Xiao, Jack <Jack.Xiao at amd.com>
>                 <mailto:Jack.Xiao at amd.com>; Koenig, Christian
>                 <Christian.Koenig at amd.com>
>                 <mailto:Christian.Koenig at amd.com>;
>                 amd-gfx at lists.freedesktop.org; Deucher, Alexander
>                 <Alexander.Deucher at amd.com>
>                 <mailto:Alexander.Deucher at amd.com>
>                 *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
>                 buffers when DMA is unavailable
>
>                 Nope, that is not related to any hw state.
>
>                 It's simply not allowed to free up resources during
>                 suspend since those can't be acquired again during resume.
>
>                 We had a couple of cases now where this was wrong. If
>                 you get a warning from that please fix the code which
>                 tried to free something during suspend instead.
>
>                 Regards,
>                 Christian.
>
>                 Am 03.02.23 um 07:04 schrieb Xiao, Jack:
>
>                     [AMD Official Use Only - General]
>
>                     >> It's simply illegal to free up memory during
>                     suspend.
>
>                     Why? In my understanding, the limit was caused by
>                     DMA shutdown.
>
>                     Regards,
>
>                     Jack
>
>                     *From:* Koenig, Christian
>                     <Christian.Koenig at amd.com>
>                     <mailto:Christian.Koenig at amd.com>
>                     *Sent:* Thursday, February 2, 2023 7:43 PM
>                     *To:* Xiao, Jack <Jack.Xiao at amd.com>
>                     <mailto:Jack.Xiao at amd.com>;
>                     amd-gfx at lists.freedesktop.org; Deucher, Alexander
>                     <Alexander.Deucher at amd.com>
>                     <mailto:Alexander.Deucher at amd.com>
>                     *Subject:* AW: [PATCH] drm/amdgpu: only WARN
>                     freeing buffers when DMA is unavailable
>
>                     Big NAK to this! This warning is not related in
>                     any way to the hw state.
>
>                     It's simply illegal to free up memory during suspend.
>
>                     Regards,
>
>                     Christian.
>
>                     ------------------------------------------------------------------------
>
>                     *Von:*Xiao, Jack <Jack.Xiao at amd.com>
>                     *Gesendet:* Donnerstag, 2. Februar 2023 10:54
>                     *An:*
>                     amd-gfx at lists.freedesktop.org<amd-gfx at lists.freedesktop.org>;
>                     Deucher, Alexander <Alexander.Deucher at amd.com>;
>                     Koenig, Christian <Christian.Koenig at amd.com>
>                     *Cc:* Xiao, Jack <Jack.Xiao at amd.com>
>                     *Betreff:* [PATCH] drm/amdgpu: only WARN freeing
>                     buffers when DMA is unavailable
>
>                     Reduce waringings, only warn when DMA is unavailable.
>
>                     Signed-off-by: Jack Xiao <Jack.Xiao at amd.com>
>                     ---
>                      drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>                      1 file changed, 2 insertions(+), 1 deletion(-)
>
>                     diff --git
>                     a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>                     b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>                     index 2d237f3d3a2e..e3e3764ea697 100644
>                     --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>                     +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>                     @@ -422,7 +422,8 @@ void
>                     amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64
>                     *gpu_addr,
>                              if (*bo == NULL)
>                                      return;
>
>                     -
>                     WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend);
>                     +
>                     WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend
>                     &&
>                     +
>                     !amdgpu_ttm_adev((*bo)->tbo.bdev)->ip_blocks[AMD_IP_BLOCK_TYPE_SDMA].status.hw);
>
>                              if (likely(amdgpu_bo_reserve(*bo, true)
>                     == 0)) {
>                                      if (cpu_addr)
>                     -- 
>                     2.37.3
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230210/1981dafd/attachment-0001.htm>