Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

Sun Apr 27 01:01:12 UTC 2025

On Thu Apr 24, 2025 at 4:44 PM BST, Alex Deucher wrote:
> On Tue, Apr 22, 2025 at 11:59 AM Alexey Klimov <alexey.klimov at linaro.org> wrote:
>>
>> On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:
>> > On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov <alexey.klimov at linaro.org> wrote:
>> >>
>> >> On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
>> >> > On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan <fugang.duan at cixtech.com> wrote:
>> >> >>
>> >> >> From: Alex Deucher <alexdeucher at gmail.com> Sent: April 16, 2025 22:49
>> >> >> >To: Alexey Klimov <alexey.klimov at linaro.org>
>> >> >> >On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov <alexey.klimov at linaro.org> wrote:
>> >> >> >>
>> >> >> >> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>> >> >> >> > From: Alexey Klimov <alexey.klimov at linaro.org> Sent: April 16, 2025 2:28
>> >> >> >> >>#regzbot introduced: v6.12..v6.13
>> >> >> >> >>The only change related to hdp_v5_0_flush_hdp() was
>> >> >> >> >>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >> >> >> >>
>> >> >> >> >>Reverting that commit ^^ did help and resolved that problem. Before
>>
>> [..]
>>
>> >> > OK.  that patch won't change anything then.  Can you try this patch instead?
>> >>
>> >> The config I am using is basically defconfig wrt memory parameters; yeah, I use 4k pages.
>> >>
>> >> So I tested that patch, thank you, and some other different configurations --
>> >> nothing helped. Exactly the same behaviour with the same backtrace.
>> >
>> > Did you test the first (4k check) or the second (don't remap on ARM) patch?
>>
>> The second one. I think you mentioned that the first one won't help for 4k pages.
>>
>>
>> >> So it seems that it is a firmware problem after all?
>> >
>> > There is no GPU firmware involved in this operation.  It's just a
>> > posted write.  E.g., we write to a register to flush the HDP write
>> > queue and then read the register back to make sure the write posted.
>> > If the second patch didn't help, then perhaps there is some issue with
>> > MMIO access on your platform?
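For anyone following the thread, the write-then-read-back pattern described above can be sketched roughly as below. This is only a user-space illustration, not the amdgpu code: reg_write32(), reg_read32() and flush_with_posting_read() are hypothetical names, and a plain volatile variable stands in for the ioremap()ed register that a real driver would access with writel()/readl().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for 32-bit MMIO accessors. In a real driver the
 * pointer would come from ioremap() and the accesses would use
 * writel()/readl(); here it is just a volatile variable. */
static inline void reg_write32(volatile uint32_t *reg, uint32_t val)
{
	*reg = val;
}

static inline uint32_t reg_read32(volatile uint32_t *reg)
{
	return *reg;
}

/* Write the (assumed) flush register, then read the same register back.
 * On PCIe a read is generally ordered behind earlier posted writes to the
 * same device, so once the read returns the write should have reached it.
 * Returns the value read back. */
static uint32_t flush_with_posting_read(volatile uint32_t *flush_reg)
{
	assert(flush_reg != NULL);

	reg_write32(flush_reg, 0);      /* posted write: may still be in flight */
	return reg_read32(flush_reg);   /* posting read: forces it to complete */
}
```

The read-back value itself is not interesting; the point is that the CPU has to wait for the read to complete. If that read targets an address the platform cannot service, the fault is reported at the read, which would fit the exception pointing at the flush function.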
>>
>> I didn't mean GPU firmware at all. I only had the UEFI/EL3 firmware in mind.
>>
>> Completely out of the blue, based on nothing: do you think that
>> adding a delay or some memory barrier between the write and the read
>> might help? I also wonder: if the host data path code is executed during
>> ordinary desktop usage, why doesn't it break later? But yeah, I also
>> think this is a problem with this motherboard. Thank you.
>
> I think I found the problem.  The previous patch wasn't doing what I
> expected.  Please try this patch instead.

This one works!

[    4.483750] [drm] amdgpu kernel modesetting enabled.
[    4.491985] amdgpu: IO link not available for non x86 platforms
[    4.497189] amdgpu: Virtual CRAT table created for CPU
[    4.497559] amdgpu: Topology: Add CPU node
[    4.509623] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 0 <nv_common>
[    4.512905] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 1 <gmc_v10_0>
[    4.513254] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 2 <navi10_ih>
[    4.513595] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 3 <psp>
[    4.513932] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 4 <smu>
[    4.514278] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 5 <dm>
[    4.514625] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 6 <gfx_v10_0>
[    4.514980] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 7 <sdma_v5_2>
[    4.515334] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 8 <vcn_v3_0>
[    4.515699] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 9 <jpeg_v3_0>
[    4.516087] amdgpu 0000:c3:00.0: amdgpu: Fetched VBIOS from VFCT
[    4.516466] amdgpu: ATOM BIOS: 113-V502MECH-0OC
[    4.749748] amdgpu 0000:c3:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    4.777435] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1810000000-0x18101fffff 64bit pref]: releasing
[    4.793256] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: releasing
[    4.844639] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x19ffffffff 64bit pref]: assigned
[    4.849774] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1a00000000-0x1a001fffff 64bit pref]: assigned
[    4.957411] amdgpu 0000:c3:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    4.967618] amdgpu 0000:c3:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    4.992963] [drm] amdgpu: 8176M of VRAM memory ready
[    5.004032] [drm] amdgpu: 7888M of GTT memory ready.
[    6.224159] amdgpu 0000:c3:00.0: amdgpu: STB initialized to 2048 entries
[    6.284328] amdgpu 0000:c3:00.0: amdgpu: Found VCN firmware Version ENC: 1.33 DEC: 4 VEP: 0 Revision: 3
[    6.361142] amdgpu 0000:c3:00.0: amdgpu: reserve 0xa00000 from 0x81fd000000 for PSP TMR
[    6.471231] amdgpu 0000:c3:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    6.492967] amdgpu 0000:c3:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    6.492993] amdgpu 0000:c3:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b3100 (59.49.0)
[    6.513659] amdgpu 0000:c3:00.0: amdgpu: SMU driver if version not matched
[    6.513699] amdgpu 0000:c3:00.0: amdgpu: use vbios provided pptable
[    6.588418] amdgpu 0000:c3:00.0: amdgpu: SMU is initialized successfully!
[    6.800975] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    6.806709] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    6.813516] amdgpu: Virtual CRAT table created for GPU
[    6.819229] amdgpu: Topology: Add dGPU node [0x73ff:0x1002]
[    6.824865] kfd kfd: amdgpu: added device 1002:73ff
[    6.829821] amdgpu 0000:c3:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[    6.838355] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    6.846007] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[    6.853658] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[    6.861398] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[    6.869137] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[    6.876877] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[    6.884615] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[    6.892356] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[    6.900094] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[    6.907921] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[    6.915748] amdgpu 0000:c3:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[    6.923663] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[    6.931050] amdgpu 0000:c3:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[    6.938439] amdgpu 0000:c3:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    6.946089] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[    6.953916] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[    6.961742] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[    6.970485] amdgpu 0000:c3:00.0: amdgpu: Using BACO for runtime pm
[    6.977167] [drm] Initialized amdgpu 3.63.0 for 0000:c3:00.0 on minor 0
[    7.234638] amdgpu 0000:c3:00.0: [drm] fb0: amdgpudrmfb frame buffer device
root@orion:~ # uname -a
Linux orion 6.15.0-rc3test6+ #1 SMP Sun Apr 27 01:12:10 BST 2025 aarch64 GNU/Linux

Thank you for looking into this.

Best regards,
Alexey