6.10/bisected/regression - Since commit e356d321d024, the kernel log shows the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)", which was never seen before

Mikhail Gavrilov mikhail.v.gavrilov at gmail.com
Sat Jul 20 17:08:43 UTC 2024


Hi,
I have been seeing "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
messages in my kernel log since 6.10-rc5.
This message is usually followed by "[drm:amdgpu_mes_reg_write_reg_wait
[amdgpu]] *ERROR* failed to reg_write_reg_wait".

[ 8972.590502] input: Noble FoKus Mystique (AVRCP) as
/devices/virtual/input/input21
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for
>10000us 1027 times, consider switching to WQ_UNBOUND
[12851.087896] fossilize_repla (45968) used greatest stack depth:
17440 bytes left
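
For anyone triaging a similar log, here is a minimal sketch (not part of the original report) that counts how many MES timeout messages land in each second of uptime. The function names are hypothetical, and it assumes the log was hard-wrapped by a mail client the way the excerpt above is, so continuation lines are rejoined before matching.

```python
import re
from collections import Counter

# Start of a kernel log record, e.g. "[ 9964.748433] ".
RECORD_START = re.compile(r"^\[\s*(\d+\.\d+)\]\s")
# The message text from the report, as it reads once a wrapped line is rejoined.
MES_ERROR = "MES failed to respond to msg=MISC (WAIT_REG_MEM)"


def join_wrapped(lines):
    """Rejoin records that were wrapped onto a following line."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # No timestamp prefix: treat as the tail of the previous record.
            records[-1] += " " + line
    return records


def count_mes_bursts(lines):
    """Return {uptime in whole seconds: number of MES timeout messages}."""
    bursts = Counter()
    for record in join_wrapped(lines):
        match = RECORD_START.match(record)
        if match and MES_ERROR in record:
            bursts[int(float(match.group(1)))] += 1
    return dict(bursts)


if __name__ == "__main__":
    import sys
    for second, count in sorted(count_mes_bursts(sys.stdin).items()):
        print(f"{second:>8}s  {count} message(s)")
```

Saved under any file name, it can be fed the log on standard input. On the excerpt above it should report 16 messages for each of the two seconds 9964 and 9980, which makes the burst pattern easier to see than in the raw output.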

Unfortunately, it is not easily reproducible.
It usually appears after I have played the game "STAR WARS Jedi:
Survivor" for several hours.
That is why the bisection took so long.

git bisect start
# status: waiting for both good and bad commits
# bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
# good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag
'for-6.10-rc4-tag' of
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good 50736169ecc8387247fe6a00932852ce7b057083
# bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag
'loongarch-fixes-6.10-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
# good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag
'ovl-fixes-6.10-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
# bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag
'scsi-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
# good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix
UBSAN warning in kv_dpm.c
git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
# bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag
'amd-drm-fixes-6.10-2024-06-19' of
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
# bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA
fw for psp v14
git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
# bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup
MES11 command submission
git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
# first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27]
drm/amdgpu: cleanup MES11 command submission

Author: Christian König <christian.koenig at amd.com>
Date:   Fri May 31 10:56:00 2024 +0200

    drm/amdgpu: cleanup MES11 command submission

    The approach of having a separate WB slot for each submission doesn't
    really work well and for example breaks GPU reset.

    Use a status query packet for the fence update instead since those
    should always succeed we can use the fence of the original packet to
    signal the state of the operation.

    While at it cleanup the coding style.

    Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per transaction")
    Reviewed-by: Mukul Joshi <mukul.joshi at amd.com>
    Signed-off-by: Christian König <christian.koenig at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>

I can confirm that after reverting e356d321d024 I played for a whole day,
and the "MES failed to respond" error message no longer appeared.

My hardware specs are: https://linux-hardware.org/?probe=78d8c680db

Christian, can you look into it, please?

-- 
Best Regards,
Mike Gavrilov.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg.zip
Type: application/zip
Size: 57513 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240720/62902470/attachment-0002.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.zip
Type: application/zip
Size: 66515 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240720/62902470/attachment-0003.zip>

