6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before

Christian König ckoenig.leichtzumerken at gmail.com
Mon Jul 22 08:43:17 UTC 2024


That's a known issue and we are already working on it.

Regards,
Christian.

On 20.07.24 at 19:08, Mikhail Gavrilov wrote:
> Hi,
> I have been seeing "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
> messages in my kernel log since 6.10-rc5.
> This message is usually followed by "[drm:amdgpu_mes_reg_write_reg_wait
> [amdgpu]] *ERROR* failed to reg_write_reg_wait".
>
> [ 8972.590502] input: Noble FoKus Mystique (AVRCP) as
> /devices/virtual/input/input21
> [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> [ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> msg=MISC (WAIT_REG_MEM)
> [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> failed to reg_write_reg_wait
> [10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> [12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for >10000us 1027 times, consider switching to WQ_UNBOUND
> [12851.087896] fossilize_repla (45968) used greatest stack depth:
> 17440 bytes left
>
> Unfortunately, it is not easily reproducible.
> It usually appears after I have played the game "STAR WARS
> Jedi: Survivor" for several hours.
> That is why the bisection took so long.
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
> git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
> # good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag
> 'for-6.10-rc4-tag' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
> git bisect good 50736169ecc8387247fe6a00932852ce7b057083
> # bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag
> 'loongarch-fixes-6.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
> git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
> # good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag
> 'ovl-fixes-6.10-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
> # bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag
> 'scsi-fixes' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
> # good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix
> UBSAN warning in kv_dpm.c
> git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
> # bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag
> 'amd-drm-fixes-6.10-2024-06-19' of
> https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
> git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
> # bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA
> fw for psp v14
> git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
> # bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup
> MES11 command submission
> git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
> # first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27]
> drm/amdgpu: cleanup MES11 command submission
>
> Author: Christian König <christian.koenig at amd.com>
> Date:   Fri May 31 10:56:00 2024 +0200
>
>      drm/amdgpu: cleanup MES11 command submission
>
>      The approach of having a separate WB slot for each submission doesn't
>      really work well and for example breaks GPU reset.
>
>      Use a status query packet for the fence update instead since those
>      should always succeed we can use the fence of the original packet to
>      signal the state of the operation.
>
>      While at it cleanup the coding style.
>
>      Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per
> transaction")
>      Reviewed-by: Mukul Joshi <mukul.joshi at amd.com>
>      Signed-off-by: Christian König <christian.koenig at amd.com>
>      Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>
> I can confirm that after reverting e356d321d024 I played for a whole day,
> and the "MES failed to respond" error message no longer appeared.
>
> My hardware specs are: https://linux-hardware.org/?probe=78d8c680db
>
> Christian, can you look into it, please?
>
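
For anyone who wants to check a saved kernel log for the same pattern, here
is a minimal sketch in Python. It is only an illustration: the function name
count_mes_timeouts and the one-second burst gap are assumptions made for this
example and are not part of the driver or of any existing tool. It glues
wrapped lines back together (as in the quoted log above), counts the
"MES failed to respond" messages, and groups them into bursts by timestamp.

```python
import re

# Matches the timestamp prefix of a kernel log line, e.g. "[ 9964.748433] ".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
MES_ERROR = "MES failed to respond to msg=MISC (WAIT_REG_MEM)"


def count_mes_timeouts(log_text, gap_seconds=1.0):
    """Return (total_messages, bursts) for the MES timeout message.

    Lines may carry mail quote markers ("> ") and may be wrapped, so lines
    without a timestamp are appended to the preceding timestamped line.
    A new burst starts when two matching messages are more than
    gap_seconds apart (an arbitrary threshold chosen for this sketch).
    """
    entries = []  # each entry is [timestamp, message]
    for raw in log_text.splitlines():
        line = raw.lstrip("> ").rstrip()
        match = TIMESTAMP.match(line)
        if match:
            entries.append([float(match.group(1)), match.group(2)])
        elif entries and line:
            # Continuation of a wrapped log line.
            entries[-1][1] += " " + line

    # Timestamps from different CPUs can be slightly out of order, so sort.
    times = sorted(t for t, msg in entries if MES_ERROR in msg)

    bursts = 0
    previous = None
    for t in times:
        if previous is None or t - previous > gap_seconds:
            bursts += 1
        previous = t
    return len(times), bursts
```

It could be called on the text of a saved log, for example
count_mes_timeouts(open("dmesg.txt").read()), where dmesg.txt is a
hypothetical file name.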


