[Bug report] Regression with kernel v6.13-rc2

Thu Dec 19 00:52:35 UTC 2024

[Public]

Hi Tobias,
-----Original Message-----
From: Li, Yunxiang (Teddy) <Yunxiang.Li at amd.com>
Sent: Thursday, December 19, 2024 12:46 AM
To: Deucher, Alexander <Alexander.Deucher at amd.com>; Tobias Klausmann <klausman at schwarzvogel.de>; amd-gfx at lists.freedesktop.org
Cc: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
Subject: RE: [Bug report] Regression with kernel v6.13-rc2

> From: Tobias Klausmann <klausman at schwarzvogel.de>
> Sent: Wednesday, December 18, 2024 10:54
>
> Hi!
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks suppressed
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
>
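For readers following the log, the two raw GCVM_L2_PROTECTION_FAULT_STATUS values can be unpacked by hand. The Python sketch below is only illustrative: the bit layout in FIELDS is an assumption, chosen because it reproduces the decoded fields the driver printed above (including client ID 0x5), and the helper name decode_fault_status is made up for this example. Check the register headers for your ASIC before relying on it.

```python
# Assumed (shift, width) layout for GCVM_L2_PROTECTION_FAULT_STATUS.
# This is an assumption that happens to match the decoded values in the
# log; it is not copied from the driver sources.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID, printed as "CPC (0x5)" in the log
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into named bit fields (hypothetical helper)."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# The two raw values reported in the log above.
for raw in (0x00040B52, 0x00000B33):
    print(f"{raw:#010x}", decode_fault_status(raw))
```

Under that assumed layout, both raw words agree field by field with what the driver printed, so the decoded lines carry no information beyond the raw status.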
> This happens when loading nontrivial (~6g) models using PyTorch. There
> is no immediate crash, but if I exercise the model for a few minutes,
> the GPU eventually crashes (sometimes the whole machine does).
Could you please try this patch?
https://lists.freedesktop.org/archives/amd-gfx/2024-December/118058.html

Thanks
Jesse

>
> I bisected this between -rc1 (which works fine) and -rc2, and I landed on this commit:
>
> commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
> Author: Jesse.zhang at amd.com <Jesse.zhang at amd.com>
> Date:   Thu Dec 5 17:41:26 2024 +0800
>
>     drm/amdkfd: pause autosuspend when creating pdd
>
>     When using MES creating a pdd will require talking to the GPU to
>     setup the relevant context. The code here forgot to wake up the GPU
>     in case it was in suspend, this causes KVM to EFAULT for passthrough
>     GPU for example. This issue can be masked if the GPU was woken up by
>     other things (e.g. opening the KMS node) first and have not yet gone to sleep.
>
>     v4: do the allocation of proc_ctx_bo in a lazy fashion
>     when the first queue is created in a process (Felix)
>
>     Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
>     Reviewed-by: Yunxiang Li <Yunxiang.Li at amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>     Cc: stable at vger.kernel.org
>
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 15 ++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c           | 23 ++--------------------
>  2 files changed, 17 insertions(+), 21 deletions(-)
>
> I am not sure what the causal relationship between the commit and the
> messages I get is, but I thought this report might be useful.

If I had to guess, I'd say that something uses pdd->proc_ctx_gpu_addr before add_queue_mes is called. Since this patch moved the initialization into add_queue_mes, a null address is passed to the GPU and we get the page fault.
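To make the suspected ordering problem concrete, here is a toy Python model of lazy initialization. It is not the amdkfd code: the class, the early_user function, and the address value are all hypothetical, and only the field name proc_ctx_gpu_addr is borrowed from the discussion. It merely shows how a reader that runs before the first queue is created would see a zero address.

```python
class ProcessDeviceData:
    """Toy stand-in for per-process device data (hypothetical)."""

    def __init__(self):
        # Nothing is allocated at creation time any more.
        self.proc_ctx_gpu_addr = 0

    def add_queue(self):
        # Lazy allocation: the context address is only set up when the
        # first queue is created.
        if self.proc_ctx_gpu_addr == 0:
            self.proc_ctx_gpu_addr = 0x1000_0000  # made-up address
        return self.proc_ctx_gpu_addr


def early_user(pdd):
    # Hypothetical code path that reads the address without first
    # making sure a queue (and thus the context) exists.
    return pdd.proc_ctx_gpu_addr


pdd = ProcessDeviceData()
addr_before = early_user(pdd)   # read before any queue exists
pdd.add_queue()
addr_after = early_user(pdd)    # read after the lazy init has run

print(hex(addr_before), hex(addr_after))
```

In this model the first read yields address 0, which would be consistent with the faults at 0x0000000000000000 in the report, though whether such an early reader actually exists in the driver is still only a guess.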

+Alex as well for awareness.

> Since I am not subscribed to the list, please CC me on replies. Thank you!
>
> Best,
> Tobias