[PATCH] Documentation: add a page on amdgpu debugging

Xu, Feifei Feifei.Xu at amd.com
Wed Mar 27 04:52:00 UTC 2024



Reviewed-by: Feifei Xu <Feifei.Xu at amd.com>

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Alex Deucher
Sent: Saturday, March 16, 2024 12:45 AM
To: Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] Documentation: add a page on amdgpu debugging

On Fri, Mar 15, 2024 at 12:07 PM Alex Deucher <alexander.deucher at amd.com> wrote:
>
> Covers GPU page fault debugging and adds a reference to umr.
>
> v2: update client ids to include SQC/G
>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> ---
>  Documentation/gpu/amdgpu/debugging.rst | 79 ++++++++++++++++++++++++++
>  Documentation/gpu/amdgpu/index.rst     |  1 +
>  2 files changed, 80 insertions(+)
>  create mode 100644 Documentation/gpu/amdgpu/debugging.rst
>
> diff --git a/Documentation/gpu/amdgpu/debugging.rst b/Documentation/gpu/amdgpu/debugging.rst
> new file mode 100644
> index 000000000000..8b7fdcdf1158
> --- /dev/null
> +++ b/Documentation/gpu/amdgpu/debugging.rst
> @@ -0,0 +1,79 @@
> +===============
> + GPU Debugging
> +===============
> +
> +GPUVM Debugging
> +===============
> +
> +To aid in debugging GPU virtual memory related problems, the driver
> +supports a number of module parameters:
> +
> +`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.
> +
> +`vm_update_mode` - If non-0, use the CPU to update GPU page tables
> +rather than the GPU.
> +
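
To check which values are currently in effect, the two parameters above can be read back from sysfs. The following is a minimal sketch, assuming the amdgpu module is loaded and exposes its parameters under the usual /sys/module/<module>/parameters directory; the helper name read_vm_debug_params is made up for illustration and is not part of the patch.

```python
from pathlib import Path

# Assumption: the standard sysfs location for parameters of a loaded module.
PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_vm_debug_params(param_dir=PARAM_DIR):
    """Return {name: int or None} for the two parameters described above."""
    values = {}
    for name in ("vm_fault_stop", "vm_update_mode"):
        try:
            values[name] = int((Path(param_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Module not loaded, file not readable, or unexpected contents.
            values[name] = None
    return values
```

Like other module parameters, these are normally set at load time, for example with amdgpu.vm_fault_stop=1 on the kernel command line.
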
> +
> +Decoding a GPUVM Page Fault
> +===========================
> +
> +If you see a GPU page fault in the kernel log, you can decode it to
> +figure out what is going wrong in your application.  A page fault in
> +your kernel log may look something like this:
> +
> +::
> +
> + [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
> +   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
> + VM_L2_PROTECTION_FAULT_STATUS:0x00301030
> +       Faulty UTCL2 client ID: TCP (0x8)
> +       MORE_FAULTS: 0x0
> +       WALKER_ERROR: 0x0
> +       PERMISSION_FAULTS: 0x3
> +       MAPPING_ERROR: 0x0
> +       RW: 0x0
> +
> +First comes the memory hub, either gfxhub or mmhub.  gfxhub is the
> +memory hub used for graphics, compute, and, on some chips, SDMA.  mmhub
> +is the memory hub used for multimedia and, on some chips, SDMA.
> +
> +Next you have the vmid and pasid.  If the vmid is 0, this fault was
> +likely caused by the kernel driver or firmware.  If the vmid is
> +non-0, it is generally a fault in a user application.  The pasid is
> +used to link a vmid to a system process id.  If the process is active
> +when the fault happens, the process information will be printed.
> +
> +The GPU virtual address that caused the fault comes next.
> +
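
The fields described so far (hub, vmid, pasid, faulting address) can be pulled out of the log text mechanically. This is a rough sketch only: the regular expressions are written against the example message quoted above, so other kernel versions may need different patterns, and parse_fault and EXAMPLE are illustrative names.

```python
import re

# Patterns derived from the example message above; not a stable log format.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\].*?page fault \("
    r"src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
)
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def parse_fault(log_text):
    """Extract hub, vmid, pasid and address, or return None if not found."""
    head = FAULT_RE.search(log_text)
    addr = ADDR_RE.search(log_text)
    if head is None or addr is None:
        return None
    vmid = int(head["vmid"])
    return {
        "hub": head["hub"],
        "vmid": vmid,
        "pasid": int(head["pasid"]),
        "address": int(addr["addr"], 16),
        # Rule of thumb from the text: vmid 0 points at the kernel driver or
        # firmware, any other vmid generally at a user application.
        "likely_origin": (
            "kernel driver or firmware" if vmid == 0 else "user application"
        ),
    }


EXAMPLE = (
    "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, "
    "for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)\n"
    "  in page starting at address 0x0000800102800000 "
    "from IH client 0x1b (UTCL2)\n"
)
```

For the example message, parse_fault(EXAMPLE) reports hub gfxhub0 and vmid 3, so the fault is attributed to a user application.
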
> +The client ID indicates the GPU block that caused the fault.
> +Some common client IDs:
> +
> +- CB/DB: The color/depth backend of the graphics pipe
> +- CPF: Command Processor Frontend
> +- CPC: Command Processor Compute
> +- CPG: Command Processor Graphics
> +- TCP/SQC/SQG: Shaders
> +- SDMA: SDMA engines
> +- VCN: Video encode/decode engines
> +- JPEG: JPEG engines
> +
> +PERMISSION_FAULTS describes what faults were encountered:
> +
> +- bit 0: the PTE was not valid
> +- bit 1: the PTE read bit was not set
> +- bit 2: the PTE write bit was not set
> +- bit 3: the PTE execute bit was not set
> +
> +Finally, RW indicates whether the access was a read (0) or a write (1).
> +
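
The PERMISSION_FAULTS and RW values can be decoded bit by bit. A minimal sketch, assuming the inputs are the already-extracted field values as printed in the log (not the raw VM_L2_PROTECTION_FAULT_STATUS register); the function names are illustrative.

```python
# Bit meanings copied from the PERMISSION_FAULTS list above.
PERMISSION_FAULT_BITS = {
    0: "PTE was not valid",
    1: "PTE read bit was not set",
    2: "PTE write bit was not set",
    3: "PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return the description of every bit set in PERMISSION_FAULTS."""
    return [
        text
        for bit, text in PERMISSION_FAULT_BITS.items()
        if value & (1 << bit)
    ]


def decode_rw(value):
    """RW: 0 means the access was a read, 1 means it was a write."""
    return "write" if value & 1 else "read"
```

For the example log, decode_permission_faults(0x3) reports bits 0 and 1 (PTE not valid, read bit not set) and decode_rw(0x0) reports a read.
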
> +In the example above, a shader (client ID = TCP) generated a read (RW
> += 0x0) to an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual
> +address 0x0000800102800000.  The user can then inspect can then
> +inspect their shader

removed the duplicated text above locally.

Alex

> +code and resource descriptor state to determine what caused the GPU page fault.
> +
> +UMR
> +===
> +
> +`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general
> +purpose GPU debugging and diagnostics tool.  Please see the umr
> +documentation for more information about its capabilities.
> diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> index 912e699fd373..847e04924030 100644
> --- a/Documentation/gpu/amdgpu/index.rst
> +++ b/Documentation/gpu/amdgpu/index.rst
> @@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
>     ras
>     thermal
>     driver-misc
> +   debugging
>     amdgpu-glossary
> --
> 2.44.0
>

