[PATCH 1/2] Documentation/gpu: Document how to narrow down display issues

Thu Oct 17 19:00:16 UTC 2024

On 10/16/2024 22:34, Rodrigo Siqueira wrote:
> The amdgpu driver is composed of multiple components, each of which can
> be a source of some specific problem that the user/developer can see.
> This commit introduces steps to narrow down and collect display
> information.
> 
> Cc: Leo Li <sunpeng.li at amd.com>
> Cc: Aurabindo Pillai <aurabindo.pillai at amd.com>
> Cc: Hamza Mahfooz <hamza.mahfooz at amd.com>
> Cc: Harry Wentland <harry.wentland at amd.com>
> Cc: Mario Limonciello <mario.limonciello at amd.com>
> Cc: Christian Konig <christian.koenig at amd.com>
> Cc: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira at amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello at amd.com>
> ---
>   Documentation/gpu/amdgpu/display/dc-debug.rst | 187 ++++++++++++++++++
>   1 file changed, 187 insertions(+)
> 
> diff --git a/Documentation/gpu/amdgpu/display/dc-debug.rst b/Documentation/gpu/amdgpu/display/dc-debug.rst
> index 817631b1dbf3..013f63b271f3 100644
> --- a/Documentation/gpu/amdgpu/display/dc-debug.rst
> +++ b/Documentation/gpu/amdgpu/display/dc-debug.rst
> @@ -2,6 +2,181 @@
>   Display Core Debug tools
>   ========================
>   
> +In this section, you will find helpful information on debugging the amdgpu
> +driver from the display perspective. This page introduces debug mechanisms and
> +procedures to help you identify if some issues are related to display code.
> +
> +Narrow down display issues
> +==========================
> +
> +Since the display is the driver's visual component, it is common to see users
> +reporting issues as display issues when another component causes the problem.
> +This section equips users to determine if a specific issue was caused by the
> +display component or another part of the driver.
> +
> +DC dmesg important messages
> +---------------------------
> +
> +The dmesg log is the first source of information to be checked, and amdgpu
> +takes advantage of this feature by logging some valuable information. When
> +looking for the issues associated with amdgpu, remember that each component of
> +the driver (e.g., smu, PSP, dm, etc.) is loaded one by one, and this
> +information can be found in the dmesg log. In this sense, look for the part of
> +the log that looks like the below log snippet::
> +
> +  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
> +  [    4.254718] [drm] register mmio base: 0xFCB00000
> +  [    4.254918] [drm] register mmio size: 1048576
> +  [    4.260095] [drm] add ip block number 0 <soc21_common>
> +  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
> +  [    4.260510] [drm] add ip block number 2 <ih_v6_0>
> +  [    4.260696] [drm] add ip block number 3 <psp>
> +  [    4.260878] [drm] add ip block number 4 <smu>
> +  [    4.261057] [drm] add ip block number 5 <dm>
> +  [    4.261231] [drm] add ip block number 6 <gfx_v11_0>
> +  [    4.261402] [drm] add ip block number 7 <sdma_v6_0>
> +  [    4.261568] [drm] add ip block number 8 <vcn_v4_0>
> +  [    4.261729] [drm] add ip block number 9 <jpeg_v4_0>
> +  [    4.261887] [drm] add ip block number 10 <mes_v11_0>
> +
> +From the above example, you can see the line reporting that `<dm>`
> +(**Display Manager**) was loaded, which means that display can be part of the
> +issue. If you do not see that line, something else might have failed before
> +amdgpu loaded the display component, indicating that you don't have a
> +display issue.
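
A minimal sketch of how this check could be scripted. It assumes the
"add ip block number N <name>" message format shown in the snippet above;
the function names are hypothetical and not part of any existing tool.

```shell
# List "<number> <name>" for every IP block found in dmesg-style text on stdin.
# Assumes the "add ip block number N <name>" format from the snippet above.
list_ip_blocks() {
    sed -n 's/.*add ip block number \([0-9][0-9]*\) <\([^>]*\)>.*/\1 \2/p'
}

# Exit status 0 if a <dm> block is reported in the text on stdin, 1 otherwise.
dm_was_added() {
    list_ip_blocks | grep -q ' dm$'
}
```

Possible usage: `dmesg | list_ip_blocks` to see the block order, and
`dmesg | dm_was_added && echo "dm present"` for a quick yes/no answer.
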
> +
> +After you have identified that the DM was loaded correctly, you can check for the
> +display version of the hardware in use, which can be retrieved from the dmesg
> +log with the command::
> +
> +  dmesg | grep -i 'display core'
> +
> +This command shows a message that looks like this::
> +
> +  [    4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2
> +
> +This message has two key pieces of information:
> +
> +* **The DC version (e.g., v3.2.285)**: Display developers release a new DC version
> +  every week, and this information can be advantageous in a situation where a
> +  user/developer must find a good point versus a bad point based on a tested
> +  version of the display code. Remember from page :ref:`Display Core <amdgpu-display-core>`,
> +  that every week the new patches for display are heavily tested with IGT and
> +  manual tests.
> +* **The DCN version (e.g., DCN 3.2)**: The DCN block is associated with the
> +  hardware generation, and the DCN version conveys the hardware generation that
> +  the driver is currently running. This information helps to narrow down the
> +  code debug area since each DCN version has its files in the DC folder per DCN
> +  component (from the example, the developer might want to focus on the
> +  files/folders/functions/structs with the dcn32 label, which might be executed).
> +  However, keep in mind that DC reuses code across different DCN versions; for
> +  example, it is expected to have some callbacks set in one DCN that are the same
> +  as those from another DCN. In summary, use the DCN version just as a guide.
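
A small sketch for pulling both pieces of information out of that line. It
assumes the exact "Display Core vX initialized on DCN Y" wording shown
above; the function names are made up for illustration.

```shell
# Print the DC version (e.g. 3.2.285) from dmesg-style text on stdin.
dc_version() {
    sed -n 's/.*Display Core v\([0-9.]*\) initialized on .*/\1/p' | head -n 1
}

# Print the label used in file/function names (e.g. "DCN 3.2" -> dcn32).
dcn_label() {
    sed -n 's/.*initialized on \(DC[EN] [0-9.]*\).*/\1/p' | head -n 1 |
        tr -d ' .' | tr '[:upper:]' '[:lower:]'
}
```

Possible usage: `dmesg | dc_version` and `dmesg | dcn_label`. The label is
only a starting point for searching the DC folder, since code is shared
across DCN versions.
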
> +
> +From the dmesg file, it is also possible to get the ATOM bios code by using::
> +
> +  dmesg  | grep -i 'ATOM BIOS'
> +
> +Which generates an output that looks like this::
> +
> +  [    4.274534] amdgpu: ATOM BIOS: 113-D7020100-102
> +
> +This type of information is useful to include in bug reports.
> +
> +Avoid loading display core
> +--------------------------
> +
> +Sometimes, it might be hard to figure out which part of the driver is causing
> +the issue; if you suspect that the display is not part of the problem and your
> +bug scenario is simple (e.g., some desktop configuration) you can try to remove
> +the display component from the equation. First, you need to identify `dm` ID
> +from the dmesg log; for example, search for the following log::
> +
> +  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
> +  [..]
> +  [    4.260095] [drm] add ip block number 0 <soc21_common>
> +  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
> +  [..]
> +  [    4.261057] [drm] add ip block number 5 <dm>
> +
> +Notice from the above example that the `dm` id is 5 for this specific hardware.
> +Next, you need to run the following binary operation to identify the IP block
> +mask::
> +
> +  0xffffffff & ~(1 << [DM ID])
> +
> +From our example the IP mask is::
> +
> +  0xffffffff & ~(1 << 5) = 0xffffffdf
> +
> +Finally, to disable DC, you just need to add the below parameter to the kernel
> +command line in your bootloader::
> +
> +  amdgpu.ip_block_mask=0xffffffdf
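
The mask arithmetic can be sketched in shell as below. The helper name is
hypothetical; it only reproduces the `0xffffffff & ~(1 << [DM ID])`
operation described above for a given block number.

```shell
# Print the 32-bit IP block mask with the bit for block number $1 cleared.
ip_block_mask_without() {
    printf '0x%08x\n' "$(( 0xffffffff & ~(1 << $1) ))"
}
```

Possible usage: `echo "amdgpu.ip_block_mask=$(ip_block_mask_without 5)"`
prints the parameter for the example hardware, where `dm` is block 5.
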
> +
> +If you can boot your system with the DC disabled and still see the issue, it
> +means you can rule DC out of the equation. However, if the bug disappears, you
> +still need to consider DC as part of the problem and keep narrowing down the
> +issue. In some scenarios, disabling DC is impossible since it might be
> +necessary to use the display component to reproduce the issue (e.g., play a
> +game).
> +
> +**Note: This will probably lead to the absence of a display output.**
> +
> +Display flickering
> +------------------
> +
> +Display flickering might have multiple causes; one is the lack of proper power
> +to the GPU or problems in the DPM switches. A good first generic check is to
> +force the GPU to its highest performance level::
> +
> +   bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
> +
> +The above command sets the GPU/APU to use the maximum power allowed which
> +disables DPM switches. If forcing DPM levels high does not fix the issue, it
> +is less likely that the issue is related to power management. If the issue
> +disappears, there is a good chance that other components are involved, and
> +the display should not be ignored since this could be a DPM issue. From the
> +display side, if the power increase fixes the issue, it is worth debugging the
> +clock configuration and the pipe split policy used in the specific
> +configuration.
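
Since the forced level stays in place until it is changed again, a sketch
like the one below could help restore the previous setting after the test.
The function name is hypothetical, and `card0` as the default is an
assumption (the card index differs on multi-GPU systems).

```shell
# Write performance level $1 to the sysfs node $2 and print the previous level.
# $2 defaults to card0, which is an assumption; adjust for your system.
force_perf_level() {
    local level=$1
    local node=${2:-/sys/class/drm/card0/device/power_dpm_force_performance_level}
    local previous
    previous=$(cat "$node") || return 1
    printf '%s\n' "$level" > "$node" || return 1
    printf '%s\n' "$previous"
}
```

Possible usage, as root: `old=$(force_perf_level high)`, try to reproduce
the flicker, then `force_perf_level "$old" > /dev/null` to go back.
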
> +
> +Display artifacts
> +-----------------
> +
> +Users may see some screen artifacts that can be categorized into two different
> +types: localized artifacts and general artifacts. The localized artifacts
> +happen in some specific areas, such as around the UI window corners; if you see
> +this type of issue, there is a considerable chance that you have a userspace
> +problem, likely Mesa or similar. The general artifacts usually happen on the
> +entire screen. They might be caused by a misconfiguration at the driver level
> +of the display parameters, but the userspace might also cause this issue. One
> +way to identify the source of the problem is to take a screenshot or make a
> +desktop video capture when the problem happens; after checking the
> +screenshot/video recording, if you don't see any of the artifacts, it means
> +that the issue is likely on the driver side. If you can still see the
> +problem in the data collected, it is an issue that probably happened during
> +rendering, and the display code just got the framebuffer already corrupted.
> +
> +Disabling/Enabling specific features
> +====================================
> +
> +DC has a struct named `dc_debug_options`, which is statically initialized by
> +all DCE/DCN components based on the specific hardware characteristics. This
> +structure usually facilitates the bring-up phase since developers can start
> +with many disabled features and enable them individually. This is also an
> +important debug feature since users can change it when debugging specific
> +issues.
> +
> +For example, dGPU users sometimes see a problem where a horizontal strip of
> +flickering happens in some specific part of the screen. This could be an
> +indication of Sub-Viewport issues; after identifying the target DCN, users
> +can set the `force_disable_subvp` field to true in the statically
> +initialized version of `dc_debug_options` to see if the issue gets fixed. Along
> +the same lines, users/developers can also try to turn off `fams2_config` and
> +`enable_single_display_2to1_odm_policy`. In summary, `dc_debug_options` is a
> +useful tool for identifying the problem.
> +
>   DC Visual Confirmation
>   ======================
>   
> @@ -76,6 +251,18 @@ change in real-time by using something like::
>   When reporting a bug related to DC, consider attaching this log before and
>   after you reproduce the bug.
>   
> +Collect Firmware information
> +============================
> +
> +When reporting issues, it is important to have the firmware information since
> +it can be helpful for debugging purposes. To get all the firmware information,
> +use the command::
> +
> +  cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
> +
> +From the display perspective, pay attention to the firmware of the DMCU and
> +DMCUB.
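
A tiny sketch for keeping only the display-related entries. It assumes each
entry in amdgpu_firmware_info starts with the firmware name (for example
"DMCUB feature version: ..."), which may vary between kernel versions; the
function name is made up.

```shell
# Keep only the DMCU and DMCUB lines from firmware-info text on stdin.
# Assumes each line begins with the firmware name followed by a space.
display_firmware() {
    grep -E '^DMCUB? '
}
```

Possible usage, as root:
`display_firmware < /sys/kernel/debug/dri/0/amdgpu_firmware_info`.
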
> +
>   DMUB Firmware Debug
>   ===================
>