[PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU

Deucher, Alexander Alexander.Deucher at amd.com
Mon Nov 2 15:38:51 UTC 2020


[AMD Public Use]

That's the IP discovery table.  The Object header is part of atombios.  Do you have an vbios image from that system?

Alex


________________________________
From: Yin, Tianci (Rico) <Tianci.Yin at amd.com>
Sent: Sunday, November 1, 2020 10:46 PM
To: Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Cui, Flora <Flora.Cui at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Long, Gang <Gang.Long at amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU


[AMD Public Use]

Thanks very much Hawking!


Hi Alex,

The amdgpu_device_ip_get_ip_block() depends on the adev->ip_blocks that initialized by nv_set_ip_blocks().
In nv_set_ip_blocks(), whether add dm_ip_block depends on amdgpu_device_has_dc_support().
And amdgpu_device_has_dc_support() depends on amdgpu_device_asic_has_dc_support(),
So amdgpu_device_asic_has_dc_support() can't conditional on amdgpu_device_ip_get_ip_block(), it makes a dependency loop.

I have just checked the atombios object table in the headless VBIOS, and find DCN and VCN are still there.
[  174.815714] [drm:amdgpu_discovery_reg_base_init [amdgpu]] DMU(271) #0 v2.0.2:
[  174.815771] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x00000012
[  174.815829] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x000000c0
[  174.815887] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x000034c0
[  174.815945] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x00009000
[  174.816002] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x02403c00
[  174.821854] [drm:amdgpu_discovery_reg_base_init [amdgpu]] UVD(12) #0 v2.0.0:
[  174.821908] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x00007800
[  174.821962] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x00007e00
[  174.822017] [drm:amdgpu_discovery_reg_base_init [amdgpu]] 0x02403000
So I think DAL driver can't tell it's a normal ASIC or headless ASIC.

Thanks!
Rico

________________________________
From: Deucher, Alexander <Alexander.Deucher at amd.com>
Sent: Friday, October 30, 2020 22:18
To: Zhang, Hawking <Hawking.Zhang at amd.com>; Yin, Tianci (Rico) <Tianci.Yin at amd.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Cui, Flora <Flora.Cui at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Long, Gang <Gang.Long at amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU


[AMD Public Use]

Maybe it would be easier to check if the DCE IP exists? E.g.,
if (!amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_DCE) ||
(!amdgpu_device_has_dc_support(adev))
...

Also do we even need to skip DCN init for these cards, or will the display code handle it automatically since there will be no displays in the atombios object table.  We didn't do anything special for display with the polaris blockchain cards.

Alex

________________________________
From: Zhang, Hawking <Hawking.Zhang at amd.com>
Sent: Friday, October 30, 2020 9:00 AM
To: Yin, Tianci (Rico) <Tianci.Yin at amd.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Cui, Flora <Flora.Cui at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Long, Gang <Gang.Long at amd.com>
Subject: RE: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU


[AMD Public Use]


[AMD Public Use]



I see, thanks for the clarifying. The patch is



Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>



I was thinking to have a new flag like AMD_IS_HEADLESS in amd_chip_flag, but after think it a bit more, that’s just complicated the case, I agree with your current approach.



Regards,
Hawking

From: Yin, Tianci (Rico) <Tianci.Yin at amd.com>
Sent: Friday, October 30, 2020 20:05
To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Cui, Flora <Flora.Cui at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Long, Gang <Gang.Long at amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU



[AMD Public Use]



Hi Hawking,



amdgpu_device_asic_has_dc_support() is referrenced by amdgpu_pci_probe(),

at that point, adev has not been allocated yet.



My change can make it to right code path.

int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
{

...

if (!amdgpu_device_has_dc_support(adev))

drm_helper_hpd_irq_event(dev);               //right path for headless SKU

else

drm_kms_helper_hotplug_event(dev);      //wrong path for headless SKU

...

}



Thanks!

Rico



________________________________

From: Zhang, Hawking <Hawking.Zhang at amd.com<mailto:Hawking.Zhang at amd.com>>
Sent: Friday, October 30, 2020 19:48
To: Yin, Tianci (Rico) <Tianci.Yin at amd.com<mailto:Tianci.Yin at amd.com>>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org> <amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>>
Cc: Tuikov, Luben <Luben.Tuikov at amd.com<mailto:Luben.Tuikov at amd.com>>; Deucher, Alexander <Alexander.Deucher at amd.com<mailto:Alexander.Deucher at amd.com>>; Chen, Guchun <Guchun.Chen at amd.com<mailto:Guchun.Chen at amd.com>>; Cui, Flora <Flora.Cui at amd.com<mailto:Flora.Cui at amd.com>>; Xu, Feifei <Feifei.Xu at amd.com<mailto:Feifei.Xu at amd.com>>; Long, Gang <Gang.Long at amd.com<mailto:Gang.Long at amd.com>>; Yin, Tianci (Rico) <Tianci.Yin at amd.com<mailto:Tianci.Yin at amd.com>>
Subject: RE: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU



[AMD Public Use]

I'm not sure I get your point on changing amdgpu_device_has_dc_support() interface by adding new parameter. but it seems to me change input parameter from pdev to adev for nv_is_headless_sku is more straightforward.

Regards,
Hawking
-----Original Message-----
From: Tianci Yin <tianci.yin at amd.com<mailto:tianci.yin at amd.com>>
Sent: Friday, October 30, 2020 19:32
To: amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Cc: Tuikov, Luben <Luben.Tuikov at amd.com<mailto:Luben.Tuikov at amd.com>>; Deucher, Alexander <Alexander.Deucher at amd.com<mailto:Alexander.Deucher at amd.com>>; Zhang, Hawking <Hawking.Zhang at amd.com<mailto:Hawking.Zhang at amd.com>>; Chen, Guchun <Guchun.Chen at amd.com<mailto:Guchun.Chen at amd.com>>; Cui, Flora <Flora.Cui at amd.com<mailto:Flora.Cui at amd.com>>; Xu, Feifei <Feifei.Xu at amd.com<mailto:Feifei.Xu at amd.com>>; Long, Gang <Gang.Long at amd.com<mailto:Gang.Long at amd.com>>; Yin, Tianci (Rico) <Tianci.Yin at amd.com<mailto:Tianci.Yin at amd.com>>
Subject: [PATCH] drm/amdgpu: fix NULL pointer crash on navi10 headless SKU

From: "Tianci.Yin" <tianci.yin at amd.com<mailto:tianci.yin at amd.com>>

The crash caused by the NULL pointer of
adev->ddev.mode_config.funcs in drm_kms_helper_hotplug_event(),
but this function should not be called on headless SKU.

Fix the mismatch between the return value of
amdgpu_device_has_dc_support() and the real DCN supporting state to avoid calling to drm_kms_helper_hotplug_event() in amdgpu_device_resume().

Change-Id: I3a3d387e6ab5b774abb3911ea1bf6de60797759d
Signed-off-by: Tianci.Yin <tianci.yin at amd.com<mailto:tianci.yin at amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 10 ++++++++--  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/nv.c             |  2 +-
 drivers/gpu/drm/amd/amdgpu/nv.h             |  1 +
 6 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index ba65d4f2ab67..f0183271456f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1090,7 +1090,7 @@ void amdgpu_device_indirect_wreg64(struct amdgpu_device *adev,
                                    u32 pcie_index, u32 pcie_data,
                                    u32 reg_addr, u64 reg_data);

-bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type);
+bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type,
+struct pci_dev *pdev);
 bool amdgpu_device_has_dc_support(struct amdgpu_device *adev);

 int emu_soc_asic_init(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1fe850e0a94d..323ed69032a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2960,11 +2960,12 @@ static void amdgpu_device_detect_sriov_bios(struct amdgpu_device *adev)
  * amdgpu_device_asic_has_dc_support - determine if DC supports the asic
  *
  * @asic_type: AMD asic type
+ * @pdev: pointer to pci_dev instance
  *
  * Check if there is DC (new modesetting infrastructre) support for an asic.
  * returns true if DC has support, false if not.
  */
-bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type)
+bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type,
+struct pci_dev *pdev)
 {
         switch (asic_type) {
 #if defined(CONFIG_DRM_AMD_DC)
@@ -3000,9 +3001,14 @@ bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type)
         case CHIP_VEGA20:
 #if defined(CONFIG_DRM_AMD_DC_DCN)
         case CHIP_RAVEN:
+               return amdgpu_dc != 0;
         case CHIP_NAVI10:
         case CHIP_NAVI14:
         case CHIP_NAVI12:
+               if (nv_is_headless_sku(pdev))
+                       return false;
+               else
+                       return amdgpu_dc != 0;
         case CHIP_RENOIR:
 #endif
 #if defined(CONFIG_DRM_AMD_DC_DCN3_0)
@@ -3033,7 +3039,7 @@ bool amdgpu_device_has_dc_support(struct amdgpu_device *adev)
         if (amdgpu_sriov_vf(adev) || adev->enable_virtual_display)
                 return false;

-       return amdgpu_device_asic_has_dc_support(adev->asic_type);
+       return amdgpu_device_asic_has_dc_support(adev->asic_type, adev->pdev);
 }


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index 9e92d2a070ac..97014458d7de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -516,7 +516,7 @@ uint32_t amdgpu_display_supported_domains(struct amdgpu_device *adev,
          */
         if ((bo_flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC) &&
             amdgpu_bo_support_uswc(bo_flags) &&
-           amdgpu_device_asic_has_dc_support(adev->asic_type)) {
+           amdgpu_device_asic_has_dc_support(adev->asic_type, adev->pdev)) {
                 switch (adev->asic_type) {
                 case CHIP_CARRIZO:
                 case CHIP_STONEY:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4b78ecfd35f7..b23110241267 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1117,7 +1117,7 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
         bool supports_atomic = false;

         if (!amdgpu_virtual_display &&
-           amdgpu_device_asic_has_dc_support(flags & AMD_ASIC_MASK))
+           amdgpu_device_asic_has_dc_support(flags & AMD_ASIC_MASK, pdev))
                 supports_atomic = true;

         if ((flags & AMD_EXP_HW_SUPPORT) && !amdgpu_exp_hw_support) { diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c index 026e0a8fd526..97446ae75b0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -493,7 +493,7 @@ void nv_set_virt_ops(struct amdgpu_device *adev)
         adev->virt.ops = &xgpu_nv_virt_ops;
 }

-static bool nv_is_headless_sku(struct pci_dev *pdev)
+bool nv_is_headless_sku(struct pci_dev *pdev)
 {
         if ((pdev->device == 0x731E &&
             (pdev->revision == 0xC6 || pdev->revision == 0xC7)) || diff --git a/drivers/gpu/drm/amd/amdgpu/nv.h b/drivers/gpu/drm/amd/amdgpu/nv.h index 515d67bf249f..7880ad0073c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.h
+++ b/drivers/gpu/drm/amd/amdgpu/nv.h
@@ -29,6 +29,7 @@
 void nv_grbm_select(struct amdgpu_device *adev,
                     u32 me, u32 pipe, u32 queue, u32 vmid);  void nv_set_virt_ops(struct amdgpu_device *adev);
+bool nv_is_headless_sku(struct pci_dev *pdev);
 int nv_set_ip_blocks(struct amdgpu_device *adev);  int navi10_reg_base_init(struct amdgpu_device *adev);  int navi14_reg_base_init(struct amdgpu_device *adev);
--
2.17.1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20201102/1b20b6e9/attachment-0001.htm>


More information about the amd-gfx mailing list