[PATCH] drm/amdgpu: Need to set the baco cap before baco reset

Alex Deucher alexdeucher at gmail.com
Mon Jun 17 14:39:13 UTC 2019


On Mon, Jun 17, 2019 at 7:08 AM Russell, Kent <Kent.Russell at amd.com> wrote:
>
> The issue was limited to one specific model of MI25, I'll see if I can get access to that later today and try your patch out. Thank you Emily!

Where is the crash happening in amdgpu_bios.c?  What pointer is NULL?
Presumably it's one of the asic_funcs?  Can we move the call to after
the common early init?  I think that is cleaner.  Something like the
attached patch?

Alex

>
>  Kent
>
> -----Original Message-----
> From: Deng, Emily <Emily.Deng at amd.com>
> Sent: Sunday, June 16, 2019 11:53 PM
> To: Yang, Philip <Philip.Yang at amd.com>; Russell, Kent <Kent.Russell at amd.com>; Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco reset
>
> Hi Philip,
>      Could you help to try whether the attachment patch could help with the issue encounter? If it is OK, then I will send this patch out to review.
>
> Best wishes
> Emily Deng
>
>
>
> >-----Original Message-----
> >From: Deng, Emily
> >Sent: Monday, June 17, 2019 10:50 AM
> >To: Yang, Philip <Philip.Yang at amd.com>; Russell, Kent
> ><Kent.Russell at amd.com>; Quan, Evan <Evan.Quan at amd.com>; amd-
> >gfx at lists.freedesktop.org
> >Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco
> >reset
> >
> >Hi Philip,
> >     Sorry for introduce this issue for you. From the code, I couldn't
> >see any issue. And I have tested the code in my Vega10, it is OK.  So I
> >think this is the kfd specific issue, but I couldn't reproduce issue on
> >my platform. Could you create an ticket, and assign to me, and share me
> >your platform, so I could debug it and fix it today.
> >
> >Best wishes
> >Emily Deng
> >
> >>-----Original Message-----
> >>From: Yang, Philip <Philip.Yang at amd.com>
> >>Sent: Friday, June 14, 2019 10:16 PM
> >>To: Deng, Emily <Emily.Deng at amd.com>; Russell, Kent
> >><Kent.Russell at amd.com>; Quan, Evan <Evan.Quan at amd.com>; amd-
> >>gfx at lists.freedesktop.org
> >>Subject: Re: [PATCH] drm/amdgpu: Need to set the baco cap before baco
> >>reset
> >>
> >>Hi Emily,
> >>
> >>I am not familiar with vbios and driver init part, just based on my
> >>experience, the patch don't modify amdgpu_get_bios but move
> >>amdgpu_get_bios to amdgpu_device_ip_early_init from
> >>amdgpu_device_init, so amdgpu_get_bios is executed earlier. The kernel error message "BUG:
> >>kernel NULL pointer dereference" means something is not initialized.
> >>Please review the change. This issue blocks rocm release now.
> >>
> >>Regards,
> >>Philip
> >>
> >>On 2019-06-13 11:19 p.m., Deng, Emily wrote:
> >>> Hi Russell,
> >>>       This patch will read vbios, and parse vbios to get the baco
> >>> reset feature
> >>bit.  From the call trace, it shows error in " amdgpu_get_bios ", but
> >>this patch don't modify amdgpu_get_bios, and code before
> >>amdgpu_get_bios. Please first check why it will has error when read vbios.
> >>>
> >>> Best wishes
> >>> Emily Deng
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Russell, Kent <Kent.Russell at amd.com>
> >>>> Sent: Thursday, June 13, 2019 7:11 PM
> >>>> To: Quan, Evan <Evan.Quan at amd.com>; Deng, Emily
> >><Emily.Deng at amd.com>;
> >>>> amd-gfx at lists.freedesktop.org
> >>>> Cc: Deng, Emily <Emily.Deng at amd.com>
> >>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before
> >>>> baco reset
> >>>>
> >>>> Hi Emily,
> >>>>
> >>>> This patch caused a regression on MI25 (Chip 6860, VBIOS
> >>>> 113-D0513700-001) machines where the driver would not boot. Note
> >>>> that this was not seen on
> >>>> Vega10 Frontier (Chip 6863, VBIOS 113-D0501100-X09) or Radeon64
> >>>> (Chip 697f). Reverting this patch resolved the issue with no other
> >>>> work required and was confirmed on all 3 machines.
> >>>>
> >>>> Here is the dmesg:
> >>>>
> >>>> [    3.908653] amdgpu 0000:23:00.0: BAR 6: can't assign [??? 0x00000000
> >>flags
> >>>> 0x20000000] (bogus alignment)
> >>>> [    3.908692] BUG: kernel NULL pointer dereference, address:
> >>>> 0000000000000008
> >>>> [    3.908716] #PF: supervisor read access in kernel mode
> >>>> [    3.908734] #PF: error_code(0x0000) - not-present page
> >>>> [    3.908753] PGD 0 P4D 0
> >>>> [    3.908767] Oops: 0000 [#1] SMP NOPTI
> >>>> [    3.909293] CPU: 8 PID: 409 Comm: kworker/8:1 Not tainted 5.2.0-rc1-
> >kfd-
> >>>> compute-roc-master-10734 #1
> >>>> [    3.909753] Hardware name: Inventec     P47
> >>WC2071019001
> >>>> /P47             , BIOS 0.64 04/09/2018
> >>>> [    3.910534] Workqueue: events work_for_cpu_fn
> >>>> [    3.910953] RIP: 0010:amdgpu_get_bios+0x3aa/0x580 [amdgpu]
> >>>> [    3.911314] Code: c0 48 c7 44 24 5c 00 00 00 00 48 c7 84 24 90 00 00 00
> >00
> >>00
> >>>> 00 00 48 89 d9 48 29 f9 83 c1 3c c1 e9 03 f3 48 ab 49 8b 44 24 38
> >>>> <48> 8b 40 08
> >>>> 48 85 c0 74 24 ba 3c 00 00 00 48 89 de 4c 89 e7 ff d0
> >>>> [    3.912069] RSP: 0018:ffffa27dce28fc50 EFLAGS: 00010212
> >>>> [    3.912502] RAX: 0000000000000000 RBX: ffffa27dce28fcac RCX:
> >>>> 0000000000000000
> >>>> [    3.912980] RDX: 0000000000000000 RSI: 0000000000000082 RDI:
> >>>> ffffa27dce28fce8
> >>>> [    3.913467] RBP: 0000000000000000 R08: 0000000000000001 R09:
> >>>> 000000000000079a
> >>>> [    3.913940] R10: 0000000000000000 R11: 0000000000000001 R12:
> >>>> ffff88d657af0000
> >>>> [    3.914349] R13: ffffffffc0c38120 R14: ffffa27dce28fc68 R15:
> >>>> ffff88d657af0000
> >>>> [    3.914767] FS:  0000000000000000(0000) GS:ffff88d65f400000(0000)
> >>>> knlGS:0000000000000000
> >>>> [    3.915203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [    3.915637] CR2: 0000000000000008 CR3: 0000003e7540a000 CR4:
> >>>> 00000000003406e0
> >>>> [    3.916075] Call Trace:
> >>>> [    3.916522]  ? pcie_capability_clear_and_set_word+0x53/0x80
> >>>> [    3.917014]  amdgpu_device_init+0x923/0x1820 [amdgpu]
> >>>> [    3.917515]  amdgpu_driver_load_kms+0x71/0x310 [amdgpu]
> >>>> [    3.917997]  drm_dev_register+0x113/0x1a0 [drm]
> >>>> [    3.918514]  amdgpu_pci_probe+0xb8/0x150 [amdgpu]
> >>>> [    3.919003]  ? __pm_runtime_resume+0x54/0x70
> >>>> [    3.919270] usb 1-2: New USB device found, idVendor=14dd,
> >>idProduct=1005,
> >>>> bcdDevice= 0.00
> >>>> [    3.919498]  local_pci_probe+0x3d/0x90
> >>>> [    3.919503]  ? __schedule+0x3de/0x690
> >>>> [    3.920374] usb 1-2: New USB device strings: Mfr=1, Product=2,
> >>>> SerialNumber=3
> >>>> [    3.921137]  work_for_cpu_fn+0x10/0x20
> >>>> [    3.922028] usb 1-2: Product: D2CIM-VUSB
> >>>> [    3.922815]  process_one_work+0x159/0x3e0
> >>>> [    3.923633] usb 1-2: Manufacturer: Raritan
> >>>> [    3.923635] usb 1-2: SerialNumber: EFFB212D0A6E32F
> >>>> [    3.924416]  worker_thread+0x22b/0x440
> >>>> [    3.924419]  ? rescuer_thread+0x350/0x350
> >>>> [    3.927812]  kthread+0xf6/0x130
> >>>> [    3.928157] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> >>>> [    3.928365]  ? kthread_destroy_worker+0x40/0x40
> >>>> [    3.929401] ata1.00: ATA-10: INTEL SSDSC2KG960G7, SCV10100, max
> >>>> UDMA/133
> >>>> [    3.930101]  ret_from_fork+0x1f/0x30
> >>>> [    3.930103] Modules linked in: amdgpu(+) crct10dif_pclmul crc32_pclmul
> >>>> ghash_clmulni_intel ast amd_iommu_v2 aesni_intel gpu_sched
> >>>> i2c_algo_bit
> >>>> aes_x86_64 ttm crypto_simd drm_kms_helper cryptd glue_helper
> >>>> syscopyarea sysfillrect ahci sysimgblt libahci fb_sys_fops ixgbe(+)
> >>>> drm dca mdio
> >>>> [    3.930965] ata1.00: 1875385008 sectors, multi 1: LBA48 NCQ (depth 32)
> >>>> [    3.931085] ata1.00: configured for UDMA/133
> >>>> [    3.931809] CR2: 0000000000000008
> >>>> [    3.934723] scsi 0:0:0:0: Direct-Access     ATA      INTEL SSDSC2KG96 0100
> >>PQ:
> >>>> 0 ANSI: 5
> >>>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Kent
> >>>>
> >>>> -----Original Message-----
> >>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >>>> Quan, Evan
> >>>> Sent: Monday, May 27, 2019 9:17 PM
> >>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
> >>>> Cc: Deng, Emily <Emily.Deng at amd.com>
> >>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before
> >>>> baco reset
> >>>>
> >>>> Reviewed-by: Evan Quan <evan.quan at amd.com>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >>>>> Emily Deng
> >>>>> Sent: Monday, May 27, 2019 3:43 PM
> >>>>> To: amd-gfx at lists.freedesktop.org
> >>>>> Cc: Deng, Emily <Emily.Deng at amd.com>
> >>>>> Subject: [PATCH] drm/amdgpu: Need to set the baco cap before baco
> >>>>> reset
> >>>>>
> >>>>> For passthrough, after rebooted the VM, driver will do a baco
> >>>>> reset before doing other driver initialization during loading  driver.
> >>>>> For doing the baco reset, it will first check the baco reset capability.
> >>>>> So first need to set the cap from the vbios information or baco
> >>>>> reset won't
> >>>> be enabled.
> >>>>>
> >>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> >>>>> ---
> >>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c         | 24
> >++++++++++--
> >>---
> >>>>> -------
> >>>>>   drivers/gpu/drm/amd/amdgpu/soc15.c                 |  3 ++-
> >>>>>   drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c |  4 ++++
> >>>>>   .../amd/powerplay/hwmgr/vega10_processpptables.c   | 24
> >>>>> ++++++++++++++++++++++
> >>>>>   .../amd/powerplay/hwmgr/vega10_processpptables.h   |  1 +
> >>>>>   5 files changed, 42 insertions(+), 14 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> index b6ded84..7a8c220 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> @@ -1541,6 +1541,17 @@ static int
> >>>>> amdgpu_device_ip_early_init(struct
> >>>>> amdgpu_device *adev)
> >>>>>           if (amdgpu_sriov_vf(adev))
> >>>>>                   adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
> >>>>>
> >>>>> + /* Read BIOS */
> >>>>> + if (!amdgpu_get_bios(adev))
> >>>>> +         return -EINVAL;
> >>>>> +
> >>>>> + r = amdgpu_atombios_init(adev);
> >>>>> + if (r) {
> >>>>> +         dev_err(adev->dev, "amdgpu_atombios_init failed\n");
> >>>>> +         amdgpu_vf_error_put(adev,
> >>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0);
> >>>>> +         return r;
> >>>>> + }
> >>>>> +
> >>>>>           for (i = 0; i < adev->num_ip_blocks; i++) {
> >>>>>                   if ((amdgpu_ip_block_mask & (1 << i)) == 0) {
> >>>>>                           DRM_ERROR("disabled ip block: %d <%s>\n", @@ -
> >>>>> 2591,19 +2602,6 @@ int amdgpu_device_init(struct amdgpu_device
> >>*adev,
> >>>>>                   goto fence_driver_init;
> >>>>>           }
> >>>>>
> >>>>> - /* Read BIOS */
> >>>>> - if (!amdgpu_get_bios(adev)) {
> >>>>> -         r = -EINVAL;
> >>>>> -         goto failed;
> >>>>> - }
> >>>>> -
> >>>>> - r = amdgpu_atombios_init(adev);
> >>>>> - if (r) {
> >>>>> -         dev_err(adev->dev, "amdgpu_atombios_init failed\n");
> >>>>> -         amdgpu_vf_error_put(adev,
> >>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0);
> >>>>> -         goto failed;
> >>>>> - }
> >>>>> -
> >>>>>           /* detect if we are with an SRIOV vbios */
> >>>>>           amdgpu_device_detect_sriov_bios(adev);
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>>> index 78bd4fc..d9fdd95 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>>>> @@ -764,7 +764,8 @@ static bool soc15_need_reset_on_init(struct
> >>>>> amdgpu_device *adev)
> >>>>>           /* Just return false for soc15 GPUs.  Reset does not seem to
> >>>>>            * be necessary.
> >>>>>            */
> >>>>> - return false;
> >>>>> + if (!amdgpu_passthrough(adev))
> >>>>> +         return false;
> >>>>>
> >>>>>           if (adev->flags & AMD_IS_APU)
> >>>>>                   return false;
> >>>>> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
> >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
> >>>>> index ce6aeb5..1d9bb29 100644
> >>>>> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
> >>>>> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
> >>>>> @@ -5311,8 +5311,12 @@ static const struct pp_hwmgr_func
> >>>>> vega10_hwmgr_funcs = {
> >>>>>
> >>>>>   int vega10_hwmgr_init(struct pp_hwmgr *hwmgr)  {
> >>>>> + struct amdgpu_device *adev = hwmgr->adev;
> >>>>> +
> >>>>>           hwmgr->hwmgr_func = &vega10_hwmgr_funcs;
> >>>>>           hwmgr->pptable_func = &vega10_pptable_funcs;
> >>>>> + if (amdgpu_passthrough(adev))
> >>>>> +         return vega10_baco_set_cap(hwmgr);
> >>>>>
> >>>>>           return 0;
> >>>>>   }
> >>>>> diff --git
> >>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
> >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
> >>>>> index b6767d7..83d22cd 100644
> >>>>> ---
> >a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
> >>>>> +++
> >>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
> >>>>> @@ -1371,3 +1371,27 @@ int vega10_get_powerplay_table_entry(struct
> >>>>> pp_hwmgr *hwmgr,
> >>>>>
> >>>>>           return result;
> >>>>>   }
> >>>>> +
> >>>>> +int vega10_baco_set_cap(struct pp_hwmgr *hwmgr) {
> >>>>> + int result = 0;
> >>>>> +
> >>>>> + const ATOM_Vega10_POWERPLAYTABLE *powerplay_table;
> >>>>> +
> >>>>> + powerplay_table = get_powerplay_table(hwmgr);
> >>>>> +
> >>>>> + PP_ASSERT_WITH_CODE((powerplay_table != NULL),
> >>>>> +         "Missing PowerPlay Table!", return -1);
> >>>>> +
> >>>>> + result = check_powerplay_tables(hwmgr, powerplay_table);
> >>>>> +
> >>>>> + PP_ASSERT_WITH_CODE((result == 0),
> >>>>> +                     "check_powerplay_tables failed", return result);
> >>>>> +
> >>>>> + set_hw_cap(
> >>>>> +                 hwmgr,
> >>>>> +                 0 != (le32_to_cpu(powerplay_table->ulPlatformCaps)
> >>>>> & ATOM_VEGA10_PP_PLATFORM_CAP_BACO),
> >>>>> +                 PHM_PlatformCaps_BACO);
> >>>>> + return result;
> >>>>> +}
> >>>>> +
> >>>>> diff --git
> >>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
> >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
> >>>>> index d83ed2a..da5fbec 100644
> >>>>> ---
> >>a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
> >>>>> +++
> >>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
> >>>>> @@ -59,4 +59,5 @@ extern int
> >>>>> vega10_get_number_of_powerplay_table_entries(struct pp_hwmgr
> >>>> *hwmgr);
> >>>>> extern int vega10_get_powerplay_table_entry(struct pp_hwmgr
> >>>>> *hwmgr, uint32_t entry_index,
> >>>>>                   struct pp_power_state *power_state, int
> >>>> (*call_back_func)(struct
> >>>>> pp_hwmgr *, void *,
> >>>>>                                   struct pp_power_state *, void *, uint32_t));
> >>>>> +extern int vega10_baco_set_cap(struct pp_hwmgr *hwmgr);
> >>>>>   #endif
> >>>>> --
> >>>>> 2.7.4
> >>>>>
> >>>>> _______________________________________________
> >>>>> amd-gfx mailing list
> >>>>> amd-gfx at lists.freedesktop.org
> >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx at lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>> _______________________________________________
> >>> amd-gfx mailing list
> >>> amd-gfx at lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-amdgpu-wait-to-fetch-the-vbios-until-after-commo.patch
Type: text/x-patch
Size: 1900 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20190617/850b6e59/attachment.bin>


More information about the amd-gfx mailing list