Looking for pointers on diagnosing ring test failure in amdgpu

Matthew Macy mmacy at nextbsd.org
Tue Jun 14 20:02:42 UTC 2016



---- On Tue, 14 Jun 2016 06:02:09 -0700 Alex Deucher  wrote ---- 
>On Tue, Jun 14, 2016 at 4:10 AM, Christian König 
><christian.koenig at amd.com> wrote: 
>> Hi Matthew, 
>> 
>> see inline below. 
>> 
>> On 14.06.2016 at 00:03, Matthew Macy wrote: 
>>> 
>>> ---- On Mon, 13 Jun 2016 01:35:34 -0700 Christian König 
>>> <christian.koenig at amd.com> wrote ---- 
>>> > Hi Matthew, 
>>> > 
>>> > sounds like the UVD block doesn't want to initialize. No idea off hand 
>>> > why, could be anything. I would need the hardware here for a closer 
>>> > inspection. 
>>> > 
>>> > For a workaround you can try to disable the UVD block using the 
>>> > ip_block_mask module parameter (it's a bitmask of enabled blocks e.g. 
>>> > 0xffffffff means all blocks enabled, UVD is bit 7 on Carrizo IIRC). 
>>> 
>>> 
>>> When I clear bit 7 I get the following now: 
>>> 
>>> Jun 14 07:58:18 trainwreck kernel: drmn0: fence driver on ring 10 use gpu addr 0x00000000400000b0, cpu addr 0x0xfffff800bd4320b0 
>>> Jun 14 07:58:18 trainwreck kernel: drmn0: fence driver on ring 11 use gpu addr 0x00000000400000c0, cpu addr 0x0xfffff800bd4320c0 
>>> Jun 14 07:58:19 trainwreck kernel: drmn0: SMU check loaded firmware failed, expecting 0x17f, getting 0x0[drm:0xffffffff826d4dc4s] *ERROR* amdgpu: smc start failed 
>>> Jun 14 07:58:19 trainwreck kernel: [drm:0xffffffff8269fc40s] *ERROR* hw_init 3 failed -22 
>>> Jun 14 07:58:19 trainwreck kernel: drmn0: amdgpu_init failed 
>> 
>> 
>> UVD is optional (as long as you don't want to do hardware video decoding) 
>> but the SMU isn't. Alex, Rex any idea what's going wrong here? 
>> 
> 
>Seems like maybe the two issues are related. Maybe some general MMIO 
>issue on that particular system or an issue with the MC or gart setup? 
>The firmware that the SMU loads is stored in gart and all of the 
>engine rings are in gart. Maybe a problem with the IOMMU setup on the 
>CPU? 


The two issues are definitely related. They both go through a bounded delay loop waiting for some operation to complete.
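
To make clear what I mean by a bounded delay loop, here is a standalone sketch of the pattern. It is not the driver code: struct fake_reg, fake_reg_read() and poll_until() are hypothetical names invented for this illustration, and the "register" is just a counter that changes value after a fixed number of reads.

```c
#include <stdint.h>

/* Hypothetical stand-in for a hardware register: it keeps returning
 * `value` for `polls_until_ready` reads, then switches to `final_value`. */
struct fake_reg {
    uint32_t value;
    uint32_t final_value;
    unsigned polls_until_ready;
};

static uint32_t fake_reg_read(struct fake_reg *r)
{
    if (r->polls_until_ready == 0)
        r->value = r->final_value;
    else
        r->polls_until_ready--;
    return r->value;
}

/* Bounded poll: read the register at most max_polls times.
 * Returns the zero-based index of the read that matched, or -1 on timeout.
 * A real driver would insert a short fixed delay between reads. */
static int poll_until(struct fake_reg *r, uint32_t expected, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (fake_reg_read(r) == expected)
            return (int)i;
    }
    return -1;
}
```

In a loop of this shape the timeout is only as long as the per-iteration delay really is, so a delay that returns early would show up as exactly this kind of failure. That is a guess about where to look, not something I have confirmed.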

By default FreeBSD doesn't use the IOMMU on x86 so that's not an issue. 

One thing that is different between the EliteBook (A12) and the Thinkpad (A10) is that the Thinkpad has both integrated and discrete GPUs. The discrete GPU's device ID, 0x6660, matches Hainan in drm_pciids.h, which I guess is GCN 1.0? Could that possibly be an issue? I know amdgpu doesn't currently support pre-GCN 1.1 parts, so I would assume it would just be ignored. Nonetheless, I thought I should bring it up just in case.


vgapci0 at pci0:0:1:0:	class=0x030000 card=0x511617aa chip=0x98741002 rev=0xc5 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Carrizo'
    class      = display
    subclass   = VGA
<...>
vgapci1 at pci0:5:0:0:	class=0x038000 card=0x511617aa chip=0x66601002 rev=0x83 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sun XT [Radeon HD 8670A/8670M/8690M / R5 M330]'
    class      = display
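
For anyone not used to pciconf output: the chip= field packs the PCI device ID into the upper 16 bits and the vendor ID into the lower 16 bits. A minimal sketch of the decoding follows; chip_vendor() and chip_device() are helper names made up for this example, not an existing API.

```c
#include <stdint.h>

/* Split a pciconf "chip=" value into its two 16-bit halves.
 * Helper names are hypothetical, for illustration only. */
static uint16_t chip_vendor(uint32_t chip)
{
    return (uint16_t)(chip & 0xffffu);   /* low half: vendor ID */
}

static uint16_t chip_device(uint32_t chip)
{
    return (uint16_t)(chip >> 16);       /* high half: device ID */
}
```

Applied to the two entries above, chip=0x98741002 is vendor 0x1002 with device 0x9874 (the Carrizo APU) and chip=0x66601002 is vendor 0x1002 with device 0x6660 (the discrete part).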

Thanks.
-M


> 
>Alex 
> 
>>> Which is hard to correlate without spending a lot more quality time with 
>>> the driver than I've had time for yet. 
>> 
>> 
>> Yeah, I don't see why some blocks should fail while others seem to 
>> initialize just fine. Especially since you reported it seems to work on 
>> other hardware. 
>> 
>>> One thing that occurs to me is that Linux is usually compiled with gcc6 - 
>>> has amdgpu ever been tested as compiled with clang? 
>> 
>> 
>> Not as far as I know. We had some problems in the past even with some gcc 
>> versions because of some odd things in the BIOS headers (e.g. zero sized 
>> arrays). But those issues should be fixed by now. 
>> 
>>> Below is a list of the warnings I have to disable in order to get amdgpu 
>>> to compile without disabling Werror altogether. The -Wno-format is an 
>>> artifact of clang or FreeBSD treating long long and uint64_t as distinct 
>>> types and the -Wno-pointer-arith is to accept the linux convention of doing 
>>> pointer arithmetic on void pointers. All the others are arguably oversights 
>>> in the code (similar silencing has to be done in i915, but I've had better 
>>> luck with it to date). I haven't fixed the warnings because I try to treat 
>>> it as vendor code and minimize any local changes. Will you accept 
>>> quasi-cosmetic patches from other operating systems / compilers? 
>> 
>> 
>> Yeah, sure feel free to provide patches. As long as it is only cleanup and 
>> not structural changes it should be trivial to get them merged. 
>> 
>> Especially "-Wno-missing-prototypes" and "-Wno-unused-variable" sound like 
>> something which should be trivial to fix. 
>> 
>> Regards, 
>> Christian. 
>> 
>> 
>>> 
>>> Thanks. 
>>> 
>>> -M 
>>> 
>>> 
>>> CWARNFLAGS+= -Wno-pointer-arith 
>>> CWARNFLAGS+= -Wno-pointer-sign ${CWARNFLAGS.${.IMPSRC:T}} 
>>> 
>>> CWARNFLAGS.amdgpu_acpi.c= -Wno-int-conversion -Wno-missing-prototypes -Wno-unused-variable 
>>> CWARNFLAGS.amdgpu_amdkfd.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_bo_list.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_cs.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_device.c= -Wno-format -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_fence.c= -Wno-format 
>>> CWARNFLAGS.amdgpu_gfx.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_amdkfd_gfx_v7.c= -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_amdkfd_gfx_v8.c= -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_atpx_handler.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_ih.c= -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_ioc32.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_object.c= -Wno-format 
>>> CWARNFLAGS.amdgpu_mn.c= -Wno-unused-variable 
>>> CWARNFLAGS.amdgpu_pll.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_pm.c= -Wno-missing-prototypes -Wno-enum-conversion 
>>> CWARNFLAGS.amdgpu_ring.c= -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_ttm.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.amdgpu_ucode.c= -Wno-incompatible-pointer-types-discards-qualifiers -Wno-cast-qual 
>>> CWARNFLAGS.amdgpu_uvd.c= -Wno-format 
>>> CWARNFLAGS.amdgpu_vce.c= -Wno-format 
>>> CWARNFLAGS.amdgpu_vm.c= -Wno-format 
>>> CWARNFLAGS.amdgpu_test.c= -Wno-format 
>>> CWARNFLAGS.atombios_crtc.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.atombios_dp.c= -Wno-format 
>>> CWARNFLAGS.atombios_i2c.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.ci_dpm.c= -Wno-unused-const-variable 
>>> CWARNFLAGS.cz_smc.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.fiji_smc.c= -Wno-cast-qual 
>>> CWARNFLAGS.gfx_v7_0.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.gfx_v8_0.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.iceland_smc.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.kv_dpm.c= -Wno-unused-const-variable 
>>> CWARNFLAGS.tonga_smc.c= -Wno-cast-qual 
>>> CWARNFLAGS.gpu_scheduler.c= -Wno-format -Wno-missing-prototypes 
>>> CWARNFLAGS.amd_powerplay.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.eventtasks.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.cz_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion 
>>> CWARNFLAGS.cz_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.fiji_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.fiji_thermal.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.pp_acpi.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.ppatomctrl.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.processpptables.c= -Wno-missing-prototypes -Wno-sometimes-uninitialized 
>>> CWARNFLAGS.tonga_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion 
>>> CWARNFLAGS.tonga_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.tonga_processpptables.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.tonga_thermal.c= -Wno-missing-prototypes 
>>> CWARNFLAGS.tonga_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> CWARNFLAGS.fiji_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> > 
>>> > Regards, 
>>> > Christian. 
>>> > 
>>> > On 13.06.2016 at 03:35, Matthew Macy wrote: 
>>> > > 
>>> > > I'm trying to bring up amdgpu on Carrizo A10 (Thinkpad e565 in case 
>>> > > it matters) on FreeBSD. The driver is essentially unmodified from what 
>>> > > is found in Linux 4.6 - relying on an extended version of FreeBSD's 
>>> > > linuxkpi shims. The shims work well enough that i915/drm from 4.6 works 
>>> > > extremely well on most hardware (I have yet to diagnose / fix the severe 
>>> > > artifacts on Cherry Trail and Atom). 
>>> > > 
>>> > > On my A10 ring 11 test is failing: 
>>> > > https://gist.github.com/mattmacy/8e4a85072648eceb2445ad227dcc447c 
>>> > > 
>>> > > On my friend's A12 based EliteBook ring initialization succeeds: 
>>> > > https://gist.github.com/mattmacy/d1fac64ab5190bb2568d6480dfbd7ee6 
>>> > > 
>>> > > With minor timing perturbations ring tests will fail as early as ring 0. 
>>> > > 
>>> > > I'm hoping that one of the amdgpu developers might give me pointers 
>>> > > on how to diagnose further and/or what bugs in the linuxkpi might be 
>>> > > causing this. I know that I can selectively disable the rings, but that 
>>> > > doesn't help fix the underlying problem. 
>>> > > 
>>> > > Thanks in advance. 
>>> > > 
>>> > > -M 
>>> > > 
>>> > > _______________________________________________ 
>>> > > dri-devel mailing list 
>>> > > dri-devel at lists.freedesktop.org 
>>> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel 
>>> > 
>>> > 
>>> 
>> 
>
>


