Looking for pointers on diagnosing ring test failure in amdgpu

Matthew Macy mmacy at nextbsd.org
Tue Jun 14 20:08:33 UTC 2016


 > 
 > The two issues are definitely related. They both go through a bounded delay loop waiting for some operation to complete.
 >

I realized that sounded really dumb after I sent it. But what makes me think it's all related is that timing perturbations / random seemingly unrelated code changes can cause it to fail in about 3 or 4 distinct ways: sometimes failing as early as ring 0, other times ring 1, other times ring 11, and in this case on the SMU firmware check. Whereas no matter what I do, on my friend's EliteBook it gets to the point of switching from the efifb to the framebuffer set up by amdgpu. So it looks like I just hit a really unfortunate choice for a bring-up device.
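
For illustration, a minimal sketch of the kind of bounded delay loop being described. This is not the amdgpu code: fake_reg, fake_reg_read and poll_bounded are hypothetical names, and a read counter stands in for real hardware latency. It only shows how a small change in how long the "hardware" takes, relative to a fixed poll budget, flips the result from success to timeout.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device register; a real driver reads MMIO. */
struct fake_reg {
    uint32_t value;        /* current register contents */
    uint32_t ready_after;  /* reads that must elapse before completion */
    uint32_t reads;        /* reads performed so far */
    uint32_t done_value;   /* value that appears once the operation is done */
};

/* Simulated register read: the completion value shows up after a delay. */
static uint32_t fake_reg_read(struct fake_reg *r)
{
    if (r->reads++ >= r->ready_after)
        r->value = r->done_value;
    return r->value;
}

/*
 * Poll until the register holds `expected` or the budget runs out.
 * Returns the index of the successful poll, or -1 on timeout.
 */
static int poll_bounded(struct fake_reg *r, uint32_t expected,
                        uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (fake_reg_read(r) == expected)
            return (int)i;
        /* A real driver would insert a short delay here. */
    }
    return -1;
}
```

With ready_after set to 3, a budget of 4 polls succeeds on the last poll while a budget of 3 times out, so any shim that makes the delay shorter than intended, or the operation slower to complete, would show up as this kind of intermittent failure.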

-M


 > By default FreeBSD doesn't use the IOMMU on x86 so that's not an issue. 
 > 
 > One thing that is different between the EliteBook (A12) and the Thinkpad (A10) is that the Thinkpad has both integrated and discrete GPUs. 0x6660 matches Hainan in drm_pciids.h which I guess is GCN 1.0? Could that possibly be an issue? I know amdgpu doesn't support pre-GCN 1.1 currently, so I would assume it would just be ignored. Nonetheless, I thought I should bring it up just in case.
 > 
 > 
 > vgapci0 at pci0:0:1:0:    class=0x030000 card=0x511617aa chip=0x98741002 rev=0xc5 hdr=0x00
 >     vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
 >     device     = 'Carrizo'
 >     class      = display
 >     subclass   = VGA
 > <...>
 > vgapci1 at pci0:5:0:0:    class=0x038000 card=0x511617aa chip=0x66601002 rev=0x83 hdr=0x00
 >     vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
 >     device     = 'Sun XT [Radeon HD 8670A/8670M/8690M / R5 M330]'
 >     class      = display
 > 
 > Thanks.
 > -M
 > 
 > 
 > > 
 > >Alex 
 > > 
 > >>> Which is hard to correlate without spending a lot more quality time with 
 > >>> the driver than I've had time for yet. 
 > >> 
 > >> 
 > >> Yeah, I don't see why some blocks should fail while others seem to 
 > >> initialize just fine. Especially since you reported it seems to work on 
 > >> other hardware. 
 > >> 
 > >>> One thing that occurs to me is that Linux is usually compiled with gcc6 - 
 > >>> has amdgpu ever been tested as compiled with clang? 
 > >> 
 > >> 
 > >> Not as far as I know. We had some problems in the past even with some gcc 
 > >> versions because of some odd things in the BIOS headers (e.g. zero sized 
 > >> arrays). But those issues should be fixed by now. 
 > >> 
 > >>> Below is a list of the warnings I have to disable in order to get amdgpu 
 > >>> to compile without disabling Werror altogether. The -Wno-format is an 
 > >>> artifact of clang or FreeBSD treating long long and uint64_t as distinct 
 > >>> types and the -Wno-pointer-arith is to accept the linux convention of doing 
 > >>> pointer arithmetic on void pointers. All the others are arguably oversights 
 > >>> in the code (similar silencing has to be done in i915, but I've had better 
 > >>> luck with it to date). I haven't fixed the warnings because I try to treat 
 > >>> it as vendor code and minimize any local changes. Will you accept 
 > >>> quasi-cosmetic patches from other operating systems / compilers? 
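
For illustration, a userland sketch of the two portability points above, assuming a platform where uint64_t is a typedef for unsigned long, so that "%llu" draws a -Wformat warning. The helper names format_u64, format_u64_cast and ptr_add are hypothetical and do not appear in amdgpu; kernel code would normally use the cast form, since PRIu64 comes from the userland <inttypes.h>.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 64-bit value without assuming uint64_t is unsigned long long. */
static int format_u64(char *buf, size_t len, uint64_t v)
{
    /* PRIu64 expands to the correct conversion for this platform. */
    return snprintf(buf, len, "%" PRIu64, v);
}

/* Same output via an explicit cast, which keeps "%llu" warning-free. */
static int format_u64_cast(char *buf, size_t len, uint64_t v)
{
    return snprintf(buf, len, "%llu", (unsigned long long)v);
}

/*
 * Advance a void pointer by a byte offset without relying on the GNU
 * extension that treats sizeof(void) as 1.
 */
static void *ptr_add(void *base, size_t off)
{
    return (char *)base + off;
}
```

Either form would let -Wno-format and -Wno-pointer-arith be dropped for the affected lines without changing behavior.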
 > >> 
 > >> 
 > >> Yeah, sure feel free to provide patches. As long as it is only cleanup and 
 > >> not structural changes it should be trivial to get them merged. 
 > >> 
 > >> Especially "-Wno-missing-prototypes" and "-Wno-unused-variable" sound like 
 > >> something which should be trivial to fix. 
 > >> 
 > >> Regards, 
 > >> Christian. 
 > >> 
 > >> 
 > >>> 
 > >>> Thanks. 
 > >>> 
 > >>> -M 
 > >>> 
 > >>> 
 > >>> CWARNFLAGS+= -Wno-pointer-arith 
 > >>> CWARNFLAGS+= -Wno-pointer-sign ${CWARNFLAGS.${.IMPSRC:T}} 
 > >>> 
 > >>> CWARNFLAGS.amdgpu_acpi.c= -Wno-int-conversion -Wno-missing-prototypes -Wno-unused-variable
 > >>> CWARNFLAGS.amdgpu_amdkfd.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_bo_list.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_cs.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_device.c= -Wno-format -Wno-cast-qual 
 > >>> CWARNFLAGS.amdgpu_fence.c= -Wno-format 
 > >>> CWARNFLAGS.amdgpu_gfx.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_amdkfd_gfx_v7.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.amdgpu_amdkfd_gfx_v8.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.amdgpu_atpx_handler.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_ih.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.amdgpu_ioc32.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_object.c= -Wno-format 
 > >>> CWARNFLAGS.amdgpu_mn.c= -Wno-unused-variable 
 > >>> CWARNFLAGS.amdgpu_pll.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_pm.c= -Wno-missing-prototypes -Wno-enum-conversion
 > >>> CWARNFLAGS.amdgpu_ring.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.amdgpu_ttm.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amdgpu_ucode.c= -Wno-incompatible-pointer-types-discards-qualifiers -Wno-cast-qual
 > >>> CWARNFLAGS.amdgpu_uvd.c= -Wno-format 
 > >>> CWARNFLAGS.amdgpu_vce.c= -Wno-format 
 > >>> CWARNFLAGS.amdgpu_vm.c= -Wno-format 
 > >>> CWARNFLAGS.amdgpu_test.c= -Wno-format 
 > >>> CWARNFLAGS.atombios_crtc.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.atombios_dp.c= -Wno-format 
 > >>> CWARNFLAGS.atombios_i2c.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.ci_dpm.c= -Wno-unused-const-variable 
 > >>> CWARNFLAGS.cz_smc.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.fiji_smc.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.gfx_v7_0.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.gfx_v8_0.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.iceland_smc.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.kv_dpm.c= -Wno-unused-const-variable 
 > >>> CWARNFLAGS.tonga_smc.c= -Wno-cast-qual 
 > >>> CWARNFLAGS.gpu_scheduler.c= -Wno-format -Wno-missing-prototypes 
 > >>> CWARNFLAGS.amd_powerplay.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.eventtasks.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.cz_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion
 > >>> CWARNFLAGS.cz_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.fiji_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.fiji_thermal.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.pp_acpi.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.ppatomctrl.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.processpptables.c= -Wno-missing-prototypes -Wno-sometimes-uninitialized
 > >>> CWARNFLAGS.tonga_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion
 > >>> CWARNFLAGS.tonga_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.tonga_processpptables.c= -Wno-missing-prototypes -Wno-cast-qual
 > >>> CWARNFLAGS.tonga_thermal.c= -Wno-missing-prototypes 
 > >>> CWARNFLAGS.tonga_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> CWARNFLAGS.fiji_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual 
 > >>> 
 > >>> 
 > >>> 
 > >>> 
 > >>> 
 > >>> > 
 > >>> > Regards, 
 > >>> > Christian. 
 > >>> > 
 > >>> > Am 13.06.2016 um 03:35 schrieb Matthew Macy: 
 > >>> > > 
 > >>> > > I'm trying to bring up amdgpu on a Carrizo A10 (Thinkpad e565 in case
 > >>> > > it matters) on FreeBSD. The driver is essentially unmodified from what is
 > >>> > > found in Linux 4.6 - relying on an extended version of FreeBSD's linuxkpi
 > >>> > > shims. The shims work well enough that i915/drm from 4.6 works extremely
 > >>> > > well on most hardware (I have yet to diagnose / fix the severe artifacts on
 > >>> > > Cherry Trail and Atom).
 > >>> > > 
 > >>> > > On my A10 ring 11 test is failing: 
 > >>> > > https://gist.github.com/mattmacy/8e4a85072648eceb2445ad227dcc447c 
 > >>> > > 
 > >>> > > On my friend's A12 based EliteBook ring initialization succeeds: 
 > >>> > > https://gist.github.com/mattmacy/d1fac64ab5190bb2568d6480dfbd7ee6 
 > >>> > > 
 > >>> > > With minor timing perturbations ring tests will fail as early as ring 0.
 > >>> > > 
 > >>> > > I'm hoping that one of the amdgpu developers might give me pointers
 > >>> > > on how to diagnose further and/or what bugs in the linuxkpi might be causing
 > >>> > > this. I know that I can selectively disable the rings, but that doesn't help
 > >>> > > fix the underlying problem.
 > >>> > > 
 > >>> > > Thanks in advance. 
 > >>> > > 
 > >>> > > -M 
 > >>> > > 
 > >>> > > _______________________________________________ 
 > >>> > > dri-devel mailing list 
 > >>> > > dri-devel at lists.freedesktop.org 
 > >>> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel 
 > >>> > 
 > >>> > 
 > >>> 
 > >> 
 > >
 > >
 > 
 > 


