Looking for pointers on diagnosing ring test failure in amdgpu
Christian König
christian.koenig at amd.com
Tue Jun 14 08:10:04 UTC 2016
Hi Matthew,
see inline below.
Am 14.06.2016 um 00:03 schrieb Matthew Macy:
> ---- On Mon, 13 Jun 2016 01:35:34 -0700 Christian König <christian.koenig at amd.com> wrote ----
> > Hi Matthew,
> >
> > sounds like the UVD block doesn't want to initialize. No idea off hand
> > why, could be anything. I would need the hardware here for a closer
> > inspection.
> >
> > For a workaround you can try to disable the UVD block using the
> > ip_block_mask module parameter (it's a bitmask of enabled blocks e.g.
> > 0xffffffff means all blocks enabled, UVD is bit 7 on Carrizo IIRC).
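
The bitmask arithmetic described above can be sketched in C as follows. This is only an illustration: the helper name and the UVD_BLOCK_BIT constant are hypothetical, and bit 7 is the position recalled ("IIRC") in this thread for Carrizo, not a verified value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from this thread: UVD is bit 7 on Carrizo ("IIRC"). */
#define UVD_BLOCK_BIT 7u

/*
 * Hypothetical helper: return `mask` with the bit for one IP block cleared.
 * A set bit means "block enabled", so 0xffffffff enables everything.
 */
static uint32_t ip_mask_without_block(uint32_t mask, unsigned int bit)
{
    assert(bit < 32);                     /* shifting by >= 32 is undefined */
    return mask & ~(UINT32_C(1) << bit);  /* clear exactly that one bit */
}
```

Under that assumption, ip_mask_without_block(0xffffffff, UVD_BLOCK_BIT) evaluates to 0xffffff7f, which is the value one would then supply as ip_block_mask.
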
>
>
> When I clear bit 7 I get the following now:
>
> Jun 14 07:58:18 trainwreck kernel: drmn0: fence driver on ring 10 use gpu addr 0x00000000400000b0, cpu addr 0x0xfffff800bd4320b0
> Jun 14 07:58:18 trainwreck kernel: drmn0: fence driver on ring 11 use gpu addr 0x00000000400000c0, cpu addr 0x0xfffff800bd4320c0
> Jun 14 07:58:19 trainwreck kernel: drmn0: SMU check loaded firmware failed, expecting 0x17f, getting 0x0[drm:0xffffffff826d4dc4s] *ERROR* amdgpu: smc start failed
> Jun 14 07:58:19 trainwreck kernel: [drm:0xffffffff8269fc40s] *ERROR* hw_init 3 failed -22
> Jun 14 07:58:19 trainwreck kernel: drmn0: amdgpu_init failed
UVD is optional (as long as you don't want to do hardware video
decoding) but the SMU isn't. Alex, Rex any idea what's going wrong here?
> Which is hard to correlate without spending a lot more quality time with the driver than I've had time for yet.
Yeah, I don't see why some blocks should fail while others seem to
initialize just fine. Especially since you reported it seems to work on
other hardware.
> One thing that occurs to me is that Linux is usually compiled with gcc6 - has amdgpu ever been tested as compiled with clang?
Not as far as I know. We had some problems in the past even with some
gcc versions because of some odd things in the BIOS headers (e.g. zero
sized arrays). But those issues should be fixed by now.
> Below is a list of the warnings I have to disable in order to get amdgpu to compile without disabling Werror altogether. The -Wno-format is an artifact of clang or FreeBSD treating long long and uint64_t as distinct types and the -Wno-pointer-arith is to accept the Linux convention of doing pointer arithmetic on void pointers. All the others are arguably oversights in the code (similar silencing has to be done in i915, but I've had better luck with it to date). I haven't fixed the warnings because I try to treat it as vendor code and minimize any local changes. Will you accept quasi-cosmetic patches from other operating systems / compilers?
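
To make the two portability warnings above concrete, here is a minimal userspace C sketch of how each can be avoided instead of silenced. The function names are hypothetical and nothing here is taken from the amdgpu sources; inside the kernel the usual idiom would rather be a cast to unsigned long long combined with %llx, since <inttypes.h> is not available there.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * -Wformat: PRIx64 expands to the correct length modifier regardless of
 * whether uint64_t is "unsigned long" or "unsigned long long" on the target.
 */
static int format_gpu_addr(char *buf, size_t len, uint64_t addr)
{
    assert(buf != NULL);
    return snprintf(buf, len, "0x%016" PRIx64, addr);
}

/*
 * -Wpointer-arith: arithmetic on void * is a GNU extension. Going through
 * unsigned char * gives the same byte-wise offset in standard C.
 */
static void *ptr_add_bytes(void *base, size_t offset)
{
    return (unsigned char *)base + offset;
}
```

Both helpers compile cleanly with -Wformat and -Wpointer-arith enabled, which is the property the per-file flags in the list below work around.
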
Yeah, sure, feel free to provide patches. As long as it is only cleanup
and not structural changes it should be trivial to get them merged.
Especially "-Wno-missing-prototypes" and "-Wno-unused-variable" sound
like something which should be trivial to fix.
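
As a rough sketch of what such cleanups usually look like (hypothetical function names, not code from the driver): a helper used only within one file is made static, a function used elsewhere gets a prototype that is visible before its definition, and an unused local is simply deleted.

```c
#include <assert.h>
#include <stddef.h>

/* -Wmissing-prototypes, case 1: the helper is file-local, so give it
 * internal linkage; a static function needs no separate prototype. */
static int example_is_power_of_two(unsigned int v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* -Wmissing-prototypes, case 2: the function is meant to be called from
 * other files, so declare it first (normally in a shared header). */
int example_sum(const int *values, size_t n);

int example_sum(const int *values, size_t n)
{
    int total = 0;
    size_t i;
    /* A leftover local such as "int ret;" that is never read would
     * trigger -Wunused-variable; the fix is to remove the declaration. */

    assert(values != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += values[i];
    return total;
}
```
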
Regards,
Christian.
>
> Thanks.
>
> -M
>
>
> CWARNFLAGS+= -Wno-pointer-arith
> CWARNFLAGS+= -Wno-pointer-sign ${CWARNFLAGS.${.IMPSRC:T}}
>
> CWARNFLAGS.amdgpu_acpi.c= -Wno-int-conversion -Wno-missing-prototypes -Wno-unused-variable
> CWARNFLAGS.amdgpu_amdkfd.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_bo_list.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_cs.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_device.c= -Wno-format -Wno-cast-qual
> CWARNFLAGS.amdgpu_fence.c= -Wno-format
> CWARNFLAGS.amdgpu_gfx.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_amdkfd_gfx_v7.c= -Wno-cast-qual
> CWARNFLAGS.amdgpu_amdkfd_gfx_v8.c= -Wno-cast-qual
> CWARNFLAGS.amdgpu_atpx_handler.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_ih.c= -Wno-cast-qual
> CWARNFLAGS.amdgpu_ioc32.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_object.c= -Wno-format
> CWARNFLAGS.amdgpu_mn.c= -Wno-unused-variable
> CWARNFLAGS.amdgpu_pll.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_pm.c= -Wno-missing-prototypes -Wno-enum-conversion
> CWARNFLAGS.amdgpu_ring.c= -Wno-cast-qual
> CWARNFLAGS.amdgpu_ttm.c= -Wno-missing-prototypes
> CWARNFLAGS.amdgpu_ucode.c= -Wno-incompatible-pointer-types-discards-qualifiers -Wno-cast-qual
> CWARNFLAGS.amdgpu_uvd.c= -Wno-format
> CWARNFLAGS.amdgpu_vce.c= -Wno-format
> CWARNFLAGS.amdgpu_vce.c= -Wno-format
> CWARNFLAGS.amdgpu_vm.c= -Wno-format
> CWARNFLAGS.amdgpu_test.c= -Wno-format
> CWARNFLAGS.amdgpu_vm.c= -Wno-format
> CWARNFLAGS.atombios_crtc.c= -Wno-missing-prototypes
> CWARNFLAGS.atombios_dp.c= -Wno-format
> CWARNFLAGS.atombios_i2c.c= -Wno-missing-prototypes
> CWARNFLAGS.ci_dpm.c= -Wno-unused-const-variable
> CWARNFLAGS.cz_smc.c= -Wno-missing-prototypes
> CWARNFLAGS.fiji_smc.c= -Wno-cast-qual
> CWARNFLAGS.gfx_v7_0.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.gfx_v8_0.c= -Wno-missing-prototypes
> CWARNFLAGS.iceland_smc.c= -Wno-missing-prototypes
> CWARNFLAGS.kv_dpm.c= -Wno-unused-const-variable
> CWARNFLAGS.tonga_smc.c= -Wno-cast-qual
> CWARNFLAGS.gpu_scheduler.c= -Wno-format -Wno-missing-prototypes
> CWARNFLAGS.amd_powerplay.c= -Wno-missing-prototypes
> CWARNFLAGS.eventtasks.c= -Wno-missing-prototypes
> CWARNFLAGS.cz_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion
> CWARNFLAGS.cz_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.fiji_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.fiji_thermal.c= -Wno-missing-prototypes
> CWARNFLAGS.pp_acpi.c= -Wno-missing-prototypes
> CWARNFLAGS.ppatomctrl.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.processpptables.c= -Wno-missing-prototypes -Wno-sometimes-uninitialized
> CWARNFLAGS.tonga_clockpowergating.c= -Wno-missing-prototypes -Wno-enum-conversion
> CWARNFLAGS.tonga_hwmgr.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.tonga_processpptables.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.tonga_thermal.c= -Wno-missing-prototypes
> CWARNFLAGS.tonga_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual
> CWARNFLAGS.fiji_smumgr.c= -Wno-missing-prototypes -Wno-cast-qual
>
> >
> > Regards,
> > Christian.
> >
> > Am 13.06.2016 um 03:35 schrieb Matthew Macy:
> > >
> > > I'm trying to bring up amdgpu on a Carrizo A10 (Thinkpad e565 in case it matters) on FreeBSD. The driver is essentially unmodified from what is found in Linux 4.6 - relying on an extended version of FreeBSD's linuxkpi shims. The shims work well enough that i915/drm from 4.6 works extremely well on most hardware (I have yet to diagnose / fix the severe artifacts on Cherry Trail and Atom).
> > >
> > > On my A10 ring 11 test is failing:
> > > https://gist.github.com/mattmacy/8e4a85072648eceb2445ad227dcc447c
> > >
> > > On my friend's A12 based EliteBook ring initialization succeeds:
> > > https://gist.github.com/mattmacy/d1fac64ab5190bb2568d6480dfbd7ee6
> > >
> > > With minor timing perturbations ring tests will fail as early as ring 0.
> > >
> > > I'm hoping that one of the amdgpu developers might give me pointers on how to diagnose further and/or what bugs in the linuxkpi might be causing this. I know that I can selectively disable the rings, but that doesn't help fix the underlying problem.
> > >
> > > Thanks in advance.
> > >
> > > -M
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >
>