[Bug 108625] AMDGPU - Can't even get Xorg to start - Kernel driver hangs with ring buffer timeout on ARM64

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Thu Nov 1 15:59:10 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=108625

            Bug ID: 108625
           Summary: AMDGPU - Can't even get Xorg to start - Kernel driver
                    hangs with ring buffer timeout on ARM64
           Product: DRI
           Version: unspecified
          Hardware: ARM
                OS: Linux (All)
            Status: NEW
          Severity: blocker
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: raster at rasterman.com

So we're going to have fun with this one...

Start Xorg. It hangs in screen setup:

  #0  ioctl () at ../sysdeps/unix/sysv/linux/aarch64/ioctl.S:25
  #1  0x0000ffffbb149334 in drmIoctl () from /lib/aarch64-linux-gnu/libdrm.so.2
  #2  0x0000ffffba5166b4 in amdgpu_cs_query_fence_status () from /lib/aarch64-linux-gnu/libdrm_amdgpu.so.1
  #3  0x0000ffffb9ef37f8 in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #4  0x0000ffffb9dd148c in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #5  0x0000ffffb993d448 in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #6  0x0000ffffb993d4ac in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #7  0x0000ffffba54425c in ?? () from /usr/lib/xorg/modules/drivers/amdgpu_drv.so
  #8  0x0000ffffba537ca8 in ?? () from /usr/lib/xorg/modules/drivers/amdgpu_drv.so
  #9  0x0000aaaae7133348 in MapWindow ()
  #10 0x0000aaaae710c820 in ?? ()
  #11 0x0000ffffbad52720 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:310

And that ioctl hangs because of:

  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=10, last emitted seq=11
  [drm] GPU recovery disabled.
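
A minimal sketch for pulling the two sequence numbers out of that timeout
message when scanning a longer dmesg. The helper name parse_ring_timeout is
made up for illustration, and reading "emitted minus signaled" as the number
of fences the ring never completed is an interpretation of the message, not
something the log states outright.

```python
import re

# Matches the amdgpu job-timeout line quoted above; \s+ tolerates mail wrapping.
TIMEOUT_RE = re.compile(
    r"ring\s+(?P<ring>\S+)\s+timeout,"
    r"\s+last\s+signaled\s+seq=(?P<signaled>\d+),"
    r"\s+last\s+emitted\s+seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match.

    'pending' is emitted - signaled (hypothetical field name, see lead-in).
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), signaled, emitted, emitted - signaled


log_line = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=10, last emitted seq=11"
)
print(parse_ring_timeout(log_line))
```

For the line above this gives ring "gfx" with one outstanding submission,
which fits the hang showing up on the very first fence wait.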

The amdgpu kernel driver reports:

  [drm] amdgpu kernel modesetting enabled.
  amdgpu 0000:89:00.0: enabling device (0100 -> 0102)
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mc.bin
  amdgpu 0000:89:00.0: BAR 2: releasing [mem 0x14010000000-0x140101fffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 0: releasing [mem 0x14000000000-0x1400fffffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 0: assigned [mem 0x14000000000-0x140ffffffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 2: assigned [mem 0x14100000000-0x141001fffff 64bit pref]
  amdgpu 0000:89:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
  amdgpu 0000:89:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
  [drm] amdgpu: 4096M of VRAM memory ready
  [drm] amdgpu: 4096M of GTT memory ready.
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_pfp_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_me_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_ce_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_rlc.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mec_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mec2_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_sdma.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_sdma1.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_uvd.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_vce.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_k_smc.bin
  [drm] Initialized amdgpu 3.26.0 20150101 for 0000:89:00.0 on minor 1
  amdgpu 0000:89:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
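
A quick sketch of the arithmetic behind the BAR lines in that log. The ranges
are copied from the log; the helper bar_size and the dictionary labels are
made up for illustration. The numbers suggest (as an inference, not something
the log says directly) that BAR 0 was grown from 256 MiB to the full 4096 MiB
of VRAM during probe, while BAR 2 kept its 2 MiB size and only moved.

```python
MIB = 1024 * 1024


def bar_size(start, end):
    """Size in bytes of an inclusive [start, end] PCI memory window."""
    return end - start + 1


# Address ranges as printed in the "releasing" / "assigned" lines.
bars = {
    "BAR 0 released": (0x14000000000, 0x1400FFFFFFF),
    "BAR 0 assigned": (0x14000000000, 0x140FFFFFFFF),
    "BAR 2 released": (0x14010000000, 0x140101FFFFF),
    "BAR 2 assigned": (0x14100000000, 0x141001FFFFF),
}

# Convert every window to whole MiB for easy comparison with the VRAM line.
sizes_mib = {name: bar_size(start, end) // MIB for name, (start, end) in bars.items()}

for name, size in sizes_mib.items():
    print(f"{name}: {size} MiB")
```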

So here is where the fun begins. Kernel is:

  Linux noisy 4.18.0-2-arm64 #1 SMP Debian 4.18.10-2 (2018-10-07) aarch64 GNU/Linux

It's Debian unstable on a Cavium ThunderX2 64-bit ARM system (2 CPUs with 32
cores each, 256 hardware threads total with 4-way SMT enabled) with a bunch of
PCIe slots. There is an Nvidia card that works, to a decent degree, and an
on-board PCIe dumb framebuffer display device (ASPEED), but I'd rather have a
more open stack. I've fiddled with the Xorg config to get it to ignore every
device other than the AMD one, like this:

  Section "ServerFlags"
         Option "AutoAddGPU" "false"
  EndSection

  Section "Device"
         Identifier "amdgpu"
         Driver "amdgpu"
         BusID "PCI:137:0:0"
         Option "DRI" "2"
         Option "TearFree" "on"
  EndSection
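
The BusID above is the kernel's PCI address 0000:89:00.0 with each field
rewritten in decimal, since the kernel prints bus, device and function in hex
while xorg.conf expects decimal. A minimal sketch of that conversion, assuming
PCI domain 0 (the domain is simply dropped here) and using a made-up helper
name xorg_bus_id:

```python
def xorg_bus_id(pci_address):
    """Convert a kernel-style 'DDDD:BB:DD.F' hex address to 'PCI:bus:dev:func'.

    Assumes PCI domain 0; the domain field is parsed but not emitted.
    """
    location, function = pci_address.rsplit(".", 1)
    _domain, bus, device = location.split(":")
    # Each field is hexadecimal in kernel logs; Xorg wants decimal.
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


print(xorg_bus_id("0000:89:00.0"))
```

This prints PCI:137:0:0, matching the Device section above.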

I've even put the AMD card in the slot the Nvidia one was in, with the same
results, so it doesn't seem to be a slot-specific issue. Where should I start
poking to find out where this very early gfx ring timeout originates? I'm
willing to start the fun of compiling kernels etc. to dig through this. So how
can I help solve this and make AMD cards portable and usable? :)
