[OSADL QA 3.18.9-rt4 #1] Radeon driver hangs

Carsten Emde C.Emde at osadl.org
Sun Mar 22 15:14:32 PDT 2015


Hi Michel,

>>>> [..]
>>>> The most striking problem of kernel 3.18.9-rt4 affects all systems that
>>>> are equipped with Radeon graphics (irrespective of whether PCIe cards or
>>>> APUs with on-chip graphics). They suffer from a hanging radeon driver.
>>>> The block occurs when accelerated graphics load is created by x11perf or
>>>> gltestperf. Sometimes only the graphics are frozen while ssh login still
>>>> is possible, sometimes the entire box is no longer accessible at all. In
>>>> any case, a reboot is needed to recover from this situation.
>>>>
>>>> Here is a selection of kernel messages:
>>> [...]
>>> The commits from
>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f957063fee6392bb9365370db6db74dc0b2dce0a
>>>
>>> to
>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd9bb31cd35ab745d3b49005d10616d25bdc
>>>
>>> and
>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b6610101718d4ab90d793c482625e98eb1262cad
>>>
>>> might help for this.
>>
>> Thanks a lot. I have applied these patches to a number of systems:
>> # quilt applied | tail -7
>> patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
>> patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
>> patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
>> patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
>> patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
>> patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
>> patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
>>
>>
>> The graphics boards still crash and freeze the screen, but in contrast
>> to the earlier situation the systems remain accessible, and the X
>> Window server can be restarted after the offending programs are
>> removed. The crashes were reliably triggered by
>> - gltestperf
>>    or
>> - x11perf -repeat 3 -subs 25 -time 2 -rect10
This is not entirely correct, since gltestperf does not reliably crash
the graphics controller. However, "x11perf -repeat 3 -subs 25 -time 2
-rect10" reliably triggers the crash.

>> but the crashes also occur several times per day during normal work
>> such as browsing the Internet or writing a text document. If you wish
>> me to provide additional diagnostic information such as running test
>> programs while the graphic boards are unresponsive, I certainly can do
>> that.
>
> Does it also happen with a kernel built from a current drm-fixes tree?
> http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
No. Apparently, you need full preemption to expose the problem.

The following list shows whether the command "x11perf -repeat 3 -subs 25
-time 2 -rect10" freezes the Radeon board under test (Radeon HD 7970 XFS /
R9 280X):
linux-3.12.33-rt47               no
linux-3.14.34-rt32               no
linux-3.14.34-drm-3.16.7-rt32*   no
linux-3.18.7-rt1                YES
linux-3.18.9-rt4                YES
linux-3.18.9-rt5                YES
linux-3.18.9-drm-3.16.7-rt5**    no
linux-4.0.0-rc4                  no
linux-drm-fixes                  no
*DRM subsystem backported from linux-3.16.7 to linux-3.14.34-rt32.
**DRM subsystem ported from linux-3.16.7 to linux-3.18.9-rt5.
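
A per-kernel check of this kind could be scripted roughly as follows. This
is only a sketch: the helper name freeze_check and the 60-second limit are
hypothetical, and an expired time limit is merely a proxy for a frozen
board, not proof of one.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original test setup.
# Runs a command under a time limit and prints a verdict in the style of
# the list above: "YES" if the limit expired (treated here as a freeze),
# "no" if the command finished, "error" for any other failure.
freeze_check() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case "$status" in
        0)   echo "no" ;;
        124) echo "YES" ;;    # GNU timeout exits with 124 when the limit expires
        *)   echo "error" ;;
    esac
}

# Example call (assumes a running X server; the 60 s limit is a guess):
#   freeze_check 60 x11perf -repeat 3 -subs 25 -time 2 -rect10
```

The verdict strings deliberately match the "no"/"YES" column of the list so
that results from several kernels can be collected in the same form.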

More observations:
If full function tracing is enabled (which makes the system about five
times slower), the graphics controller no longer freezes. With partial
function tracing such as "echo *drm* >set_ftrace_filter", the
controller still freezes. The trace then contains only vblank interrupt
processing; ioctls are no longer executed.
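
For reference, the partial tracing setup mentioned above might look like
the sketch below. The function name setup_drm_trace is hypothetical, and
the tracing directory is an assumption (tracefs is commonly reachable under
/sys/kernel/debug/tracing); root privileges are required on a real system.

```shell
#!/bin/sh
# Hypothetical helper: restrict the function tracer to functions whose
# names match *drm* and switch tracing on.
# $1 is the ftrace control directory (assumed location, see usage below).
setup_drm_trace() {
    dir="$1"
    # Quote the glob so the shell hands it to ftrace unexpanded.
    echo '*drm*' > "$dir/set_ftrace_filter" || return 1
    echo function > "$dir/current_tracer" || return 1
    echo 1 > "$dir/tracing_on" || return 1
}

# Example call (as root, assuming debugfs is mounted in the usual place):
#   setup_drm_trace /sys/kernel/debug/tracing
#   cat /sys/kernel/debug/tracing/trace
```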

This is the location where the driver hangs:
[25104.509258] INFO: task Xorg.bin:16591 blocked for more than 120 seconds.
[25104.516322]       Not tainted 3.18.9-rt5 #2
[25104.520715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25104.528853] Xorg.bin        D ffffffff8171ed90     0 16591  16239 0x10400080
[25104.536102]  ffff8800ba0bb8d8 0000000000000002 ffff8800ba0bbfd8 0000000000000006
[25104.536103]  000000000000dc08 ffff880626d0dc08 ffff8800ba0bbfd8 000000000000dc08
[25104.536104]  ffff88061b2cdcd0 ffff880616d3a940 ffff880035c10000 ffff880616d3a940
[25104.559274] Call Trace:
[25104.561844]  [<ffffffff8171bb54>] schedule+0x34/0xa0
[25104.561846]  [<ffffffff8171e2ac>] schedule_timeout+0x23c/0x2a0
[25104.561870]  [<ffffffffa00e3ab6>] ? radeon_fence_process+0x16/0x40 [radeon]
[25104.561879]  [<ffffffffa00e3b24>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[25104.561887]  [<ffffffffa00e3e97>] radeon_fence_wait_seq_timeout.constprop.8+0x327/0x380 [radeon]
[25104.561889]  [<ffffffff810d19c0>] ? __wake_up_sync+0x20/0x20
[25104.561898]  [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70 [radeon]
[25104.561914]  [<ffffffffa015a36f>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
[25104.561916]  [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561918]  [<ffffffff811d0b4a>] ? __kmalloc+0x8a/0x300
[25104.561932]  [<ffffffffa01b2197>] radeon_ib_get+0x37/0xe0 [radeon]
[25104.561943]  [<ffffffffa01003ee>] radeon_cs_ioctl+0x22e/0x860 [radeon]
[25104.561952]  [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm]
[25104.561954]  [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561956]  [<ffffffff810901ba>] ? unpin_current_cpu+0x1a/0x80
[25104.561959]  [<ffffffff810ba200>] ? migrate_enable+0x90/0x1a0
[25104.561966]  [<ffffffffa00c604c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[25104.561967]  [<ffffffff811fdb88>] do_vfs_ioctl+0x2c8/0x4c0
[25104.561969]  [<ffffffff81208a92>] ? __fget+0x72/0xb0
[25104.561970]  [<ffffffff811fde01>] SyS_ioctl+0x81/0xa0
[25104.561971]  [<ffffffff8171f99e>] tracesys_phase2+0xd4/0xd9
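
To spot such reports in a longer kernel log, a small filter along these
lines may help. It is a sketch only: the function name hung_tasks is
hypothetical, and the pattern assumes the message wording shown in the
trace above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# "name:pid" of every task reported by the hung task detector.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9]* seconds\..*/\1/p'
}

# Example call:
#   dmesg | hung_tasks
```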

Conclusion:
A change in the DRM subsystem between 3.16.7 and 3.18.9 introduced a
race condition that freezes Radeon graphics. Full preemption is required
to expose it reliably.

Thanks,
	-Carsten.


More information about the dri-devel mailing list