[OSADL QA 3.18.9-rt4 #1] Radeon driver hangs
Michel Dänzer
michel at daenzer.net
Tue Mar 24 23:57:34 PDT 2015
On 23.03.2015 07:14, Carsten Emde wrote:
> Hi Michel,
>
>>>>> [..]
>>>>> The most striking problem of kernel 3.18.9-rt4 affects all systems that
>>>>> are equipped with Radeon graphics (irrespective of whether PCIe cards or
>>>>> APUs with on-chip graphics). They suffer from a hanging radeon driver.
>>>>> The block occurs when accelerated graphics load is created by x11perf or
>>>>> gltestperf. Sometimes only the graphics are frozen while ssh login is
>>>>> still possible, sometimes the entire box is no longer accessible at all.
>>>>> In any case, a reboot is needed to recover from this situation.
>>>>>
>>>>> Here is a selection of kernel messages:
>>>> [...]
>>>> The commits from
>>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f957063fee6392bb9365370db6db74dc0b2dce0a
>>>> to
>>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd9bb31cd35ab745d3b49005d10616d25bdc
>>>> and
>>>> http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b6610101718d4ab90d793c482625e98eb1262cad
>>>> might help for this.
>>>
>>> Thanks a lot. I have applied these patches to a number of systems:
>>> # quilt applied | tail -7
>>> patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
>>> patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
>>> patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
>>> patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
>>> patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
>>> patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
>>> patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
>>>
>>> The graphic boards still crash and freeze the screen, but in contrast
>>> to the earlier situation the systems remain accessible, and the X
>>> Window server can be restarted after the offensive programs are
>>> removed. The crashes were reliably triggered by
>>> - gltestperf
>>> or
>>> - x11perf -repeat 3 -subs 25 -time 2 -rect10
> This is not entirely correct, since gltestperf does not reliably crash
> the graphics controller. However, "x11perf -repeat 3 -subs 25 -time 2
> -rect10" always triggers the crash.
>
>>> but the crashes also occur several times per day during normal work
>>> such as browsing the Internet or writing a text document. If you wish
>>> me to provide additional diagnostic information such as running test
>>> programs while the graphic boards are unresponsive, I certainly can do
>>> that.
>>
>> Does it also happen with a kernel built from a current drm-fixes tree?
>> http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
> No. Apparently, you need full preemption to expose the problem.
>
> The following list shows whether the command "x11perf -repeat 3 -subs 25
> -time 2 -rect10" freezes the Radeon board under test (Radeon HD 7970 XFS /
> R9 280X):
> Kernel                            Freezes
> linux-3.12.33-rt47                no
> linux-3.14.34-rt32                no
> linux-3.14.34-drm-3.16.7-rt32*    no
> linux-3.18.7-rt1                  YES
> linux-3.18.9-rt4                  YES
> linux-3.18.9-rt5                  YES
> linux-3.18.9-drm-3.16.7-rt5**     no
> linux-4.0.0-rc4                   no
> linux-drm-fixes                   no
> *DRM subsystem backported from linux-3.16.7 to linux-3.14.34-rt32.
> **DRM subsystem ported from linux-3.16.7 to linux-3.18.9-rt5.
Can you test a non-rt 3.18.y kernel? There were some intermittent issues
around 3.18 fixed by the patches I referenced above. Maybe I missed some
other fixes, though. Maarten, do you remember any other fixes offhand
that might help?
> More observations:
> If full function tracing is enabled (which makes the system about five
> times slower), the graphics controller no longer freezes. With partial
> function tracing such as "echo *drm* >set_ftrace_filter", the
> controller still freezes. The trace then contains vblank interrupt
> processing only; ioctls are no longer executed.
>
> This is the location where the driver hangs:
> [25104.509258] INFO: task Xorg.bin:16591 blocked for more than 120 seconds.
> [25104.516322] Not tainted 3.18.9-rt5 #2
> [25104.520715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [25104.528853] Xorg.bin D ffffffff8171ed90 0 16591 16239 0x10400080
> [25104.536102] ffff8800ba0bb8d8 0000000000000002 ffff8800ba0bbfd8 0000000000000006
> [25104.536103] 000000000000dc08 ffff880626d0dc08 ffff8800ba0bbfd8 000000000000dc08
> [25104.536104] ffff88061b2cdcd0 ffff880616d3a940 ffff880035c10000 ffff880616d3a940
> [25104.559274] Call Trace:
> [25104.561844] [<ffffffff8171bb54>] schedule+0x34/0xa0
> [25104.561846] [<ffffffff8171e2ac>] schedule_timeout+0x23c/0x2a0
> [25104.561870] [<ffffffffa00e3ab6>] ? radeon_fence_process+0x16/0x40 [radeon]
> [25104.561879] [<ffffffffa00e3b24>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
> [25104.561887] [<ffffffffa00e3e97>] radeon_fence_wait_seq_timeout.constprop.8+0x327/0x380 [radeon]
> [25104.561889] [<ffffffff810d19c0>] ? __wake_up_sync+0x20/0x20
> [25104.561898] [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70 [radeon]
> [25104.561914] [<ffffffffa015a36f>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
> [25104.561916] [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
> [25104.561918] [<ffffffff811d0b4a>] ? __kmalloc+0x8a/0x300
> [25104.561932] [<ffffffffa01b2197>] radeon_ib_get+0x37/0xe0 [radeon]
> [25104.561943] [<ffffffffa01003ee>] radeon_cs_ioctl+0x22e/0x860 [radeon]
> [25104.561952] [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm]
> [25104.561954] [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
> [25104.561956] [<ffffffff810901ba>] ? unpin_current_cpu+0x1a/0x80
> [25104.561959] [<ffffffff810ba200>] ? migrate_enable+0x90/0x1a0
> [25104.561966] [<ffffffffa00c604c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
> [25104.561967] [<ffffffff811fdb88>] do_vfs_ioctl+0x2c8/0x4c0
> [25104.561969] [<ffffffff81208a92>] ? __fget+0x72/0xb0
> [25104.561970] [<ffffffff811fde01>] SyS_ioctl+0x81/0xa0
> [25104.561971] [<ffffffff8171f99e>] tracesys_phase2+0xd4/0xd9
>
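For comparing hung-task reports like the one above across kernels, it may help to reduce each trace to the frames the unwinder considers reliable (those without a leading "?"). The following is only a minimal sketch: the names FRAME_RE, parse_frames and reliable_frames are hypothetical helpers, not part of any kernel tooling, and the sample is a shortened excerpt of the trace. It assumes the trace is available as mail-quoted text and tolerates quote markers and wrapped lines.

```python
import re

# One stack frame, e.g.
#   [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70 [radeon]
# An optional "?" marks a frame the unwinder is not sure about;
# the trailing "[module]" is absent for built-in kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[A-Za-z_]\w*)\])?"
)


def parse_frames(trace_text):
    """Return (symbol, module, reliable) tuples in trace order."""
    # Strip mail quote markers at the start of each line.
    unquoted = re.sub(r"(?m)^[ \t]*>+[ \t]?", "", trace_text)
    # Collapse all whitespace so frames wrapped by the mailer are rejoined.
    flat = " ".join(unquoted.split())
    return [
        (m.group("symbol"), m.group("module"), m.group("unreliable") is None)
        for m in FRAME_RE.finditer(flat)
    ]


def reliable_frames(frames, module):
    """Symbols of reliable frames that belong to the given module."""
    return [sym for sym, mod, ok in frames if ok and mod == module]


# Shortened excerpt of the trace, including wrapped lines.
SAMPLE = """\
> [25104.561846] [<ffffffff8171e2ac>] schedule_timeout+0x23c/0x2a0
> [25104.561879] [<ffffffffa00e3b24>] ?
> radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
> [25104.561898] [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70
> [radeon]
> [25104.561914] [<ffffffffa015a36f>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
"""

if __name__ == "__main__":
    for symbol in reliable_frames(parse_frames(SAMPLE), "radeon"):
        print(symbol)
```

Run on the full trace, the reliable radeon frames read, innermost first, from radeon_fence_wait_seq_timeout up to radeon_drm_ioctl, i.e. the command submission ioctl is waiting on a fence while allocating an indirect buffer.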
> Conclusion:
> A change to the DRM subsystem between 3.16.7 and 3.18.9
> introduced a race condition that freezes Radeon graphics. It requires
> full preemption to be exposed reliably.
--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer