[Bug 69723] GPU lockups with kernel 3.11.0 / 3.12-rc1 when dpm=1 on r600g (Cayman)

bugzilla-daemon at freedesktop.org
Sun Dec 8 09:37:37 PST 2013


https://bugs.freedesktop.org/show_bug.cgi?id=69723

--- Comment #54 from Martin Andersson <g02maran at gmail.com> ---
I have a 6950 and I'm seeing exactly the same thing as Alexandre: random hangs
that completely lock up the machine. I can't ssh into it and nothing is printed
to the logs; the only thing that works is a power cycle. If I disable dpm the
machine is stable.

I have run with dpm since 3.11 and I have had the occasional lockup, maybe one
every two weeks. But I started playing some more games recently and noticed
that the lockups became much more frequent. So I decided to investigate.

The method I use to trigger a lockup is to run GpuTest in a loop, with a
10-second sleep after each run. I do this to trigger power level switches. The
arguments to GpuTest are /test=plot3d /benchmark /benchmark_duration_ms=10000
/no_scorebox. At the same time I run piglit quick.tests in a loop. I later
found out that the piglit tests are not essential to get lockups, but I kept
running them for consistency. 20 of these test runs have resulted in a lockup;
of these, the longest lasted 80 minutes and the shortest 3 minutes, with an
average of 23 minutes. The test runs that didn't cause lockups had dpm either
completely disabled or only certain features disabled; those features are
described below. If I run GpuTest constantly, without the sleep and with a
longer benchmark duration, I don't get any lockups (I have done several long
runs, the longest being over six hours).
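
For anyone who wants to repeat the run/sleep cycle, here is a minimal sketch
of such a loop. It is not the exact script used for the results above. The
GpuTest arguments are the ones quoted in this comment; the binary path
"./GpuTest" and the helper name run_cycles are assumptions made for
illustration.

```python
import subprocess
import time

# Arguments as quoted in this comment; the binary path is an assumption.
GPUTEST_CMD = [
    "./GpuTest",
    "/test=plot3d",
    "/benchmark",
    "/benchmark_duration_ms=10000",
    "/no_scorebox",
]


def run_cycles(cmd, cycles, idle_seconds=10, run=subprocess.run, sleep=time.sleep):
    """Run cmd `cycles` times, idling after each run, and return the exit codes.

    The idle period is what lets the GPU drop to a lower power level before
    the next run pushes it back up. `run` and `sleep` are injectable so the
    loop can be exercised without a GPU.
    """
    exit_codes = []
    for _ in range(cycles):
        result = run(cmd)
        exit_codes.append(result.returncode)
        sleep(idle_seconds)
    return exit_codes
```

Calling run_cycles(GPUTEST_CMD, cycles=500) from a shell on the test machine
approximates the procedure. A hard lockup never returns, so note the start
time somewhere outside the machine if you want the time to failure.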

I also tried to find a good commit. I started with
7ad8d0687bb5030c3328bc7229a3183ce179ab25 (drm/radeon/dpm: re-enable state
transitions for Cayman) + the gcc fixes, but I get lockups on that commit as
well. I checked out 3.13-rc2 and started disabling features in ni_dpm_init. I
disabled the following things without any improvement. I reenabled each feature
after I had tested it and cold booted the machine.

eg_pi->smu_uvd_hs
pi->mvdd_control
eg_pi->vddci_control
pi->gfx_clock_gating
pi->mg_clock_gating
pi->mgcgtssm
pi->dynamic_pcie_gen2
pi->thermal_protection
pi->display_gap
pi->dcodt
pi->ulps
eg_pi->abm
eg_pi->mcls
eg_pi->light_sleep
eg_pi->memory_transition
ni_pi->cac_weights->enable_power_containment_by_default
ni_pi->use_power_boost_limit
pi->sclk_ss

eg_pi->pcie_performance_request was already false, so I didn't test it.

I noticed that pi->mvdd_control wasn't set; is that normal?

I don't get any lockups with pi->voltage_control disabled, but I also don't get
any power level switches.
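
One way to check whether power level switches are happening is to sample the
driver's debugfs output and count the changes. This is a rough sketch, not
necessarily how it was checked here. The path
/sys/kernel/debug/dri/0/radeon_pm_info and the "power level N" line format are
assumptions that may differ between kernel versions, and the helper names are
made up for illustration.

```python
import re

# Assumed debugfs location; reading it needs root and a mounted debugfs.
PM_INFO_PATH = "/sys/kernel/debug/dri/0/radeon_pm_info"

# Assumed line format, e.g. "power level 2    sclk: ... mclk: ...".
LEVEL_RE = re.compile(r"power level (\d+)")


def parse_level(text):
    """Return the power level index found in text, or None if absent."""
    match = LEVEL_RE.search(text)
    return int(match.group(1)) if match else None


def count_switches(levels):
    """Count changes between consecutive known levels, ignoring None samples."""
    known = [lv for lv in levels if lv is not None]
    return sum(1 for a, b in zip(known, known[1:]) if a != b)
```

Reading PM_INFO_PATH once per second during a test run, passing each sample
through parse_level, and feeding the resulting list to count_switches gives a
rough count. A count of zero over a whole run would match what I see with
voltage_control disabled.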

If I set eg_pi->dynamic_ac_timing to false, my machine locks up somewhere in
the boot process; I haven't looked into that any further.

However, if I set pi->dynamic_ss to false the lockups disappear. It also works
with dynamic_ss set to true and pi->mclk_ss set to false.

So it seems, at least for me, it has something to do with mclk together with
power level switches. I'm not sure what to test next, but one thing might be to
try to remove the performance power level 2, so that it could only switch
between 0 and 1. But I haven't figured out how to accomplish that yet.
