nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

Konrad Rzeszutek Wilk konrad.wilk at oracle.com
Mon Mar 4 13:41:10 PST 2013


On Mon, Mar 04, 2013 at 08:21:48PM +0100, Martin Peres wrote:
> Hi Konrad,
> 
> On 04/03/2013 19:40, Konrad Rzeszutek Wilk wrote:
> > After git merge ab7826595e9ec51a51f622c5fc91e2f59440481a
> > (Merge tag 'mfd-3.9-1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6)
> > the nouveau driver ends up shutting off the machine when booting.
> >
> >
> > I haven't done a git bisection yet and was wondering if there are some
> > juicy commits I ought to look at?
> 
> Sure, no need to bisect, it is a new (apparently-broken-for-you) feature.
> 
> The code is in /drivers/gpu/drm/nouveau/core/subdev/therm/
> 
> 
> >
> > Here is the serial console:
> 
> 
> > [    6.940628] nouveau  [  PTHERM][0000:00:0d.0] Thermal management: disabled
> > [    6.957474] nouveau  [  PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
> > [    6.966594] nouveau     6.975100] nouveau  [  PTHERM][0000:00:0d.0] Thermal management: automatic
> > [    6.982059] nouveau  [  PTHERM][0000:00:0d.0] temperature (88 C) hit the 'downclock' threshold
> > [    6.990680] nouveau  [  PTHERM][0000:00:0d.0] temperature (88 C) hit the 'critical' threshold
> > [    6.999194] nouveau  [  PTHERM][0000:00:0d.0] temperature (90 C) hit the 'shutdown' threshold
> 
> See, this is strange. If I believe the "programmed thresholds" line,
> the fanboost threshold is at 90°C, downclock is at 95°C, critical
> temperature is at 145°C and shutdown is at 135°C.
> So, from the BIOS side, things seem to be in fairly good shape
> (critical should be lower than shutdown, but that's OK).
> 
> My theory is that your temperature sensor reading is very variable,
> which would set off the shutdown alarm. So, either the sensor needs
> more settling time or the output is genuinely very variable.

You should see it when I boot it under Xen:

[    8.427789] nouveau  [  PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
[    8.427855] nouveau  [  PTHERM][0000:00:0d.0] temperature (222 C) hit the 'fanboost' threshold
[    8.427919] nouveau  [  PTHERM][0000:00:0d.0] Thermal management: automatic
[    8.427973] nouveau  [  PTHERM][0000:00:0d.0] temperature (222 C) hit the 'downclock' threshold
[    8.428036] nouveau  [  PTHERM][0000:00:0d.0] temperature (222 C) hit the 'critical' threshold
[    8.428099] nouveau  [  PTHERM][0000:00:0d.0] temperature (222 C) hit the 'shutdown' threshold

> 
> In the first case, we could fix that by increasing the settling time
> (at the expense of a longer boot period). We could also add a 10s
> wait at boot time before reading the temperature.
> In the latter case, the only solution is to average the
> temperature over several samples. I would need statistics on the
> variability in order to calculate a proper low-pass filter that
> wouldn't be too slow or too RAM/wakeup-intensive.
> 
> I really hope the problem is the settling time!
> 
> 
> Here is what you can do to test the theory:
> 
> Change the mdelay at line 41 of
> /drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c (http://cgit.freedesktop.org/nouveau/linux-2.6/tree/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c#n41)
> from 10 to 1000.
> Please also add an mdelay of 1000 between lines 44 and 45.

Let me do that tomorrow and report my findings.
> 
> If it works with this patch, then try decreasing the delay to 20ms.
> 
> In any case, I'll send some thermal patches tonight to be more
> resistant to long settling times.

Please CC me in case you would also like me to test them with the
mdelay patch.

> 
> Thanks for reporting!

Of course.
> 
> Martin (mupuf)
> 
> 
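
For readers following the thread, the averaging idea mentioned above
(low-pass filtering the temperature over several samples before
comparing it with a threshold) could look roughly like the sketch
below. This is only an illustration and is not taken from the nouveau
sources: the struct, the function names, the smoothing factor and the
minimum sample count are all hypothetical. It uses an integer
exponential moving average, which is one possible low-pass filter that
needs only a few bytes of state.

```c
#include <stdbool.h>

/* Hypothetical filter state; none of these names exist in nouveau. */
struct temp_filter {
	int avg_milli;        /* filtered temperature, millidegrees C */
	unsigned int samples; /* number of readings fed in so far */
	unsigned int shift;   /* smoothing: new sample weight = 1 / 2^shift */
};

static void temp_filter_init(struct temp_filter *f, unsigned int shift)
{
	f->avg_milli = 0;
	f->samples = 0;
	f->shift = shift;
}

/* Feed one raw reading in degrees C; returns the filtered value in degrees C. */
static int temp_filter_update(struct temp_filter *f, int raw_c)
{
	int raw_milli = raw_c * 1000;

	if (f->samples == 0)
		f->avg_milli = raw_milli; /* seed the average with the first reading */
	else
		f->avg_milli += (raw_milli - f->avg_milli) / (1 << f->shift);

	f->samples++;
	return f->avg_milli / 1000;
}

/*
 * Report a threshold as hit only once enough samples have been seen
 * and the filtered (not the raw) temperature has reached it.
 */
static bool temp_filter_over(const struct temp_filter *f, int threshold_c,
			     unsigned int min_samples)
{
	return f->samples >= min_samples &&
	       f->avg_milli >= threshold_c * 1000;
}
```

With shift = 2, readings of 60, 140 and 60 C give filtered values of
60, 80 and 75 C, so a single 140 C spike would not reach the 135 C
shutdown value quoted above. The trade-off is the one already
described: a larger shift rejects more noise but reacts more slowly to
a real rise in temperature. Because the first reading seeds the
average, a bogus first sample still has a large effect, so this does
not replace giving the sensor enough settling time.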

