[Nouveau] Addressing the problem of noisy GPUs under Nouveau

Andy Ritger aritger at nvidia.com
Wed Nov 22 01:29:09 UTC 2017


Hi Martin,

I was asked to clarify a few things:

(1) Are all the user reports of loud fans on Fermi-era GPUs?

(2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon
processor (PMU), which will do basic fan management on its own.  We call this
init ucode "IFR" (Init From ROM).  nvidia.ko will restore the IFR ucode when
unloaded.  I assume the loud fan symptom occurs after Nouveau is loaded and
running, correct?  I.e., this is a problem in Nouveau's fan control
programming, rather than a problem in IFR.

(3) IFR will run until something else is loaded on the Falcon processor (PMU).
On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written
ucode from here:

    drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc

correct?  I only ask to rule out the possibility that IFR and Nouveau are both
attempting to program fans simultaneously.  The symptoms you describe don't
sound like that, but just double checking...

(4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi,
how much does Nouveau strictly depend on Nouveau's PMU ucode?  Would it be an
option to just let IFR continue to manage fans?

(5) Lastly, I was asked how Nouveau determines what fan speed to (attempt
to) program.

Thanks,
- Andy


On Sun, Nov 12, 2017 at 11:15:45PM -0800, John Hubbard wrote:
> On 11/12/2017 06:29 PM, Martin Peres wrote:
> > Hello,
> > 
> > Some users have been complaining for years about their GPU sounding like
> > a jet engine at take-off. Last year, I finally got my hands on one of
> > these GPUs and have been trying to fix this issue on and off since then.
> 
> Some early feedback: can you tell us the exact SKUs you have? And are these
> production boards with production VBIOSes?  
> 
> Normally, it's just our bringup boards that we'd expect to be noisy like 
> this, so we're looking for a few more details.
> 
> thanks,
> John Hubbard
> NVIDIA
> 
> > 
> > After failing to find anything in the HW, I figured out that the duty
> > cycle set by nvidia's proprietary driver would be way under the expected
> > value. By randomly changing values in the unknown tables of the vbios, I
> > found out that there is a fan calibration table at the offset 0x18 in
> > the BIT P table (version 2).
> > 
> > In this table, I identified two major 16-bit parameters at offsets 0xa
> > and 0xc [2]. I named the first one pwm_max and the second one
> > pwm_offset. As expected, these parameters look like a mapping function
> > of the form aX + b. However, after gathering more samples, I found out
> > that the output was not continuous when linearly increasing pwm_offset
> > [1]. Even more funnily, the period of this square function scales
> > linearly with the frequency used for the fan's PWM.
> > 
> > I tried reverse engineering the formula to describe this function, but
> > failed to find a version that would work perfectly for all PWM
> > frequencies. This is the closest I have got [3], and I basically stopped
> > there about a year ago because I could not figure it out and got
> > frustrated :s.
> > 
> > I started again on this project 2 weeks ago, with the intent of finding
> > a good-enough solution for nouveau, and modelling the rest of the
> > equation that would allow me to compute what duty I should set for
> > every wanted fan speed (%). I again mostly succeeded... but it would
> > seem that the interpretation of the table depends on the generation of
> > chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
> > proprietary driver is not consistent about what to do when the
> > computed duty value would be lower than 0 (sometimes we clamp it to 0,
> > sometimes we set it to the same value as the divider, and sometimes we
> > set it to a slightly lower value than the divider).
> > 
> > I have been trying to cover all edge cases by generating a randomized
> > set of values for the PWM frequency, pwm_max, and pwm_offset,
> > flashing the vbios, and iterating from 0% to 100% fan speed while dumping
> > the values set by your driver. Using half a million sample points (which
> > took a week to acquire), my model computes 97% of the values correctly
> > (ignoring off-by-ones), while the remaining 3% are worryingly off (by up
> > to 100%)... It is clear that the code is not trivial and is full of
> > branching, which makes clean-room reverse engineering a chore.
> > 
> > As a final attempt to make a somewhat complete solution, I tried this
> > weekend to make a "safe" model that would still make the GPUs quiet. I
> > managed to improve the pass rate from 97% to 99.6%, but the remaining
> > failures conflict with my previous findings, which are also way more
> > prevalent. In the end, the only completely safe way of driving the fan
> > is the current behaviour of nouveau...
> > 
> > At this point, I am ready to throw in the towel and hardcode parameters
> > in nouveau to address the problem of the loudest GPUs, but this is of
> > course suboptimal. This is why I am asking for your help. Would you have
> > some documentation about this fan calibration table that could help me
> > here? Code would be even more appreciated.
> > 
> > Thanks a lot in advance,
> > Martin
> > 
> > PS: here is most of the code you may want to see:
> > http://fs.mupuf.org/nvidia/fan_calib/
> > 
> > [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> > [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
> > [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
> > 
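For readers following the quoted message, the "aX + b" reading of pwm_max and pwm_offset, together with the three underflow behaviours Martin lists, can be sketched roughly as below. This is only the naive model, which the message itself says does not fully match the proprietary driver. The division by 100, the integer rounding, the treatment of pwm_offset as a signed value, the use of divider - 1 for "slightly lower than the divider", and every function and parameter name are assumptions made for illustration.

```python
def naive_duty(fan_percent, pwm_max, pwm_offset, divider, underflow="zero"):
    """Naive aX + b guess: duty = pwm_max * X / 100 + pwm_offset.

    All names and the exact formula are hypothetical; the quoted message
    reports that the real behaviour is not continuous in pwm_offset.
    """
    if not 0 <= fan_percent <= 100:
        raise ValueError("fan_percent must be within 0..100")

    # a = pwm_max / 100, X = fan_percent, b = pwm_offset (assumed signed).
    duty = (pwm_max * fan_percent) // 100 + pwm_offset

    if duty < 0:
        # The three behaviours reported for a duty that would go below 0.
        if underflow == "zero":
            return 0
        if underflow == "divider":
            return divider
        if underflow == "below_divider":
            return divider - 1  # assumed stand-in for "slightly lower"
        raise ValueError("unknown underflow policy: " + underflow)

    # Assumption: a duty above the divider is meaningless, so cap it.
    return min(duty, divider)
```

Calling naive_duty(50, 1000, 20, 2000) gives 520 under these assumptions, while a negative intermediate value yields 0, divider, or divider - 1 depending on the chosen policy. Which policy applies in which case is exactly what the message says remains unknown.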

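The validation approach described in the quoted message (compare a candidate model against values dumped from the proprietary driver, ignoring off-by-one differences) can be sketched as follows. The sample layout, the tolerance parameter, and the function name are hypothetical; the actual scripts are the ones linked in the message and may work differently.

```python
def pass_rate(samples, model, tolerance=1):
    """Score a candidate model against observed duty values.

    samples: iterable of (inputs, observed_duty), where inputs is a tuple
             of arguments passed to model (hypothetical layout).
    Returns (fraction of samples within tolerance, worst absolute error).
    """
    total = 0
    passed = 0
    worst = 0
    for inputs, observed in samples:
        error = abs(model(*inputs) - observed)
        total += 1
        if error <= tolerance:  # off-by-one differences count as a pass
            passed += 1
        worst = max(worst, error)
    if total == 0:
        raise ValueError("no samples to score")
    return passed / total, worst
```

Reporting the worst error alongside the pass rate matters here: a model can be right for the large majority of samples and still be far off on the rest, which is the situation the message describes.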
