[Nouveau] Addressing the problem of noisy GPUs under Nouveau

John Hubbard jhubbard at nvidia.com
Mon Nov 13 03:12:43 UTC 2017


Hi Martin,

This is just a quick ACK. I've started an internal email thread and
we'll see if we can get back to you soon.

Yes, our thermal and fan control definitely changes a lot with
the various chip architectures. I'm continually impressed by how much
the SW+HW has been able to improve performance per watt, year after
year, but of course the side effect is a very complex system, as
you are seeing. But even so, let's see if there is any sort of
simpler approximation that would work for you here...no promises,
because I'm about to be humbled when the thermal experts respond. :)


thanks,
John Hubbard
NVIDIA

On 11/12/2017 06:29 PM, Martin Peres wrote:
> Hello,
> 
> Some users have been complaining for years about their GPU sounding like
> a jet engine at take off. Last year, I finally got my hands on one of
> these GPUs and have been trying to fix this issue on and off since then.
> 
> After failing to find anything in the HW, I figured out that the duty
> cycle set by nvidia's proprietary driver would be way under the expected
> value. By randomly changing values in the unknown tables of the vbios, I
> found out that there is a fan calibration table at the offset 0x18 in
> the BIT P table (version 2).
> 
> In this table, I identified two major 16-bit parameters at offsets 0xa
> and 0xc [2]. I named the first one pwm_max and the second one
> pwm_offset. As expected, these look like the parameters of a mapping
> function of the form aX + b. However, after gathering more samples, I
> found out that the output was not continuous when linearly increasing
> pwm_offset [1]. Even more funnily, the period of this square function
> is linear with the frequency used for the fan's PWM.
> 
> I tried reverse engineering the formula to describe this function, but
> failed to find a version that would work perfectly for all PWM
> frequencies. This is the closest I have got [3], and I basically
> stopped there about a year ago because I could not figure it out and
> got frustrated :s.
> 
> I started again on this project 2 weeks ago, with the intent of finding
> a good-enough solution for nouveau, and modelling the rest of the
> equation that would allow me to compute what duty I should set for
> every wanted fan speed (%). I again mostly succeeded... but it would
> seem that the interpretation of the table depends on the generation of
> chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
> proprietary driver is not consistent for rules such as what to do when
> the computed duty value is going to be lower than 0 (sometimes it is
> clamped to 0, sometimes it is set to the same value as the divider,
> sometimes it is set to a slightly lower value than the divider).
> 
> I have been trying to cover all edge cases by generating a randomized
> set of values for the PWM frequency, pwm_max, and pwm_offset, flashing
> the vbios, and iterating from 0% to 100% fan speed while dumping the
> values set by your driver. Using half a million sample points (which
> took a week to acquire), my model computes 97% of the values correctly
> (ignoring off-by-one errors), while the remaining 3% are worryingly off
> (by up to 100%)... It is clear that the code is not trivial and is full
> of branching, which makes clean-room reverse engineering a chore.
> 
> As a final attempt to make a somewhat complete solution, I tried this
> weekend to make a "safe" model that would still make the GPUs quiet. I
> managed to improve the pass rate from 97% to 99.6%, but the remaining
> failures conflict with my previous findings, which are also way more
> prevalent. In the end, the only completely-safe way of driving the fan
> is the current behaviour of nouveau...
> 
> At this point, I am ready to throw in the towel and hardcode parameters
> in nouveau to address the problem of the loudest GPUs, but this is of
> course suboptimal. This is why I am asking for your help. Would you have
> some documentation about this fan calibration table that could help me
> here? Code would be even more appreciated.
> 
> Thanks a lot in advance,
> Martin
> 
> PS: here is most of the code you may want to see:
> http://fs.mupuf.org/nvidia/fan_calib/
> 
> [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
> [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
> 
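For readers following along, here is a minimal sketch of the kind of
mapping the quoted mail describes: a linear aX + b function whose result
is clamped, with the three behaviours Martin observed when the computed
duty goes below 0. Everything in it is an assumption made for
illustration. The names fan_duty, clamp_policy, scale and offset are
hypothetical, "slightly lower than the divider" is arbitrarily taken to
mean divider - 1, and how pwm_max and pwm_offset turn into scale and
offset is exactly the open question in this thread, so it is not
modelled here. It is neither the proprietary driver's logic nor
nouveau's.

```c
#include <stdint.h>

/* Hypothetical names for the three negative-duty behaviours reported
 * in the quoted mail. */
enum clamp_policy {
    CLAMP_ZERO,          /* clamp the duty to 0 */
    CLAMP_DIVIDER,       /* set the duty to the divider */
    CLAMP_BELOW_DIVIDER  /* set the duty slightly below the divider */
};

/* Assumed linear mapping: duty = percent * scale / 100 + offset.
 * scale and offset stand in for the unknown "a" and "b"; deriving them
 * from pwm_max and pwm_offset is not attempted here. */
static int32_t fan_duty(int32_t percent, int32_t scale, int32_t offset,
                        int32_t divider, enum clamp_policy policy)
{
    int32_t duty = (percent * scale) / 100 + offset;

    if (duty < 0) {
        switch (policy) {
        case CLAMP_DIVIDER:
            return divider;
        case CLAMP_BELOW_DIVIDER:
            /* Assumption: "slightly lower" means one step lower. */
            return divider > 0 ? divider - 1 : 0;
        case CLAMP_ZERO:
        default:
            return 0;
        }
    }

    /* Never exceed the divider, which corresponds to 100% duty. */
    if (duty > divider)
        duty = divider;

    return duty;
}
```

Making the policy an explicit parameter keeps the inconsistency visible:
the same negative intermediate value gives 0, divider, or divider - 1
depending on which behaviour applies, which is why a single formula
cannot match every sample.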

