[Nouveau] Addressing the problem of noisy GPUs under Nouveau

John Hubbard jhubbard at nvidia.com
Mon Jan 29 07:51:39 UTC 2018


On 01/28/2018 04:05 PM, Martin Peres wrote:
> On 29/01/18 01:24, Martin Peres wrote:
>> On 28/11/17 07:32, John Hubbard wrote:
>>> On 11/23/2017 02:48 PM, Martin Peres wrote:
>>>> On 23/11/17 10:06, John Hubbard wrote:
>>>>> On 11/22/2017 05:07 PM, Martin Peres wrote:
>>>>>> Hey,
>>>>>>
>>>>>> Thanks for your answer, Andy!
>>>>>>
>>>>>> On 22/11/17 04:06, Ilia Mirkin wrote:
>>>>>>> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <aritger at nvidia.com> wrote:
>>>>>>> Martin's question was very long, but it boils down to this:
>>>>>>>
>>>>>>> How do we compute the correct values to write into the e114/e118 pwm
>>>>>>> registers based on the VBIOS contents and current state of the board
>>>>>>> (like temperature).
>>>>>>
>>>>>> Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
>>>>>> GF119+, or 0x200cd/d0 on Kepler+.
>>>>>>
>>>>>> At least, it looks like we know which PWM controller we need to drive, so
>>>>>> I did not want to muddy the water even more by giving register
>>>>>> addresses, rather concentrating on the problem at hand: How to compute
>>>>>> the duty value for the PWM controller.
>>>>>>
>>>>>>>
>>>>>>> We generally do this right, but appear to get it extra-wrong for certain GPUs.
>>>>>>
>>>>>> Yes... So far, we are always safe, but users tend to mind when their
>>>>>> computer sounds like a jumbo jet at takeoff... Who would have thought? :D
>>>>>>
>>>>>> Anyway, looking forward to your answer!
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> One of our firmware engineers thinks that this looks a lot like PWM inversion.
>>>>> For some SKUs, the interpretation of the PWM duty cycle is inverted. That 
>>>>> would probably make it *very* difficult to find a sensible algorithm that 
>>>>> covered all the SKUs, given that some are inverted and others are not.
>>>>>
>>>>> For the noisy GPUs, a very useful experiment would be to try inverting it, 
>>>>> like this:
>>>>>
>>>>> 	pwmDutyCycle = pwmPeriod - pwmDutyCycle;
>>>>>
>>>>> ...and then see if fan control starts behaving closer to how you've actually 
>>>>> programmed it.
>>>>>
>>>>> Would that be easy enough to try out? It should help narrow down the
>>>>> problem at least.
>>>>>
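To make the experiment above concrete, here is a minimal sketch in C. The
function name and parameters are hypothetical (they are not taken from the
Nouveau or NVIDIA sources); the only thing it does is apply the
"pwmPeriod - pwmDutyCycle" flip when the fan is marked as inverted.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * period:   PWM period in controller ticks
 * duty:     requested duty cycle in the same ticks (non-inverted meaning)
 * inverted: true if this board interprets the duty cycle inverted
 * Returns the duty value that would be programmed into the controller.
 */
static uint32_t pwm_apply_inversion(uint32_t period, uint32_t duty, bool inverted)
{
	/* Clamp first so the subtraction below cannot underflow. */
	if (duty > period)
		duty = period;

	return inverted ? period - duty : duty;
}
```

For example, with a period of 100 ticks and a requested duty of 30, the
inverted case would program 70 and the non-inverted case 30.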
>>>>
>>>> Hey John,
>>>>
>>>> Unfortunately, we know about PWM inversion, and one can know which mode
>>>> to use based on the GPIO entry associated with the fan (inverted). We have
>>>> had support for this in Nouveau for a long time. At the very least, this
>>>> is not the problem on my GF108.
>>>>
>>>> I am certain that the problem I am seeing is related to this vbios table
>>>> I wrote about (BIT P, offset 0x18). It is used to compute what PWM duty
>>>> I should use for both 0 and 100% of the fan speed.
>>>>
>>>> Computing the value for 0% fan speed is difficult because of the
>>>> non-continuous nature of some of the functions[1], but I can always
>>>> over-approximate. However, I failed to accurately compute the duty I
>>>> need to write to get the 100% fan speed (I have cases where I greatly
>>>> over-estimate it...).
>>>>
>>>> Could you please check out the vbios table I am pointing at? I am quite
>>>> sure that your documentation will be clearer than my babbling :D
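For readers following along: the two endpoints described above (the duty for
0% and the duty for 100% fan speed) would typically be used something like the
sketch below. This is only an illustration under loud assumptions: it assumes
a plain linear mapping between the endpoints, which nothing in this thread
confirms, and every name in it is hypothetical. How the endpoints themselves
are derived from the vbios table is exactly the open question and is not shown.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * duty_at_0:   PWM duty that corresponds to 0% fan speed
 * duty_at_100: PWM duty that corresponds to 100% fan speed
 * percent:     requested fan speed, 0..100 (larger values are clamped)
 *
 * ASSUMPTION: the mapping between the two endpoints is linear.
 * The result is rounded up, so any rounding error over-approximates
 * the duty rather than under-approximating it.
 */
static uint32_t fan_percent_to_duty(uint32_t duty_at_0, uint32_t duty_at_100,
				    uint32_t percent)
{
	uint64_t span;

	if (percent > 100)
		percent = 100;

	/* Unexpected endpoint ordering: fall back to the 100% endpoint. */
	if (duty_at_100 < duty_at_0)
		return duty_at_100;

	span = (uint64_t)duty_at_100 - duty_at_0;

	/* 64-bit intermediate avoids overflow in span * percent. */
	return duty_at_0 + (uint32_t)((span * percent + 99) / 100);
}
```

With endpoints 100 and 300, a request for 50% yields a duty of 200. An
over-estimated 100% endpoint shifts every intermediate value upward, which
would be consistent with the noisy fans being reported.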
>>>
>>> Yes. We will check on this. There has been some productive discussion 
>>> internally, but it will take some more investigation.
>>>
>>> thanks,
>>> John Hubbard
>>
>> Have the productive discussions panned out?

Yes, we concluded our discussions, and decided that I should study the situation 
and write some documentation.  I just finished my research and writeup late last Friday, 
though, so my colleagues haven't had a chance to review it. Not to put undue
pressure on them, but I'm hoping that will go quickly now. The long pole is
done. :)

I was going to wait until the review was done, to respond, but I wanted to ACK 
this and to let you know that I do realize that the tables below are not directly 
answering your question.

(What happened here is: the new tables below are not actually what I've 
personally been working on; they just happen to be a very good set of supporting 
documentation in the exact same area. One of our teammates was already working 
on these independently, and managed to get them released.)

thanks,
-- 
John Hubbard

> 
> Oh, I see you pushed new vbios documentation:
>  1)
> http://download.nvidia.com/open-gpu-doc/BIOS-Information-Table/1/BIOS-Information-Table.html
>  2)
> http://download.nvidia.com/open-gpu-doc/MemoryClockTable/1/MemoryClockTable.html
>  3)
> http://download.nvidia.com/open-gpu-doc/MemoryTweakTable/1/MemoryTweakTable.html
> 
> Is there any chance of getting the documentation for the "Thermal Coolers
> Table" and the "Thermal Device Table" (the latter does not seem super
> important, though)?
> 
> Anyway, thanks for the new documentation, reverse engineering of power
> management will be greatly simplified as we have a better idea which
> bits will control what. Too bad it won't help for the current issue
> though...
> 
> Thanks,
> Martin
> 

