[Mesa-dev] GPU (and system) monitoring

Gordon Haverland ghaverla at materialisations.com
Tue Nov 21 23:26:42 UTC 2017


On Tue, 21 Nov 2017 15:26:17 +0200
Eero Tamminen <eero.t.tamminen at intel.com> wrote:

> 1. Temperature is affected by the temperature of the surrounding
> media, power usage less so
> 2. Temperature sensor might not be exactly where current load is
> using most power (e.g. ALUs vs. memory)

I would imagine the reason to monitor temperature is to look for
situations where the temperature exceeds a limit.  The sensor value
need not equal the maximum temperature on the GPU; it only has to be
representative of that maximum.  We can then rephrase the GPU's
maximum temperature limit in terms of the maximum observed at the
temperature probe.

In any event, using umr I collected a bit over 900 data points while my
computer (FX-8320e and RX-460) was running an Einstein@Home BOINC job
(Gamma Ray pulsar binary search, a process which requires 1 GPU and
1 CPU core working together).  These jobs typically take about 30
minutes on my computer, and 900 sample points cover much less than
that.  (It is possible that a run ended in the middle of my sampling;
I didn't look into that at the time.)

There is a single temperature reading in the log (which is integer
degrees C), with a minimum of 58C and a maximum of 65C.  The average was
62.3C +/- 1.2C. The median temperature was 62C.
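
The summary figures above (minimum, maximum, mean, standard deviation,
median) can be reproduced with a few lines of Python.  This is only a
minimal sketch: the readings below are made up for illustration, the
name summarize() is hypothetical, and the sample (n-1) standard
deviation is an assumption, since the message does not say which form
was used.

```python
import statistics

def summarize(samples):
    """Return min, max, mean, sample standard deviation and median."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample (n-1) form; an assumption
        "median": statistics.median(samples),
    }

# Made-up integer readings in degrees C, for illustration only
temps_c = [58, 60, 61, 62, 62, 63, 63, 64, 65]
stats = summarize(temps_c)
print("{min}C to {max}C, mean {mean:.1f}C +/- {stdev:.1f}C, "
      "median {median}C".format(**stats))
```

The same function applies unchanged to the power readings once they
are converted from centiwatts to watts.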

There are 3 sensor values (reported as integer centiWatts) that seem to
be "power": AvgGPU, MaxGPU and VDCC.

VDCC tends to be the lowest value.  Minimum seen was 2.63W, maximum was
64.27W.  Median was 30.985W.  Mean was 27 +/- 17W (mean and standard
deviation properly rounded).  The distribution has two modes.
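
"Properly rounded" is not defined in this message.  One common
convention, which appears consistent with the figures quoted here, is
to keep one significant figure in the standard deviation (two when its
leading digit is 1) and then round the mean to the same decimal place.
A sketch under that assumption, with a hypothetical function name and
made-up unrounded inputs:

```python
import math

def round_with_uncertainty(mean, sigma):
    """Round sigma to 1 significant figure (2 if its leading digit is 1),
    then round mean to the same decimal place.  Assumed convention."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = math.floor(math.log10(sigma))   # position of the leading digit
    leading = int(sigma / 10 ** exponent)      # the leading digit itself
    sig_figs = 2 if leading == 1 else 1
    decimals = sig_figs - 1 - exponent         # may be negative for large sigma
    return round(mean, decimals), round(sigma, decimals)

# Made-up unrounded inputs, for illustration only
print(round_with_uncertainty(62.31, 1.23))   # (62.3, 1.2)
print(round_with_uncertainty(27.3, 17.2))    # (27.0, 17.0)
```

Note that Python's round() works on binary floats, so values sitting
exactly on a rounding boundary (such as 0.35) may round down rather
than up.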

Average GPU Power had a minimum of 15.31W and a maximum of 60.54W.
Median was 38.965W.  The mean was 39 +/- 14W (properly rounded).
Distribution again has 2 modes.

Maximum GPU Power tended to be the highest of the three.  Minimum seen
was 8.72W, maximum was 76.86W.  The median was 41.335W.  The mean was
38W +/- 18W (properly rounded).  Two modes were seen.
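
One simple way to check for two modes is to bin the readings and look
for bins that are strictly taller than both neighbours.  The sketch
below assumes a fixed bin width chosen by eye and uses made-up power
values in watts; find_modes() and its parameters are hypothetical
names, and this is not necessarily how the modes above were
identified.

```python
from collections import Counter

def histogram(samples, bin_width):
    """Count samples per bin; bin k covers [k*bin_width, (k+1)*bin_width)."""
    return Counter(int(x // bin_width) for x in samples)

def find_modes(samples, bin_width, min_count=2):
    """Return the centers of bins that are strict local maxima."""
    counts = histogram(samples, bin_width)
    modes = []
    for k in sorted(counts):
        c = counts[k]
        left = counts.get(k - 1, 0)
        right = counts.get(k + 1, 0)
        # Ignore sparsely populated bins so lone outliers are not modes
        if c >= min_count and c > left and c > right:
            modes.append((k + 0.5) * bin_width)
    return modes

# Made-up readings in watts: one cluster near 5W, another near 42W
power_w = [3.1, 4.8, 5.2, 5.9, 6.4, 12.0, 28.5,
           38.2, 40.1, 41.7, 42.3, 44.9, 52.0]
print(find_modes(power_w, bin_width=10))   # [5.0, 45.0]
```

The result depends on the bin width; flat-topped peaks (equal
neighbouring bins) are not reported by this strict comparison.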

I also looked at the ratio of VDCC over Maximum GPU Power.  The minimum
was 0.05 and the maximum was 1.67.  The median was 0.69.  The mean
ratio was 0.66 +/- 0.35 (not properly rounded).  Most of the values are
in a mode about 0.8, but there is also a narrow peak at a ratio "close"
to 0.

For the ratio of Average GPU Power over Maximum GPU Power, the minimum
was 0.27 and the maximum was 2.34.  Median was 1.00.  The mean was
1.00 +/- 0.34 (not properly rounded).  This appears to be just a single
distribution, with a tail extending off to high values of the ratio.
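
The ratios are taken sample by sample, so the centiwatt units cancel
and no conversion is needed.  A minimal sketch, assuming the two
series are already aligned in time; the function name and the
readings are made up for illustration:

```python
import statistics

def per_sample_ratio(numerator, denominator):
    """Element-wise numerator/denominator, skipping zero denominators."""
    return [n / d for n, d in zip(numerator, denominator) if d != 0]

# Made-up readings in centiwatts, for illustration only
avg_gpu_cw = [1000, 4000, 4100, 6000]
max_gpu_cw = [2000, 4000, 0, 3000]

ratio = per_sample_ratio(avg_gpu_cw, max_gpu_cw)
print(ratio)                      # [0.5, 1.0, 2.0]
print(statistics.median(ratio))   # 1.0
```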

The temperature at time t should be a function of power, but not only
of the power at time t.  There will be some interval of time over
which power correlates most strongly with the temperature.  There
will also be some minimum time gap, related to the time it takes for
the heat from a power spike (delta function) at some location in the
GPU to reach the temperature probe.  This is probably easier to
analyse with a step change in power than by trying to approximate a
delta function.
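
One way to look for that time gap in logged data is to correlate the
temperature series against the power series shifted by a range of
lags and pick the lag with the highest correlation.  The sketch below
is an illustration only: it assumes evenly spaced samples, uses
synthetic data with a built-in 3-sample delay, and best_lag() is a
hypothetical helper.  A real thermal response is smeared out rather
than a pure delay, so this would find only the dominant lag.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0   # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def best_lag(power, temperature, max_lag):
    """Return (lag, r) where temperature[t] correlates best with power[t - lag]."""
    best = (0, -2.0)
    for lag in range(max_lag + 1):
        p = power[:len(power) - lag]   # power at time t - lag
        t = temperature[lag:]          # temperature at time t
        r = pearson(p, t)
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic data: step changes in power (W), temperature follows 3 samples later
power = [10] * 4 + [50] * 5 + [10] * 5 + [50] * 3 + [10] * 3
delay = 3
temperature = [50 + 0.2 * power[max(i - delay, 0)] for i in range(len(power))]

lag, r = best_lag(power, temperature, max_lag=6)
print(lag)   # 3
```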

Gord


