[Intel-gfx] [PATCH 1/3] drm/i915: Use consistent forcewake auto-release timeout across kernel configs

Chris Wilson chris at chris-wilson.co.uk
Tue Apr 5 10:44:51 UTC 2016


On Tue, Apr 05, 2016 at 11:02:15AM +0100, Tvrtko Ursulin wrote:
> 
> On 05/04/16 09:59, Chris Wilson wrote:
> >On Tue, Apr 05, 2016 at 09:54:58AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 04/04/16 19:58, Chris Wilson wrote:
> >>>On Mon, Apr 04, 2016 at 05:51:09PM +0100, Tvrtko Ursulin wrote:
> >>>>From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>
> >>>>The current implementation releases the forcewake at any time
> >>>>between immediately and one jiffie after the last put, or after
> >>>>the first automatic grab.
> >>>
> >>>That isn't the problem though. The problem is that we set the timer on
> >>>first use rather than last use. All you are stating here is that by
> >>>lengthening the timeout on your system you reduce the number of times it
> >>>times out.
> >>
> >>It is true the reduction I see is due to lengthening of the average
> >>timeout.
> >>
> >>But with regards to the re-arming approach, I thought so initially
> >>myself; but then, if we expect bursty access it shouldn't matter,
> >>and the simplicity of the current approach is preferable.
> >>
> >>Even in practice, I noticed the effect of lengthening the timeout
> >>is much greater than that of moving the arming to the last access.
> >>And to get down to very few or no auto-releases on busy workloads
> >>we need something in the region of 5ms, which would be a big
> >>change. Or maybe not, if you consider HZ=100 kernels.
> >>
> >>It is very difficult to know what is actually better when balancing
> >>CPU and GPU power against performance. So I thought it best to keep
> >>it similar to the current timings and just fix the dependency on
> >>HZ; the slack available with hrtimers might also help somewhat.
> >>
> >>As a final data point, explicit puts and auto-releases seem to be
> >>relatively balanced in my testing. With this patch T-Rex, for
> >>example, auto-releases in the region of 3-4 times per 10ms, with
> >>around 5-10 explicit gets and 5-10 implicit gets in the same 10ms.
> >>
> >>A different, interrupt-heavy workload (~20k irqs/sec) similarly
> >>manages 2-4 auto-releases per 10ms, with ~250 explicit gets and
> >>~380 implicit gets per 10ms.
> >>
> >>Looking at the two, I think the fact that they auto-release at
> >>relatively similar rates, despite a huge difference in fw gets,
> >>suggests burstiness. So I am not sure that any smarts with the
> >>release period would be interesting. At least not without serious
> >>power/perf measurements.
> >>
> >>>Having said that, the conversion to hrtimer seems sensible; but to
> >>>track the last access as opposed to the first, we either fall back
> >>>to jiffies (in which case hrtimer is moot) or rely on
> >>>ktime_get_raw() being fast enough for every register write. Hmm,
> >>>my usual response to that has been that if it matters we avoid the
> >>>heavyweight macros and use the _FW interface - or even raw
> >>>readl/writel.
> >>>
> >>>Could you try storing ktime_get_raw() on every access and rearming the
> >>>timer if it expires before last-access + min-period?
> >>
> >>That would be similar to your patch from before my holiday, right?
> >>As I said above, I did not notice much change with that approach.
> >>Just extending the timeout has a much greater effect, but as I wrote
> >>above, I am not sure we want to change it.
> >
> >There was just one bug in that patch, in checking the expiration,
> >that makes a huge difference; but if you could please add the
> >discussion above to the commit log, that would be invaluable.
> 
> I think I had a fixed version, would have to dig in branches now.
> 
> I will add more text to the commit.
> 
> Most importantly, do you think this would affect HZ=100 or HZ=250
> kernels at all? Presumably if it could, someone would already have
> noticed a difference between kernels based on HZ?

Hmm, I switch quite regularly - that may explain why I sometimes see
fw_domains_get quite heavy in the profiles and other times not. But I
have never correlated that with changing .configs. And I don't keep
stats for rc6 during stress tests - even more notable is the lack of
light desktop load testing.
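
(For reference, the current jiffie-based window is anywhere from 0 to 1
jiffie, i.e. up to 10ms at HZ=100, 4ms at HZ=250 and 1ms at HZ=1000, so
on paper the configs already differ by an order of magnitude.)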

One interesting side-effect of hrtimer is that iirc it is installed on
the local cpu and isn't run until we hit a preemption point (i.e. the
scheduler in my case), so those who are running high-latency, low-HZ
kernels are likely not affected at all!

As far as hrtimer is concerned, my only real worry is whether it is too
expensive - but since it is increasingly widely used I shouldn't worry
myself too much about that (if it were to be a problem, it would get
fixed!).
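
To put the rearm-on-last-access idea in code, something along these
lines is roughly what is meant - an untested sketch with made-up names
(forcewake_domain, last_access, fw_domain_put()) rather than the real
intel_uncore structures, and with locking elided:

#include <linux/hrtimer.h>
#include <linux/kernel.h>
#include <linux/ktime.h>

/* Minimum time to hold forcewake after the last access (~1 jiffie @ HZ=1000). */
#define FW_AUTO_HOLD_NS NSEC_PER_MSEC

struct forcewake_domain {
	struct hrtimer timer;
	ktime_t last_access;
	/* ... */
};

static void fw_domain_put(struct forcewake_domain *d); /* release path not shown */

/* Record a timestamp on every (automatic) register access. */
static inline void fw_domain_note_access(struct forcewake_domain *d)
{
	d->last_access = ktime_get_raw();
}

static enum hrtimer_restart fw_domain_timer_fn(struct hrtimer *timer)
{
	struct forcewake_domain *d =
		container_of(timer, struct forcewake_domain, timer);
	s64 idle_ns = ktime_to_ns(ktime_sub(ktime_get_raw(), d->last_access));

	if (idle_ns < FW_AUTO_HOLD_NS) {
		/* Accessed within the hold period: push the expiry out
		 * by the remaining time instead of releasing. */
		hrtimer_forward_now(timer, ns_to_ktime(FW_AUTO_HOLD_NS - idle_ns));
		return HRTIMER_RESTART;
	}

	/* Quiescent for a full hold period: drop the auto reference. */
	fw_domain_put(d);
	return HRTIMER_NORESTART;
}

with the timer set up via hrtimer_init(&d->timer, CLOCK_MONOTONIC,
HRTIMER_MODE_REL) and started relatively on the first automatic grab.
The point being that ktime_get_raw() is stored on every access and the
timer only releases once a full hold period has passed without one;
whether that per-access store is cheap enough is exactly the open
question above.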
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

