[Intel-gfx] [PATCH v2] drm/i915: Keep the forcewake timer alive for 1ms past the most recent use

Mon May 22 08:03:06 UTC 2017

On 15/05/2017 13:35, Chris Wilson wrote:
> On Mon, May 15, 2017 at 01:06:47PM +0100, Tvrtko Ursulin wrote:
>>
>> On 15/05/2017 11:41, Chris Wilson wrote:
>>> On Mon, May 15, 2017 at 11:14:32AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 12/05/2017 23:16, Chris Wilson wrote:
>>>>> Currently the timer is armed for 1ms after the first use and is killed
>>>>> immediately, dropping the forcewake as early as possible. However, for
>>>>
>>>> Correct for implicit grabs, but for explicit it is 1-2ms after the last use.
>>>
>> >From the put of the first, we don't rearm the timer on later puts.
>>
>> What do you mean by first? I see the timer getting armed on the last put.
>
> Once the timer is armed, the next put will do nothing.
>
>>>>> very frequent operations the forcewake dance has a large impact on
>>>>> latency and keeping the timer alive until we are idle is preferred. To
>>>>
>>>> What workloads see the difference and by how much?
>>>
>>> The issue I have is that we can't submit nops fast enough using
>>
>> Fast enough for what? You mean just for your liking ie. we can be faster?
>
> We used to be faster (single context nop). And we are still not close to
> ringbuffer. :|
>
> To my liking, but looking at Vk traces they have lots of little batches
> (bordering on the nop) with perhaps one or two big ones. But we do hit
> the lite-restore repeated submission path many times per frame. (Still
> orders of magnitude less impact than gem_exec_nop! ;)
>
>>> lite-restore. A large part of that was from the rpm get/put in the
>>> tasklet, but I suspect ultimately it is the extra mmio/lite-restore that
>>> is slowing us down. Now, I'm happy that the lite-restore to keep port[1]
>>> accessible for the next context is a benefit, so I'm looking at how we
>>> can improve the continual resubmission.
>>
>> Ok.
>>
>>>> At the time I've fixed the auto-release to go from 0-1 jiffies to
>>>> 1-2ms, we talked about this conundrum - whether to consider the
>>>> first grab or last put for the timer. But we decided thorough
>>>> testing is needed to see if this would make a difference and what
>>>> power side effects it might have.
>>>
>>> In the above scenario, it never goes off so we are the paying the worst
>>> price of a useless dance. It's the periodic 1ms poll on an idle system
>>> that will suffer most, but in pratice this will delay turning off by an
>>> extra 1ms - and that may be the difference between 0.1% and 50% in power
>>> consumption :|
>>
>> What is a periodic 1ms poll on the idle system? If the system is
>> idle the auto-release timer will not be running.
>
> Otherwise idle, except for the hypothetical poll to keep arming the
> timer.
>
>> In the rapid short submission scenario I see the auto-release timer
>> potentially racing with the tasklet and needlessly dropping the fw.
>> But rapid short submission is so much faster than the 1-2ms
>> auto-release that this race must be very infrequent.
>>
>> I instead expect mostly to see the timer run and find the
>> domain->wake_count > 0 due a tasklet running in parallel who has
>> grabbed the fw for itself.
>>
>> Worse case scenario sounds like it would be some submission period
>> around the auto-release period but just shifted in "phase", no?
>>
>> So I do see some benefit, but would just want to see some numbers in
>> the commit message and a more precise description of the scenario it
>> improves.
>
> Would have to measure the 99.9th percentiles as we only affect those
> rare racing nops, so the average we report is unaffected - certainly it
> is not the 2us, on average, I'm trying to recover.

I've changed my mind over the weekend and started thinking we should 
have this even if the commit message did not fully explain everything.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko