[Intel-gfx] [PATCH v3] drm/i915: Use exponential backoff for wait_for()
John Harrison
John.C.Harrison at Intel.com
Thu Nov 30 07:15:48 UTC 2017
On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 8:34 AM, John Harrison wrote:
>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>> Since we see the effects for GuC preeption, let's gather some evidence.
>>>>
>>>> (SKL)
>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>
>>>> drm-tip:
>>>>      usecs               : count     distribution
>>>>          0 -> 1          : 0        |                                        |
>>>>          2 -> 3          : 0        |                                        |
>>>>          4 -> 7          : 0        |                                        |
>>>>          8 -> 15         : 44       |                                        |
>>>>         16 -> 31         : 1088     |                                        |
>>>>         32 -> 63         : 832      |                                        |
>>>>         64 -> 127        : 0        |                                        |
>>>>        128 -> 255        : 0        |                                        |
>>>>        256 -> 511        : 12       |                                        |
>>>>        512 -> 1023       : 0        |                                        |
>>>>       1024 -> 2047       : 29899    |*********                               |
>>>>       2048 -> 4095       : 131033   |****************************************|
>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>> could create a tracepoint/kprobe that would output a histogram for each
>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>> spin/sleep, in which case overall metrics are best
>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>> have the 10us sleep.
>>>
>>> We would also need a typical workload to run; it's profile-guided
>>> optimisation, after all. Hmm.
>>> -Chris
>>
>> It took me a while to get back to this, but I've now had a chance to
>> run with this exponential backoff scheme on the original system that
>> showed the problem. It was a slightly messy backport because the
>> customer tree is much older than current nightly, but I'm pretty sure
>> I got it correct. However, I'm not sure what the recommendation is
>> for the two timeout values. Using the default of '10, 10' in the
>> patch, I still get lots of very long delays.
> The currently recommended settings are Wmin=10, Wmax=10 for
> wait_for_us and Wmin=10, Wmax=1000 for wait_for.
>
> Exponential backoff inside wait_for is most helpful when the
> wait_for_us that precedes it is short.
> Setting Wmax below Wmin effectively turns the backoff strategy into
> plain linear waits of Wmin.
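For reference, the sleeping wait being discussed works roughly like the
sketch below. This is a simplified stand-in rather than the exact
_wait_for() macro from the patch, but it shows how Wmin/Wmax drive the
sleep and why Wmax <= Wmin degenerates into fixed Wmin sleeps:

#include <linux/delay.h>	/* usleep_range() */
#include <linux/errno.h>
#include <linux/jiffies.h>	/* jiffies, time_after(), usecs_to_jiffies() */
#include <linux/types.h>

/*
 * Simplified sketch of a sleeping wait with exponential backoff:
 * sleep Wmin us first, then double the sleep on every iteration,
 * capped at Wmax us, until the condition holds or the timeout expires.
 */
static int backoff_wait(bool (*cond)(void), unsigned long timeout_us,
			unsigned long wmin_us, unsigned long wmax_us)
{
	unsigned long timeout = jiffies + usecs_to_jiffies(timeout_us) + 1;
	unsigned long wait_us = wmin_us;

	for (;;) {
		bool expired = time_after(jiffies, timeout);

		if (cond())
			return 0;
		if (expired)
			return -ETIMEDOUT;

		usleep_range(wait_us, wait_us * 2);
		if (wait_us < wmax_us)
			wait_us <<= 1;	/* with Wmax <= Wmin this never fires */
	}
}

With Wmin=10/Wmax=10 the shift never happens, so the loop just sleeps
10us on every pass, i.e. the linear behaviour described above.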
>> I have to up the Wmin value to at least 140 to get a stall-free
>> result. That is plausible given that the big spike in the results of
>> any fast version is at 110-150us. Also of note is that a Wmin between
>> 10 and 110 actually makes things worse. Changing Wmax has no effect.
>>
>> In the following table, 'original' is the original driver before any
>> changes and 'retry loop' is the version using the first workaround of
>> simply running the busy-poll wait in a 10x loop. The other columns
>> use the backoff patch with the given Wmin/Wmax values. Note that the
>> times are bucketed to 10us up to 500us and then in 500us lumps
>> thereafter. The value listed is the lower limit, i.e. there were no
>> times of <10us measured. Each case was run for 1000 samples.
>>
> The settings below, as in current nightly, should suit this workload
> and, as you have found, will likely complete most waits in <150us.
> If many samples had been beyond 160us but below 300us, we might have
> needed to change Wmin to maybe 15 or 20 so that the exponential rise
> caps out around 300us.
>
> wait_for_us(10, 10)
> wait_for()
>
> #define wait_for _wait_for(10, 1000)
>
But as shown in the table, a setting of 10/10 does not work well for
this workload. The best possible result is a large spike of waits in
the 120-130us bucket with a small tail out to 150us. Whereas the 10/10
setting produces a spike from 150-170us with the tail extending to
240us and an appreciable number of samples stretching all the way out
to the 1-10ms range. A regular delay of multiple milliseconds is not
acceptable when this path is supposed to be a low-latency pre-emption
to switch to some super-high-priority, time-critical task. And as
noted, I did try a bunch of different settings for Wmax, but nothing
seemed to make much of a difference. E.g. 10/10 vs 10/1000 produced
pretty much identical results, hence it didn't seem worth including
those in the table.
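To put rough numbers on that, here is a toy user-space illustration
(my own sketch, not from the patch; show_schedule() is a made-up helper
and it ignores the per-wake-up scheduling overhead, which only adds to
the Wmin=10 cases): for an event that typically completes around 130us,
as per the 120-130us bucket where the good configurations spike, a Wmin
of 10 needs several wake-ups whether Wmax is 10 or 1000, while a Wmin
of 140 covers it in a single sleep.

#include <stdio.h>

/* Print how many sleeps a Wmin/Wmax backoff schedule needs before the
 * total sleep time reaches a hypothetical event completing at event_us. */
static void show_schedule(unsigned int wmin, unsigned int wmax,
			  unsigned int event_us)
{
	unsigned int wait = wmin, slept = 0, wakeups = 0;

	while (slept < event_us) {
		slept += wait;
		wakeups++;
		if (wait < wmax)
			wait <<= 1;	/* exponential growth, capped at Wmax */
	}
	printf("Wmin=%u Wmax=%u: %u wake-ups, ~%uus slept\n",
	       wmin, wmax, wakeups, slept);
}

int main(void)
{
	show_schedule(10, 10, 130);	/* 13 sleeps of 10us -> seen at ~130us plus overhead */
	show_schedule(10, 1000, 130);	/* 10+20+40+80us sleeps -> seen no earlier than 150us */
	show_schedule(140, 10, 130);	/* a single 140us sleep -> seen at ~140us */
	return 0;
}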
>> Time Original 10/10 50/10 100/10 110/10 130/10 140/10 RetryLoop
>> 10us: 2 2 2 2 2 2 2 2
>> 30us: 1 1 1 1 1
>> 50us: 1
>> 70us: 14 63 56 64 63 61
>> 80us: 8 41 52 44 46 41
>> 90us: 6 24 10 28 12 17
>> 100us: 2 4 20 16 17 17 22
>> 110us: 13 21 14 13 11
>> 120us: 6 366 633 636 660 650
>> 130us: 2 2 46 125 95 86 95
>> 140us: 3 2 16 18 32 46 48
>> 150us: 210 3 12 13 37 32 31
>> 160us: 322 1 18 10 14 12 17
>> 170us: 157 4 5 5 3 5 2
>> 180us: 62 11 3 1 2 1 1
>> 190us: 32 212 1 1 2
>> 200us: 27 266 1 1
>> 210us: 16 181 1
>> 220us: 16 51 1
>> 230us: 10 43 4
>> 240us: 12 22 62 1
>> 250us: 4 12 112 3
>> 260us: 3 13 73 8
>> 270us: 5 12 12 8 2
>> 280us: 4 7 12 5 1
>> 290us: 9 4
>> 300us: 1 3 9 1 1
>> 310us: 2 3 5 1 1
>> 320us: 1 4 2 3
>> 330us: 1 5 1
>> 340us: 1 2 1
>> 350us: 2 1
>> 360us: 2 1
>> 370us: 2 2
>> 380us: 1
>> 390us: 2 1 2 1
>> 410us: 1
>> 420us: 3
>> 430us: 2 2 1
>> 440us: 2 1
>> 450us: 4
>> 460us: 3 1
>> 470us: 3 1
>> 480us: 2 2
>> 490us: 1
>> 500us: 19 13 17
>> 1000us: 249 22 30 11
>> 1500us: 393 4 4 2 1
>> 2000us: 132 7 8 8 2 1 1
>> 2500us: 63 4 4 6 1 1 1
>> 3000us: 59 9 7 6 1
>> 3500us: 34 2 1 1
>> 4000us: 17 9 4 1
>> 4500us: 8 2 1 1
>> 5000us: 7 1 2
>> 5500us: 7 2 1
>> 6000us: 4 2 1 1
>> 6500us: 3 1
>> 7000us: 6 2 1
>> 7500us: 4 1 1
>> 8000us: 5 1
>> 8500us: 1 1
>> 9000us: 2
>> 9500us: 2 1
>> >10000us: 3 1
>>
>>
>> John.
>>
>>
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>