[Intel-gfx] [PATCH v3] drm/i915: Use exponential backoff for wait_for()
John Harrison
John.C.Harrison at Intel.com
Mon Dec 4 21:51:54 UTC 2017
On 11/29/2017 11:55 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 12:45 PM, John Harrison wrote:
>> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>>
>>>>>> (SKL)
>>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>>
>>>>>> drm-tip:
>>>>>>      usecs              : count    distribution
>>>>>>          0 -> 1         : 0        |                                        |
>>>>>>          2 -> 3         : 0        |                                        |
>>>>>>          4 -> 7         : 0        |                                        |
>>>>>>          8 -> 15        : 44       |                                        |
>>>>>>         16 -> 31        : 1088     |                                        |
>>>>>>         32 -> 63        : 832      |                                        |
>>>>>>         64 -> 127       : 0        |                                        |
>>>>>>        128 -> 255       : 0        |                                        |
>>>>>>        256 -> 511       : 12       |                                        |
>>>>>>        512 -> 1023      : 0        |                                        |
>>>>>>       1024 -> 2047      : 29899    |*********                               |
>>>>>>       2048 -> 4095      : 131033   |****************************************|
>>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>>> spin/sleep, in which case overall metrics are best
>>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>>> have the 10us sleep.
>>>>>
>>>>> We would also need a typical workload to run, it's profile-guided
>>>>> optimisation after all. Hmm.
>>>>> -Chris
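
For context, the spin-then-sleep tuning Chris describes amounts to a
two-phase wait of roughly the following shape. This is only an
illustrative sketch: two_phase_wait and its SPIN_US knob are invented
names, while _wait_for_atomic() and wait_for() are the existing i915
helpers being tuned.

    /* Illustrative sketch: busy-spin briefly for the typical fast case,
     * then fall back to the sleeping, backed-off wait for the slow tail.
     * SPIN_US is the spin window discussed above (10us today; perhaps
     * 5us or 2us). */
    #define two_phase_wait(COND, SPIN_US, TIMEOUT_MS) ({ \
            int ret__ = _wait_for_atomic((COND), (SPIN_US), 0); \
            if (ret__) \
                    ret__ = wait_for((COND), (TIMEOUT_MS)); \
            ret__; \
    })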
>>>>
>>>> It took me a while to get back to this, but I've now had a chance to
>>>> run with this exponential backoff scheme on the original system
>>>> that showed the problem. It was a slightly messy back port due to
>>>> the customer tree being much older than current nightly. I'm pretty
>>>> sure I got it correct though. However, I'm not sure what the
>>>> recommendation is for the two timeout values. Using the default of
>>>> '10, 10' in the patch, I still get lots of very long delays.
>>> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us
>>> and Wmin=10, Wmax=1000 for wait_for.
>>>
>>> Exponential backoff inside wait_for is most helpful when the
>>> wait_for_us preceding it only covers a short window.
>>> Setting Wmax less than Wmin effectively changes the backoff
>>> strategy to just linear waits of Wmin.
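
For readers following along, the macro being tuned looks like this in
the patch (reproduced from memory, so treat intel_drv.h in drm-tip as
authoritative); the sleep starts at Wmin and doubles every iteration
until it is capped at Wmax:

    #define _wait_for(COND, US, Wmin, Wmax) ({ \
            unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
            long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
            int ret__; \
            might_sleep(); \
            for (;;) { \
                    bool expired__ = time_after(jiffies, timeout__); \
                    if (COND) { \
                            ret__ = 0; \
                            break; \
                    } \
                    if (expired__) { \
                            ret__ = -ETIMEDOUT; \
                            break; \
                    } \
                    usleep_range(wait__, wait__ * 2); \
                    if (wait__ < (Wmax)) \
                            wait__ <<= 1; \
            } \
            ret__; \
    })

    /* The wrappers referred to above: wait_for() is Wmin=10/Wmax=1000,
     * and for US > 10 wait_for_us() maps to _wait_for((COND), (US), 10, 10). */
    #define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)

Note that with Wmax <= Wmin the "wait__ <<= 1" never fires, which is why
Sagar describes that case as degenerating to linear waits of Wmin.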
>>>> I have to up the Wmin value to at least 140 to get a stall free
>>>> result. Which is plausible given that the big spike in the results
>>>> of any fast version is at 110-150us. Also of note is that a Wmin
>>>> between 10 and 110 actually makes things worse. Changing Wmax has
>>>> no effect.
>>>>
>>>> In the following table, 'original' is the original driver before
>>>> any changes and 'retry loop' is the version using the first
>>>> workaround of just running the busy poll wait in a 10x loop. The
>>>> other columns are using the backoff patch with the given Wmin/Wmax
>>>> values. Note that the times are bucketed to 10us up to 500us and
>>>> then in 500us lumps thereafter. The value listed is the lower
>>>> limit, i.e. there were no times of <10us measured. Each case was
>>>> run for 1000 samples.
>>>>
>>> The setting below, as in current nightly, should suit this workload
>>> and, as you have found, will likely complete most waits in <150us.
>>> If many samples had fallen beyond 160us but under 300us, we might
>>> have needed to raise Wmin to perhaps 15 or 20 to make sure the
>>> exponential rise caps out around 300us.
>>>
>>> wait_for_us(10, 10)
>>> wait_for()
>>>
>>> #define wait_for _wait_for(10, 1000)
>>>
>> But as shown in the table, a setting of 10/10 does not work well for
>> this workload. The best results possible are a large spike of waits
>> in the 120-130us bucket with a small tail out to 150us. Whereas, the
>> 10/10 setting produces a spike from 150-170us with the tail extending
>> to 240us and an appreciable number of samples stretching all the way
>> out to the 1-10ms range. A regular delay of multiple milliseconds is
>> not acceptable when this path is supposed to be a low latency
>> pre-emption to switch to some super high priority time critical task.
>> And as noted, I did try a bunch of different settings for Wmax but
>> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000
>> produced pretty much identical results. Hence it didn't seem worth
>> including those in the table.
>>
> Wmin = 10us only reaches a total delay of 150us after four sleeps
> (10 + 20 + 40 + 80), which might be too tight to catch most conditions.
> Wmin = 25us reaches a total delay of 175us in three sleeps (25 + 50 + 100).
>
> Since most conditions are likely to complete around 140us-160us, it
> looks like a Wmin of 25 to 30 (i.e. 25,1000 or 30,1000) will suit this
> workload, but since this is profile-guided optimisation I am wondering
> where the optimal Wmin point lies.
>
> This wait is very time critical, so an exponential rise might not be a
> good strategy once the individual waits get long. usleep_range might
> also be adding extra latency.
>
> Maybe we should do this exponential backoff only for waits with US >=
> 1000, and do periodic backoff with a period of 50us for US < 1000?
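
To make the arithmetic above concrete, the cumulative schedule for the
two Wmin values can be checked with a few lines of userspace C. This
simply mirrors the doubling in the patch and ignores the extra slack
usleep_range() is permitted:

    #include <stdio.h>

    int main(void)
    {
            const long wmins[] = { 10, 25 };
            int i, loop;

            for (i = 0; i < 2; i++) {
                    long wait = wmins[i];   /* current sleep, doubles per loop */
                    long total = 0;         /* cumulative lower bound on delay */

                    printf("Wmin=%ld:", wmins[i]);
                    for (loop = 1; loop <= 4; loop++) {
                            total += wait;
                            printf(" loop%d=%ldus", loop, total);
                            wait <<= 1;     /* upstream caps this at Wmax */
                    }
                    printf("\n");
            }
            return 0;
    }

This prints 10/30/70/150us for Wmin=10, i.e. four sleeps to cover 150us,
and 25/75/175/375us for Wmin=25, i.e. 175us after only three.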
>
The results I am seeing do not correspond. First off, it seems I get
different results depending upon the context. That is, in the context of
the pre-emption GuC send action command, I get the results previously
posted. If I just run usleep_range(x, y) in a loop 1000 times from the
context of dumping a debugfs file, I get something very different.
Basically, the minimum sleep time is 110-120us irrespective of the
values of X and Y. Pushing X and Y beyond 120 seems to make it complete
in Y+10-20us, e.g. usleep_range(100, 200) completes in 210-230us for 80%
of samples. On the other hand, I get nowhere near as many samples in the
millisecond range as when called from the send action code path.
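
For reference, the debugfs-context measurement described above amounts
to something like the following kernel-side sketch. measure_usleep() is
a hypothetical helper; ktime_get(), ktime_us_delta() and usleep_range()
are the standard kernel APIs:

    #include <linux/delay.h>
    #include <linux/ktime.h>
    #include <linux/printk.h>

    /* Hypothetical helper: time 1000 calls of usleep_range(min_us, max_us)
     * and log each observed latency for bucketing in userspace. */
    static void measure_usleep(unsigned int min_us, unsigned int max_us)
    {
            int i;

            for (i = 0; i < 1000; i++) {
                    ktime_t t0 = ktime_get();

                    usleep_range(min_us, max_us);
                    pr_info("usleep_range(%u, %u) took %lldus\n",
                            min_us, max_us, ktime_us_delta(ktime_get(), t0));
            }
    }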
However, it sounds like the underlying issue might actually be a
back-port merge problem. The customer tree in question is actually a
combination of a 4.1 base kernel with a 4.11 DRM dropped on top. As
noted in a separate thread, this tree also has a problem with the
mutex_lock() call stalling even when the mutex is very definitely not
acquired (using mutex_trylock() instead eliminates the stall
completely). Apparently the back-port process did hit a bunch of
conflicts in the base kernel's scheduling code, so there is a strong
possibility that all the issues we are seeing in that tree are an
artifact of a merge issue. I therefore think it is probably safe to
ignore the results I am seeing in terms of what the best upstream
solution should be.
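
For completeness, the mutex_trylock() workaround mentioned above is
presumably of the following shape. This is illustrative only; the thread
does not name the lock involved, so 'lock' is a stand-in:

    #include <linux/mutex.h>
    #include <asm/processor.h>

    /* Diagnostic only: spin on trylock instead of blocking in
     * mutex_lock(), to confirm the stall comes from the backported mutex
     * slowpath rather than from real contention. 'lock' stands in for
     * whichever mutex the send-action path takes. */
    static void lock_without_stalling(struct mutex *lock)
    {
            while (!mutex_trylock(lock))
                    cpu_relax();
    }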
>>>> Time       Original  10/10  50/10  100/10  110/10  130/10  140/10  RetryLoop
>>>> 10us:      2  2  2  2  2  2  2  2
>>>> 30us:      1  1  1  1  1
>>>> 50us:      1
>>>> 70us:      14  63  56  64  63  61
>>>> 80us:      8  41  52  44  46  41
>>>> 90us:      6  24  10  28  12  17
>>>> 100us:     2  4  20  16  17  17  22
>>>> 110us:     13  21  14  13  11
>>>> 120us:     6  366  633  636  660  650
>>>> 130us:     2  2  46  125  95  86  95
>>>> 140us:     3  2  16  18  32  46  48
>>>> 150us:     210  3  12  13  37  32  31
>>>> 160us:     322  1  18  10  14  12  17
>>>> 170us:     157  4  5  5  3  5  2
>>>> 180us:     62  11  3  1  2  1  1
>>>> 190us:     32  212  1  1  2
>>>> 200us:     27  266  1  1
>>>> 210us:     16  181  1
>>>> 220us:     16  51  1
>>>> 230us:     10  43  4
>>>> 240us:     12  22  62  1
>>>> 250us:     4  12  112  3
>>>> 260us:     3  13  73  8
>>>> 270us:     5  12  12  8  2
>>>> 280us:     4  7  12  5  1
>>>> 290us:     9  4
>>>> 300us:     1  3  9  1  1
>>>> 310us:     2  3  5  1  1
>>>> 320us:     1  4  2  3
>>>> 330us:     1  5  1
>>>> 340us:     1  2  1
>>>> 350us:     2  1
>>>> 360us:     2  1
>>>> 370us:     2  2
>>>> 380us:     1
>>>> 390us:     2  1  2  1
>>>> 410us:     1
>>>> 420us:     3
>>>> 430us:     2  2  1
>>>> 440us:     2  1
>>>> 450us:     4
>>>> 460us:     3  1
>>>> 470us:     3  1
>>>> 480us:     2  2
>>>> 490us:     1
>>>> 500us:     19  13  17
>>>> 1000us:    249  22  30  11
>>>> 1500us:    393  4  4  2  1
>>>> 2000us:    132  7  8  8  2  1  1
>>>> 2500us:    63  4  4  6  1  1  1
>>>> 3000us:    59  9  7  6  1
>>>> 3500us:    34  2  1  1
>>>> 4000us:    17  9  4  1
>>>> 4500us:    8  2  1  1
>>>> 5000us:    7  1  2
>>>> 5500us:    7  2  1
>>>> 6000us:    4  2  1  1
>>>> 6500us:    3  1
>>>> 7000us:    6  2  1
>>>> 7500us:    4  1  1
>>>> 8000us:    5  1
>>>> 8500us:    1  1
>>>> 9000us:    2
>>>> 9500us:    2  1
>>>> >10000us:  3  1
>>>>
>>>>
>>>> John.