[Intel-gfx] [PATCH v3] drm/i915: Use exponential backoff for wait_for()

John Harrison John.C.Harrison at Intel.com
Mon Dec 4 21:51:54 UTC 2017


On 11/29/2017 11:55 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 12:45 PM, John Harrison wrote:
>> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>>
>>>>>> (SKL)
>>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>>
>>>>>> drm-tip:
>>>>>>       usecs               : count     distribution
>>>>>>           0 -> 1          : 0        |                                        |
>>>>>>           2 -> 3          : 0        |                                        |
>>>>>>           4 -> 7          : 0        |                                        |
>>>>>>           8 -> 15         : 44       |                                        |
>>>>>>          16 -> 31         : 1088     |                                        |
>>>>>>          32 -> 63         : 832      |                                        |
>>>>>>          64 -> 127        : 0        |                                        |
>>>>>>         128 -> 255        : 0        |                                        |
>>>>>>         256 -> 511        : 12       |                                        |
>>>>>>         512 -> 1023       : 0        |                                        |
>>>>>>        1024 -> 2047       : 29899    |*********                               |
>>>>>>        2048 -> 4095       : 131033   |****************************************|
>>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>>> spin/sleep, in which case overall metrics are best
>>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>>> have the 10us sleep.
>>>>>
>>>>> We would also need a typical workload to run, it's profile-guided
>>>>> optimisation after all. Hmm.
>>>>> -Chris
>>>>
>>>> It took me a while to get back to this, but I've now had a chance to 
>>>> run with this exponential backoff scheme on the original system 
>>>> that showed the problem. It was a slightly messy backport due to 
>>>> the customer tree being much older than current nightly, but I'm pretty 
>>>> sure I got it correct. However, I'm not sure what the 
>>>> recommendation is for the two timeout values. Using the default of 
>>>> '10, 10' in the patch, I still get lots of very long delays. 
>>> The currently recommended settings are Wmin=10, Wmax=10 for wait_for_us 
>>> and Wmin=10, Wmax=1000 for wait_for.
>>>
>>> Exponential backoff is more helpful inside wait_for when the wait_for_us 
>>> that precedes it is smaller.
>>> Setting Wmax less than Wmin effectively changes the backoff 
>>> strategy to just linear waits of Wmin.
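
(For reference, the backoff loop being tuned here is roughly of the shape
below. This is only a sketch matching the Wmin/Wmax behaviour described
above, not a verbatim copy of the patch: sleep Wmin us on the first
iteration, then double the sleep each time round until it is capped at
Wmax us.)

/*
 * Sketch of the exponential backoff wait. If Wmax <= Wmin the sleep
 * never grows, i.e. the loop degenerates to linear waits of Wmin.
 */
#define _wait_for(COND, US, Wmin, Wmax) ({ \
	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
	long wait__ = (Wmin);						\
	int ret__ = -ETIMEDOUT;						\
	might_sleep();							\
	for (;;) {							\
		if (COND) {						\
			ret__ = 0;					\
			break;						\
		}							\
		if (time_after(jiffies, timeout__))			\
			break;						\
		usleep_range(wait__, wait__ * 2);			\
		if (wait__ < (Wmax))					\
			wait__ <<= 1;					\
	}								\
	ret__;							\
})

(With Wmin=10, Wmax=1000 the nominal sleeps go 10, 20, 40, ... us until
capped at 1000us; usleep_range() may of course sleep longer than asked.)
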
>>>> I have to up the Wmin value to at least 140 to get a stall-free 
>>>> result, which is plausible given that the big spike in the results 
>>>> of any fast version is at 110-150us. Also of note is that a Wmin 
>>>> between 10 and 110 actually makes things worse. Changing Wmax has 
>>>> no effect.
>>>>
>>>> In the following table, 'original' is the original driver before 
>>>> any changes and 'retry loop' is the version using the first 
>>>> workaround of just running the busy poll wait in a 10x loop. The 
>>>> other columns are using the backoff patch with the given Wmin/Wmax 
>>>> values. Note that the times are bucketed to 10us up to 500us and 
>>>> then in 500us lumps thereafter. The value listed is the lower 
>>>> limit, i.e. there were no times of <10us measured. Each case was 
>>>> run for 1000 samples.
>>>>
>>> The setting below, as in current nightly, will suit this workload and, 
>>> as you have found, will also likely complete most waits in <150us.
>>> If many samples had been beyond 160us and less than 300us, we might 
>>> have needed to change Wmin to maybe 15 or 20 to ensure the
>>> exponential rise caps around 300us.
>>>
>>> wait_for_us(10, 10)
>>> wait_for()
>>>
>>> #define wait_for _wait_for(10, 1000)
>>>
>> But as shown in the table, a setting of 10/10 does not work well for 
>> this workload. The best results possible are a large spike of waits 
>> in the 120-130us bucket with a small tail out to 150us. Whereas the 
>> 10/10 setting produces a spike from 150-170us with the tail extending 
>> to 240us and an appreciable number of samples stretching all the way 
>> out to the 1-10ms range. A regular delay of multiple milliseconds is 
>> not acceptable when this path is supposed to be a low-latency 
>> pre-emption to switch to some super high priority, time-critical task. 
>> And as noted, I did try a bunch of different settings for Wmax, but 
>> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000 
>> produced pretty much identical results, hence it didn't seem worth 
>> including those in the table.
>>
> Wmin = 10us leads us to a total delay of 150us in 3 loops (this might be 
> tight for catching most conditions).
> Wmin = 25us can lead us to a total delay of 175us in 3 loops.
>
> Since most conditions are likely to complete around 140us-160us, it looks 
> like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this workload, but
> since this is profile-guided optimisation I am wondering about the 
> optimal Wmin point.
>
> This wait is very time critical. An exponential rise might not be a 
> good strategy at higher wait times.
> usleep_range might also be adding extra latency.
>
> Maybe we should do this exponential backoff only for waits having US >= 
> 1000 and do periodic backoff for US < 1000 with a period of 50us?
>
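
(The suggestion above would look roughly like this. It is purely an
illustration of the idea, not existing code, and the helper name is made
up.)

/* Pick the next sleep period for a poll loop. */
static inline long next_wait_us(long wait_us, long total_timeout_us,
				long wmax_us)
{
	/* Short total timeouts: fixed 50us poll period. */
	if (total_timeout_us < 1000)
		return 50;
	/* Long timeouts: keep the exponential growth, capped at Wmax. */
	return wait_us < wmax_us ? wait_us << 1 : wmax_us;
}
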

The results I am seeing do not correspond. First off, it seems I get 
different results depending upon the context. That is, in the context of 
the pre-emption GuC send action command I get the results previously 
posted. If I just run usleep_range(x, y) in a loop 1000 times from the 
context of dumping a debugfs file, I get something very different. 
Basically, the minimum sleep time is 110-120us irrespective of the 
values of X and Y. Pushing X and Y beyond 120 seems to make it complete 
in Y+10-20us. E.g. usleep_range(100, 200) completes in 210-230us for 80% of 
samples. On the other hand, I don't get anywhere near as many samples in 
the millisecond range as when called in the send action code path.
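
For reference, the debugfs measurement was essentially a loop of the
following shape (illustrative only; the function name and the pr_info
reporting are not the exact code used for the numbers above):

#include <linux/delay.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Time how long usleep_range() actually sleeps, 1000 samples. */
static void sample_usleep_range(unsigned long min_us, unsigned long max_us)
{
	int i;

	for (i = 0; i < 1000; i++) {
		ktime_t start = ktime_get_raw();

		usleep_range(min_us, max_us);
		pr_info("usleep_range(%lu, %lu) took %lldus\n",
			min_us, max_us,
			ktime_us_delta(ktime_get_raw(), start));
	}
}
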

However, it sounds like the underlying issue might actually be a 
back-port merge problem. The customer tree in question is actually a 
combination of a 4.1 base kernel with a 4.11 DRM dropped on top. As 
noted in a separate thread, this tree also has a problem with the 
mutex_lock() call stalling even when the mutex is very definitely not 
acquired (using mutex_trylock() eliminates the stall completely). 
Apparently the back port process did hit a bunch of conflicts in the 
base kernel's scheduling code. So there is a strong possibility that all 
the issues we are seeing in that tree are an artifact of a merge issue.
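
(The trylock workaround mentioned above was essentially of this shape; the
lock name is hypothetical and this is not the actual driver change:)

	/*
	 * Avoid the unexplained blocking seen with mutex_lock() in the
	 * backported tree by only taking the lock opportunistically.
	 */
	if (!mutex_trylock(&dev_priv->example_mutex))
		return -EBUSY;	/* genuinely contended: don't block */
	/* ... critical section ... */
	mutex_unlock(&dev_priv->example_mutex);
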

So I think it is probably safe to ignore the results I am seeing in 
terms of what the best upstream solution should be.



>>>>     Time        Original    10/10     50/10    100/10    110/10    130/10    140/10  RetryLoop
>>>>     10us:          2         2         2         2         2         2         2         2
>>>>     30us:                              1         1         1         1         1
>>>>     50us:                              1
>>>>     70us:                             14        63        56        64        63        61
>>>>     80us:                              8        41        52        44        46        41
>>>>     90us:                              6        24        10        28        12        17
>>>>    100us:                    2         4        20        16        17        17        22
>>>>    110us: 13        21        14        13        11
>>>>    120us:                              6       366       633       636       660       650
>>>>    130us:                    2         2        46       125        95        86        95
>>>>    140us:                    3         2        16        18        32        46        48
>>>>    150us:                  210         3        12        13        37        32        31
>>>>    160us:                  322         1        18        10        14        12        17
>>>>    170us:                  157         4         5         5         3         5         2
>>>>    180us:                   62        11         3         1         2         1         1
>>>>    190us:                   32       212         1                   1         2
>>>>    200us:                   27       266 1                   1
>>>>    210us:                   16       181                                                 1
>>>>    220us:                   16        51                                       1
>>>>    230us:                   10        43         4
>>>>    240us:                   12        22 62         1
>>>>    250us:                    4        12 112         3
>>>>    260us:                    3        13 73         8
>>>>    270us:                    5        12 12         8         2
>>>>    280us:                    4         7 12         5         1
>>>>    290us:                              9         4
>>>>    300us:                    1         3 9         1         1
>>>>    310us:                    2         3 5         1         1
>>>>    320us:                    1         4 2         3
>>>>    330us:                    1         5         1
>>>>    340us:                    1 2                   1
>>>>    350us:                              2         1
>>>>    360us:                              2         1
>>>>    370us:                    2                   2
>>>>    380us:                                        1
>>>>    390us:                    2         1 2         1
>>>>    410us:                    1
>>>>    420us:                    3
>>>>    430us:                    2         2         1
>>>>    440us:                    2         1
>>>>    450us:                              4
>>>>    460us:                    3         1
>>>>    470us:                              3         1
>>>>    480us:                    2                   2
>>>>    490us:                                        1
>>>>    500us:                   19        13        17
>>>>   1000us:        249        22        30        11
>>>>   1500us:        393         4         4 2         1
>>>>   2000us:        132         7         8         8         2         1                   1
>>>>   2500us:         63         4         4         6         1         1         1
>>>>   3000us:         59         9         7 6         1
>>>>   3500us:         34         2 1                             1
>>>>   4000us:         17         9         4         1
>>>>   4500us:          8         2         1         1
>>>>   5000us:          7         1         2
>>>>   5500us:          7         2                   1
>>>>   6000us:          4         2         1         1
>>>>   6500us:          3                             1
>>>>   7000us:          6         2                   1
>>>>   7500us:          4 1                             1
>>>>   8000us:          5                             1
>>>>   8500us:                    1         1
>>>>   9000us:          2
>>>>   9500us:          2         1
>>>> >10000us:          3                             1
>>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>
>>>
>>
>
