[Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on intervals
Lupescu, Grigore
grigore.lupescu at intel.com
Wed May 4 07:05:46 UTC 2016
> I think this may lead to optimizing for a special input range.
I agree - my idea with the benchmark was just to look at how fast the function is on a given interval.
I've looked at one function - say sine - and saw that there are 3 paths the code may take depending on the input value. To properly evaluate the performance of each path I would do x = sin(x + a), then x = x * 0x1p-16, where a is the lower bound of the interval and x stays close to 0. So at the end I would know that the performance is T on the (a, b) interval, 2T on the (b, c) interval, etc.
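In kernel form the idea looks roughly like this (a sketch only - the kernel name and the interval_min parameter are mine; the actual bench_math_sin kernel quoted further down hard-codes the interval constants):

  /* Sketch: keep sin() on one code path by pinning the argument near
     the lower bound of the interval under test. */
  kernel void bench_sin_interval(global float *src,
                                 global float *dst,
                                 float interval_min, /* 'a', start of the interval */
                                 uint loop)
  {
    float x = src[get_global_id(0)];

    for (; loop > 0; loop--)
    {
      /* x is tiny, so (interval_min + x) always lands in (a, b) */
      x = sin(interval_min + x);
      /* scale back towards 0 so the next iteration does not drift
         into a different reduction path */
      x *= 0x1p-16f;
    }

    dst[get_global_id(0)] = x;
  }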
Now this won't tell me how sine actually performs in practice, since I don't know how often sine is called with values in (a, b) vs (b, c) or elsewhere - but it would tell me, for instance, that the internal_1 path on (a, b) is 6 times faster than internal_2 on (b, c) and 9 times faster than internal_3 on (c, d), etc.
Coming back to your observation: if I run a test (e.g. Luxmark), see that (b, c) is called most of the time, optimize that path and Luxmark doubles in performance (say), but another test, Mandelbulb, which mainly uses sine on (c, d), doesn't see any increase, then of course I have just optimized for one interval, and it would be wrong to base a general sine performance evaluation on Luxmark alone in that case. So I agree with your observation, but I am not doing that - yet, anyway :)
> I don't quite understand what you mean by "reiterating through an interval would not offer real world performance"?
> I don't have any good idea, but from my understanding a large input-value range is OK. Any comments?
Say I want to test sine. As I said above, sine has 3 code paths, one per interval, all internal - nothing is native. There are 2 reduction methods (for normal numbers and for very large numbers) and no reduction on the very small interval. If I iterate over the whole range of numbers (-inf, inf) I actually measure the following:
Overall_performance = Performance_interval_1 * (size of interval 1) + Performance_interval_2 * (size of interval 2) + Performance_interval_3 * (size of interval 3). This gives a general view of the function, I agree. But it doesn't take into account the typical, real world use of sine.
For instance, if interval 3 is the largest and offers the lowest performance (let's assume) but is almost never hit in tests that use sine, then knowing the overall, all-path performance of sine wouldn't help much (people who use sine may well know that argument reduction is a big performance penalty on large numbers).
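Just to make that concrete with made-up numbers (purely illustrative, not measured on Beignet):

  #include <stdio.h>

  int main(void)
  {
    /* hypothetical cost per call of each sin() path, in ns */
    double cost[3]    = { 10.0, 60.0, 90.0 };   /* (a,b), (b,c), (c,d) */
    /* weights when sweeping the whole input range uniformly:
       proportional to interval size, so the huge (c,d) interval dominates */
    double sweep_w[3] = { 0.05, 0.15, 0.80 };
    /* weights seen in a hypothetical real workload that almost never
       feeds sin() very large arguments */
    double real_w[3]  = { 0.60, 0.39, 0.01 };

    double sweep = 0.0, real = 0.0;
    for (int i = 0; i < 3; i++) {
      sweep += cost[i] * sweep_w[i];
      real  += cost[i] * real_w[i];
    }
    printf("uniform-sweep average: %.1f ns/call\n", sweep); /* 81.5 */
    printf("workload average:      %.1f ns/call\n", real);  /* 30.3 */
    return 0;
  }

The two averages differ by almost 3x, which is why the per-interval numbers plus real world tests seem more useful to me than one number over (-inf, inf).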
I believe the only way to evaluate whether a change in math code is relevant is with real world tests. We thus need a diverse set of tests that use most math functions. Ideally one should document what each test uses and in what proportion. I have started doing this but it's taking a lot of time due to the complexity of some tests (e.g. Luxmark).
For instance it would be nice to know that Luxmark is impacted "high" by {log2}, "normal" by {sin, cos, pow, exp} and "low" by {sqrt, tan}. Then if I optimize tan I should not expect Luxmark to change much - but if I target log2 it should. Again, if you consider that some math functions have intervals, the problem is even more difficult, since maybe Luxmark only ever uses sine on the (a, b) interval :)
> I think we should separate the benchmark test from the real implementation.
Yes, agree - these tests are only meant to help improve the current Beignet math implementation. They would also be useful for other implementations if one knows the underlying interval splits.
-------------------------------------------------------------
So I see the following flow of optimization for Beignet - but it may apply to any other math implementation for OpenCL:
1. (done) Measure the performance of each interval for a given function (sin). We would know perf1 on (a, b), perf2 on (b, c), perf3 on (c, d) - the timing itself is sketched after this list.
2. (working) Run several relevant math tests (relevant to sine). Try to identify in what circumstances sin is called. Maybe all tests call it on (a, b) and (b, c). Then we should target (a, b) and (b, c), because that is what is actually used. This assumes the math tests are well chosen and diverse.
3. (working) Optimize intervals (a, b) and (b, c). Observe how much each one improved, since we can measure performance per interval. Re-run the real world math tests.
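For step 1, the measurement itself is just OpenCL event profiling around one kernel launch - roughly like the sketch below (it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and that the kernel and src/dst buffers already exist; this is not the actual benchmark_math.cpp code):

  #include <CL/cl.h>

  /* Time one launch of a bench_math_* kernel, returning GPU time in ns. */
  double time_kernel_ns(cl_command_queue queue, cl_kernel kernel,
                        cl_mem src_buf, cl_mem dst_buf,
                        float pwr, cl_uint loop, size_t global_size)
  {
    cl_event ev;
    cl_ulong start = 0, end = 0;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &src_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst_buf);
    clSetKernelArg(kernel, 2, sizeof(float), &pwr);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &loop);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);

    return (double)(end - start);
  }

Running this once per interval (by switching the kernel define or the added constant) gives the perf1/perf2/perf3 numbers from step 1.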
Any thoughts on this ?
I did some optimizations (calls to native and polynomial reduction) and obtained an increase of at least 5% in about 8-10 of the math tests provided by Mengmeng. It's quite difficult to target the general case for all math functions, but I think these changes are relevant to some extent.
-----Original Message-----
From: Song, Ruiling
Sent: Tuesday, May 3, 2016 2:07 PM
To: Lupescu, Grigore <grigore.lupescu at intel.com>; beignet at lists.freedesktop.org
Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on intervals
> -----Original Message-----
> From: Lupescu, Grigore
> Sent: Monday, May 2, 2016 12:32 PM
> To: Song, Ruiling <ruiling.song at intel.com>;
> beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
> Regarding the first question - for the math functions I made the
> benchmarks to evaluate the performance gaps between native and the different internal paths,
> so I would understand where I should maybe focus optimization.
I think this may lead to optimizing for a special input range.
But optimizing for a special input range may be harmful unless the input NORMALLY lies in that range on the GPU.
If the input data is in a different range, the runtime instruction count will increase.
I think we should try to optimize for a wider input range and minimize if-else checks.
>
> I never meant to make a general all-purpose benchmark for any driver -
> I find that quite difficult since I don't think just reiterating
> through an interval would offer real world performance. If you have
> any ideas here though, that would be great :)
I don't quite understand what you mean by "reiterating through an interval would not offer real world performance"?
I think a benchmark using a large input-value range is enough when doing a comparison with the native_ versions or with another OpenCL implementation.
I don't have any good idea, but from my understanding a large input-value range is OK. Any comments?
>
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, May 2, 2016 5:10 AM
> To: Lupescu, Grigore <grigore.lupescu at intel.com>;
> beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
>
>
> > -----Original Message-----
> > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On
> > Behalf Of Grigore Lupescu
> > Sent: Monday, May 2, 2016 3:04 AM
> > To: beignet at lists.freedesktop.org
> > Subject: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance
> > on intervals
> >
> > From: Grigore Lupescu <grigore.lupescu at intel.com>
> >
> > Functions to benchmark math functions on intervals.
> > Tests: sin, cos, exp2, exp, exp10, log2, log, log10
> >
> > Signed-off-by: Grigore Lupescu <grigore.lupescu at intel.com>
> > ---
> > benchmark/CMakeLists.txt | 3 +-
> > benchmark/benchmark_math.cpp | 126 ++++++++++++++++++++
> > kernels/bench_math.cl | 272 +++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 400 insertions(+), 1 deletion(-)
> > create mode 100644 benchmark/benchmark_math.cpp
> > create mode 100644 kernels/bench_math.cl
> >
> > diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
> > index dd33829..4c3c933 100644
> > --- a/benchmark/CMakeLists.txt
> > +++ b/benchmark/CMakeLists.txt
> > @@ -18,7 +18,8 @@ set (benchmark_sources
> > benchmark_copy_buffer_to_image.cpp
> > benchmark_copy_image_to_buffer.cpp
> > benchmark_copy_buffer.cpp
> > - benchmark_copy_image.cpp)
> > + benchmark_copy_image.cpp
> > + benchmark_math.cpp)
> >
> > +/* calls internal fast (native) if (x > -0x1.6p1 && x < 0x1.6p1) */
> > +kernel void bench_math_exp(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_exp(-0x1.6p1 - result); /* calls native */
> > +#elif defined(BENCHMARK_INTERNAL_FAST)
> > + result = exp(-0x1.6p1 + result); /* calls internal fast */
> > +#else
> > + result = exp(-0x1.6p1 - result); /* calls internal slow */
> > +#endif
>
> I think we should separate the benchmark test from the real implementation.
> Then we can make an easy comparison with other driver implementations, and
> also the implementation in Beignet may change in the future.
> What's your idea on this?
>
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +
>
> > +/* benchmark sin performance */
> > +kernel void bench_math_sin(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_sin(result); /* calls native */
> > +#else
> > + result = sin(result); /* calls internal, random complexity */
>
> What's the range of 'result'? It seems very small. I think we need to
> make sure the input argument to sin() covers a large range,
> as we need to try to optimize for the general case.
>
> Thanks!
> Ruiling
> > + //result = sin(0.1f + result); /* calls internal, (1) no reduction */
> > + //result = sin(2.f + result); /* calls internal, (2) fast reduction */
> > + //result = sin(4001 + result); /* calls internal, (3) slow reduction */
> > + result *= 0x1p-16;
> > +#endif
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +