[Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on intervals

Tue May 3 11:06:55 UTC 2016


> -----Original Message-----
> From: Lupescu, Grigore
> Sent: Monday, May 2, 2016 12:32 PM
> To: Song, Ruiling <ruiling.song at intel.com>; beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on
> intervals
> 
> Regarding the first question - For math functions I made the benchmarks to
> evaluate the performance gaps between native and the different internal paths,
> so that I would understand where I should maybe focus on optimization.
I think this may lead us to optimize for a special input range.
But optimizing for a special input range may be harmful unless the input NORMALLY lies in that range on the GPU.
If the input data is in a different range, the runtime instruction count will increase.
I think we should try to optimize for a wider input range and minimize if-else checks.
> 
> I never meant to make a general all purpose benchmark for any driver - I find
> that quite difficult since I don't think just reiterating through an interval would
> offer real world performance. If you have any ideas here though, would be
> great :)
I don't quite understand what you mean by "reiterating through an interval would not offer real world performance".
I think a benchmark using a large input-value range is enough for comparing with the native version or with other OpenCL implementations.
I don't have any good ideas, but from my understanding, a large input-value range is OK. Any comments?
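
To make the "large input-value range" idea a bit more concrete, here is a rough host-side sketch in C of how the source buffer for such a benchmark could be filled. It is only an illustration under my own assumptions: the names fill_wide_range, lcg_next and pow2i, the seed, and the exponent bounds are all made up here and are not part of the patch or of Beignet.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: small linear congruential generator, so the
 * benchmark input is reproducible from run to run. */
static uint32_t lcg_next(uint32_t *state)
{
  *state = *state * 1664525u + 1013904223u;
  return *state;
}

/* Hypothetical helper: 2^k for a small integer k, without needing libm. */
static float pow2i(int k)
{
  float r = 1.0f;
  for (; k > 0; k--) r *= 2.0f;
  for (; k < 0; k++) r *= 0.5f;
  return r;
}

/* Fill src[0..n) with values whose magnitudes are spread over the binary
 * exponents min_exp..max_exp, with mixed signs. Magnitudes fall in
 * [2^min_exp, 2^(max_exp + 1)). Assumes min_exp <= max_exp and that both
 * stay inside the normal float exponent range. */
void fill_wide_range(float *src, size_t n, int min_exp, int max_exp,
                     uint32_t seed)
{
  uint32_t state = seed;
  uint32_t span = (uint32_t)(max_exp - min_exp + 1);

  for (size_t i = 0; i < n; i++) {
    /* 23 random bits give a mantissa in [1, 2) that is exact in float */
    float mant = 1.0f + (float)(lcg_next(&state) >> 9) / 8388608.0f;
    /* pick one of the allowed exponents */
    int k = min_exp + (int)((lcg_next(&state) >> 16) % span);
    float mag = mant * pow2i(k);
    /* top bit of the next draw decides the sign */
    src[i] = (lcg_next(&state) & 0x80000000u) ? -mag : mag;
  }
}
```

Spreading the exponent rather than the value itself is a deliberate choice in this sketch: a plain uniform fill over a big interval would put almost all inputs at the large end, so the small-argument paths would hardly be exercised.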

> 
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, May 2, 2016 5:10 AM
> To: Lupescu, Grigore <grigore.lupescu at intel.com>;
> beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on
> intervals
> 
> 
> 
> > -----Original Message-----
> > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf
> > Of Grigore Lupescu
> > Sent: Monday, May 2, 2016 3:04 AM
> > To: beignet at lists.freedesktop.org
> > Subject: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on
> > intervals
> >
> > From: Grigore Lupescu <grigore.lupescu at intel.com>
> >
> > Functions to benchmark math functions on intervals.
> > Tests: sin, cos, exp2, exp, exp10, log2, log, log10
> >
> > Signed-off-by: Grigore Lupescu <grigore.lupescu at intel.com>
> > ---
> >  benchmark/CMakeLists.txt     |   3 +-
> >  benchmark/benchmark_math.cpp | 126 ++++++++++++++++++++
> >  kernels/bench_math.cl        | 272 +++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 400 insertions(+), 1 deletion(-)
> >  create mode 100644 benchmark/benchmark_math.cpp
> >  create mode 100644 kernels/bench_math.cl
> >
> > diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
> > index dd33829..4c3c933 100644
> > --- a/benchmark/CMakeLists.txt
> > +++ b/benchmark/CMakeLists.txt
> > @@ -18,7 +18,8 @@ set (benchmark_sources
> >    benchmark_copy_buffer_to_image.cpp
> >    benchmark_copy_image_to_buffer.cpp
> >    benchmark_copy_buffer.cpp
> > -  benchmark_copy_image.cpp)
> > +  benchmark_copy_image.cpp
> > +  benchmark_math.cpp)
> >
> > +/* calls internal fast (native) if (x > -0x1.6p1 && x < 0x1.6p1) */
> > +kernel void bench_math_exp(
> > +  global float *src,
> > +  global float *dst,
> > +  float pwr,
> > +  uint loop)
> > +{
> > +  float result = src[get_global_id(0)];
> > +
> > +  for(; loop > 0; loop--)
> > +  {
> > +#if defined(BENCHMARK_NATIVE)
> > +    result = native_exp(-0x1.6p1 - result); /* calls native */
> > +#elif defined(BENCHMARK_INTERNAL_FAST)
> > +    result = exp(-0x1.6p1 + result); /* calls internal fast */
> > +#else
> > +    result = exp(-0x1.6p1 - result); /* calls internal slow */
> > +#endif
> 
> I think we should separate the benchmark test from the real implementation.
> Then we can easily compare with other driver implementations, and the
> implementation in Beignet may also change in the future.
> What's your idea on this?
> 
> > +  }
> > +
> > +  dst[get_global_id(0)] = result;
> > +}
> > +
> 
> > +/* benchmark sin performance */
> > +kernel void bench_math_sin(
> > +  global float *src,
> > +  global float *dst,
> > +  float pwr,
> > +  uint loop)
> > +{
> > +  float result = src[get_global_id(0)];
> > +
> > +  for(; loop > 0; loop--)
> > +  {
> > +#if defined(BENCHMARK_NATIVE)
> > +    result = native_sin(result); /* calls native */
> > +#else
> > +    result = sin(result);	/* calls internal, random complexity */
> 
> What's the range of 'result'? It seems very small. I think we need to make sure
> the input argument to sin() covers a large range,
> as we need to try to optimize for the general case.
> 
> Thanks!
> Ruiling
> > +    //result = sin(0.1f + result); /* calls internal, (1) no reduction */
> > +    //result = sin(2.f + result); /* calls internal, (2) fast reduction */
> > +    //result = sin(4001 + result); /* calls internal, (3) slow reduction */
> > +    result *= 0x1p-16;
> > +#endif
> > +  }
> > +
> > +  dst[get_global_id(0)] = result;
> > +}
> > +