[Beignet] [PATCH] Update optimization tips.

Mon Dec 15 18:22:33 PST 2014

LGTM, thanks.

> -----Original Message-----
> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of
> Zhigang Gong
> Sent: Monday, December 15, 2014 09:18
> To: Gong, Zhigang
> Cc: beignet at lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH] Update optimization tips.
> 
> Ping for review.
> 
> On Fri, Dec 05, 2014 at 04:15:03PM +0800, Zhigang Gong wrote:
> > Signed-off-by: Zhigang Gong <zhigang.gong at intel.com>
> > ---
> >  docs/optimization-guide.mdwn | 106
> > +++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 92 insertions(+), 14 deletions(-)
> >
> > diff --git a/docs/optimization-guide.mdwn
> > b/docs/optimization-guide.mdwn index 8fb29a6..5f648fb 100644
> > --- a/docs/optimization-guide.mdwn
> > +++ b/docs/optimization-guide.mdwn
> > @@ -1,28 +1,106 @@
> >  Optimization Guide
> >  ====================
> >
> > -All the SIMD optimization principle also apply to Beignet optimization.
> > -Furthermore, there are some special tips for Beignet optimization.
> > +All the SIMD optimization principles such as avoid branching and
> > +don't waste SIMD lanes are also applied to Beignet optimization on
> > +Gen platform. Furthermore, there are some special tips for Beignet
> optimization.
> >
> >  1. It is recommended to choose multiple of 16 work group size. Too much
> SLM usage may reduce parallelism at group level.
> >     If kernel uses large amount SLM, it's better to choose large work group
> size. Please refer the following table for recommendations
> > -   with some SLM usage.
> > +   with some SLM usage.
> > +
> >  | Amount of SLM | 0  | 4K | 8K  | 16K | 32K |
> > -| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> > +| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> > +
> > +   Actually, a good method is to pass in a NULL local work size parameter to
> let the driver to determine the best work group size for you.
> > +
> > +1. Use shorter data type could get better performance. There are also
> some detail tips as below.
> > +  1. Use uchar16/ushort8/uint4 as much as possible.
> > +  1. If the data has to be DWORD(4 bytes) unaligned, it's better to use
> vload16(for char), vload8(for short) to load the data.
> > +  1. Read/write scalar char/short will be very slow and may lead to be even
> worse performance than use DW type.
> > +
> > +1. Avoid too strided global/constant memory access.
> > +
> > +  Some examples are as below (assume the data is a cache line aligned
> global/constant uint buffer, and the work group size is 16 with SIMD16 mode):
> > +  `uint x = data[get_global_id(0)];      //best, only read one cache line, no
> bandwidth waste`
> > +  `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines,
> > + not good, waste half of the bandwidth`  `uint x =
> > + data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste
> > + 15/16 bandwidth.`
> > +
> > +1. Avoid dynamic indexed private buffer if possible.
> > +  Currently, private buffer access in beignet backend is very slow. Many
> small private buffer could be optimized by the compiler.
> > +  But the following type of dynamic indexed private buffer could not be
> optimized:
> > +
> > +`
> > +     uint private_buffer[32];
> > +     for (i = 0; i < xid; i++)    {
> > +       int dynamic_idx = src[xid];
> > +       private_buffer[dynamic_idx % 10] = src1[xid];
> > +       ...
> > +     }
> > +`
> > +
> > +   The following case is OK.
> > +
> > +`
> > +     ...
> > +     uint private_buffer[32];
> > +     for (i = 0; i < xid; i++)    {
> > +       private_buffer[xid % 32] = src1[xid];
> > +       ...
> > +     }
> > +`
> > +
> > +
> > +1. Use SLM to reduce the memory bandwidth requirement if possible.
> > +
> > +  On Gen platform, SLM is in GPU's L3 cache, if it could be used to
> > + share data between work items, it could reduce the memory bandwidth
> > + on the system memory bus. This will be a big win for many I/O
> > + intensity  kernels.
> > +
> > +1. Avoid SLM bank conflicts.
> > +
> > +  SLM is banked at a DWORD granularity, totally 16 banks. Access on
> > + the same  bank with different addresses will lead to a conflicts. It should
> be avoided.
> > +  The worst case is:
> > +
> > +  Some examples are as below (assume the data is a cache line aligned
> global/constant uint buffer, and the work group size is 16 with SIMD16 mode):
> > +  `uint x = data[get_global_id(0)];      //best, no bank conflicts, no
> bandwidth waste`
> > +  `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no
> > + bandwidth waste`  `uint x = data[get_global_id(0) * 2]; //bad, work
> > + item (id) and (id + 8) conflict to each other, waste half of the
> > + bandwidth`  `uint x = data[get_global_id(0) * 16]; //worst, all work
> > + items conflicts on the zero bank, waste 15/16 bandwidth.`
> > +
> > +1. Zero copy on buffer creation. (Only avaliable in git master branch and
> Release\_v1.0 branch).
> > +
> > +  Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page
> > + aligned host pointer which also has multiple page size. Beignet
> > + will leverage userptr to create a buffer object by using that  host
> > + buffer directly. If possible, you can also use
> > + CL\_MEM\_ALLOC\_HOST\_PTR  flag to let the driver to allocate a
> > + userptr qualified buffer which could  guarantee zero copy on the buffer.
> > +
> > +  Please be noted, this feature requires the kernel is newer than 3.16 and
> the libdrm version is newer than 2.4.57.
> > +
> > +1. Use float data type as much as possible.
> > +
> > +  The two ALUs of one EU could both handle float data,but only one of
> them could handle non-float type data.
> > +
> > +1. Avoid using long.
> > +
> > +  GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long
> support is still under development.
> > +
> > +1. Declare small constant buffer with content in the kernel if possible.
> >
> > -2. GEN7's read/write on global memory with DWORD and DWORD4 are
> significantly faster than read/write on BYTE/WORD.
> > -   Use DWORD or DWORD4 to access data in global memory if possible. If
> you cannot avoid the byte/word access, try to do it on SLM.
> > +  For a small constant buffer, it's better to declare it in the kernel directly
> with "const \_\_constant". The compiler may optimize it if the buffer is
> defined inside kernel.
> >
> > -3. Use float data type as much as possible.
> > +1. Avoid unnecessary synchronizations.
> >
> > -4. Avoid using long. GEN7's performance for long integer is poor.
> > +  Both in the runtime and in the kernel. For examples, clFinish and
> clWaitForEvents in runtime and barrier() in the kernel.
> >
> > -5. If there is a small constant buffer, define it in the kernel instead of using
> the constant buffer argument if possible.
> > -   The compiler may optimize it if the buffer is defined inside kernel.
> > +1. Consider native version of math built-ins, such as native\_sin,
> native\_cos, if your kernel is not precision sensitive.
> >
> > -6. Avoid unnecessary synchronizations, both in the runtime and in the
> kernel.  For examples, clFinish and clWaitForEvents in runtime
> > -   and barrier() in the kernel.
> > +1. Use fma()/mad() as much as possible.
> >
> > -7. Consider native version of math built-ins, such as native\_sin,
> native\_cos, if your kernel is not precision sensitive.
> > +1. Try to eliminate branching as much as possible.
> >
> > -8. Try to eliminate branching as much as possible. For example using min,
> max, clamp or select built-ins instead of if/else if possible.
> > +  For example using min, max, clamp or select built-ins instead of if/else if
> possible.
> > --
> > 1.8.3.2
> >
> > _______________________________________________
> > Beignet mailing list
> > Beignet at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet