[Beignet] [PATCH] Update optimization tips.

Sun Dec 14 17:18:20 PST 2014

Ping for review.

On Fri, Dec 05, 2014 at 04:15:03PM +0800, Zhigang Gong wrote:
> Signed-off-by: Zhigang Gong <zhigang.gong at intel.com>
> ---
>  docs/optimization-guide.mdwn | 106 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 92 insertions(+), 14 deletions(-)
> 
> diff --git a/docs/optimization-guide.mdwn b/docs/optimization-guide.mdwn
> index 8fb29a6..5f648fb 100644
> --- a/docs/optimization-guide.mdwn
> +++ b/docs/optimization-guide.mdwn
> @@ -1,28 +1,106 @@
>  Optimization Guide
>  ====================
>  
> -All the SIMD optimization principle also apply to Beignet optimization.  
> -Furthermore, there are some special tips for Beignet optimization.
> +All the SIMD optimization principles such as avoid branching and don't waste 
> +SIMD lanes are also applied to Beignet optimization on Gen platform. Furthermore,
> +there are some special tips for Beignet optimization.
>  
>  1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level. 
>     If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
> -   with some SLM usage.  
> +   with some SLM usage.
> +
>  | Amount of SLM | 0  | 4K | 8K  | 16K | 32K |  
> -| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +| WorkGroup size| 16 | 64 | 128 | 256 | 512 |  
> +
> +   Actually, a good method is to pass in a NULL local work size parameter to let the driver to determine the best work group size for you.
> +
> +1. Use shorter data type could get better performance. There are also some detail tips as below.
> +  1. Use uchar16/ushort8/uint4 as much as possible.
> +  1. If the data has to be DWORD(4 bytes) unaligned, it's better to use vload16(for char), vload8(for short) to load the data.
> +  1. Read/write scalar char/short will be very slow and may lead to be even worse performance than use DW type.
> +
> +1. Avoid too strided global/constant memory access.
> +  
> +  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
> +  `uint x = data[get_global_id(0)];      //best, only read one cache line, no bandwidth waste`  
> +  `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines, not good, waste half of the bandwidth`  
> +  `uint x = data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste 15/16 bandwidth.`  
> +
> +1. Avoid dynamic indexed private buffer if possible.
> +  Currently, private buffer access in beignet backend is very slow. Many small private buffer could be optimized by the compiler.
> +  But the following type of dynamic indexed private buffer could not be optimized:
> +
> +`
> +     uint private_buffer[32];
> +     for (i = 0; i < xid; i++)    {
> +       int dynamic_idx = src[xid];
> +       private_buffer[dynamic_idx % 10] = src1[xid];
> +       ...
> +     }
> +`
> +
> +   The following case is OK.
> +
> +`
> +     ...
> +     uint private_buffer[32];
> +     for (i = 0; i < xid; i++)    {
> +       private_buffer[xid % 32] = src1[xid];
> +       ...
> +     }
> +`
> +
> +
> +1. Use SLM to reduce the memory bandwidth requirement if possible.
> +  
> +  On Gen platform, SLM is in GPU's L3 cache, if it could be used to  
> +  share data between work items, it could reduce the memory bandwidth  
> +  on the system memory bus. This will be a big win for many I/O intensity  
> +  kernels.
> +
> +1. Avoid SLM bank conflicts.
> +
> +  SLM is banked at a DWORD granularity, totally 16 banks. Access on the same
> +  bank with different addresses will lead to a conflicts. It should be avoided.
> +  The worst case is:
> +
> +  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
> +  `uint x = data[get_global_id(0)];      //best, no bank conflicts, no bandwidth waste`  
> +  `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no bandwidth waste`  
> +  `uint x = data[get_global_id(0) * 2]; //bad, work item (id) and (id + 8) conflict to each other, waste half of the bandwidth`  
> +  `uint x = data[get_global_id(0) * 16]; //worst, all work items conflicts on the zero bank, waste 15/16 bandwidth.`  
> +
> +1. Zero copy on buffer creation. (Only avaliable in git master branch and Release\_v1.0 branch).
> +
> +  Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page  
> +  aligned host pointer which also has multiple page size. Beignet  
> +  will leverage userptr to create a buffer object by using that  
> +  host buffer directly. If possible, you can also use CL\_MEM\_ALLOC\_HOST\_PTR  
> +  flag to let the driver to allocate a userptr qualified buffer which could  
> +  guarantee zero copy on the buffer.
> +
> +  Please be noted, this feature requires the kernel is newer than 3.16 and the libdrm version is newer than 2.4.57.
> +
> +1. Use float data type as much as possible.
> +
> +  The two ALUs of one EU could both handle float data,but only one of them could handle non-float type data.
> +
> +1. Avoid using long.
> +
> +  GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long support is still under development.
> +
> +1. Declare small constant buffer with content in the kernel if possible.
>  
> -2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.  
> -   Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
> +  For a small constant buffer, it's better to declare it in the kernel directly with "const \_\_constant". The compiler may optimize it if the buffer is defined inside kernel.
>  
> -3. Use float data type as much as possible.
> +1. Avoid unnecessary synchronizations.
>  
> -4. Avoid using long. GEN7's performance for long integer is poor.
> +  Both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime and barrier() in the kernel.
>  
> -5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.  
> -   The compiler may optimize it if the buffer is defined inside kernel.
> +1. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
>  
> -6. Avoid unnecessary synchronizations, both in the runtime and in the kernel.  For examples, clFinish and clWaitForEvents in runtime  
> -   and barrier() in the kernel.
> +1. Use fma()/mad() as much as possible.
>  
> -7. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
> +1. Try to eliminate branching as much as possible.
>  
> -8. Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.
> +  For example using min, max, clamp or select built-ins instead of if/else if possible.
> -- 
> 1.8.3.2
> 
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet