[Beignet] [PATCH] Update optimization tips.

Fri Dec 5 00:15:03 PST 2014

Signed-off-by: Zhigang Gong <zhigang.gong at intel.com>
---
 docs/optimization-guide.mdwn | 106 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 92 insertions(+), 14 deletions(-)

diff --git a/docs/optimization-guide.mdwn b/docs/optimization-guide.mdwn
index 8fb29a6..5f648fb 100644
--- a/docs/optimization-guide.mdwn
+++ b/docs/optimization-guide.mdwn
@@ -1,28 +1,106 @@
 Optimization Guide
 ====================
 
-All the SIMD optimization principle also apply to Beignet optimization.  
-Furthermore, there are some special tips for Beignet optimization.
+All the SIMD optimization principles such as avoid branching and don't waste 
+SIMD lanes are also applied to Beignet optimization on Gen platform. Furthermore,
+there are some special tips for Beignet optimization.
 
 1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level. 
    If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
-   with some SLM usage.  
+   with some SLM usage.
+
 | Amount of SLM | 0  | 4K | 8K  | 16K | 32K |  
-| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
+| WorkGroup size| 16 | 64 | 128 | 256 | 512 |  
+
+   Actually, a good method is to pass in a NULL local work size parameter to let the driver to determine the best work group size for you.
+
+1. Use shorter data type could get better performance. There are also some detail tips as below.
+  1. Use uchar16/ushort8/uint4 as much as possible.
+  1. If the data has to be DWORD(4 bytes) unaligned, it's better to use vload16(for char), vload8(for short) to load the data.
+  1. Read/write scalar char/short will be very slow and may lead to be even worse performance than use DW type.
+
+1. Avoid too strided global/constant memory access.
+  
+  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
+  `uint x = data[get_global_id(0)];      //best, only read one cache line, no bandwidth waste`  
+  `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines, not good, waste half of the bandwidth`  
+  `uint x = data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste 15/16 bandwidth.`  
+
+1. Avoid dynamic indexed private buffer if possible.
+  Currently, private buffer access in beignet backend is very slow. Many small private buffer could be optimized by the compiler.
+  But the following type of dynamic indexed private buffer could not be optimized:
+
+`
+     uint private_buffer[32];
+     for (i = 0; i < xid; i++)    {
+       int dynamic_idx = src[xid];
+       private_buffer[dynamic_idx % 10] = src1[xid];
+       ...
+     }
+`
+
+   The following case is OK.
+
+`
+     ...
+     uint private_buffer[32];
+     for (i = 0; i < xid; i++)    {
+       private_buffer[xid % 32] = src1[xid];
+       ...
+     }
+`
+
+
+1. Use SLM to reduce the memory bandwidth requirement if possible.
+  
+  On Gen platform, SLM is in GPU's L3 cache, if it could be used to  
+  share data between work items, it could reduce the memory bandwidth  
+  on the system memory bus. This will be a big win for many I/O intensity  
+  kernels.
+
+1. Avoid SLM bank conflicts.
+
+  SLM is banked at a DWORD granularity, totally 16 banks. Access on the same
+  bank with different addresses will lead to a conflicts. It should be avoided.
+  The worst case is:
+
+  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
+  `uint x = data[get_global_id(0)];      //best, no bank conflicts, no bandwidth waste`  
+  `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no bandwidth waste`  
+  `uint x = data[get_global_id(0) * 2]; //bad, work item (id) and (id + 8) conflict to each other, waste half of the bandwidth`  
+  `uint x = data[get_global_id(0) * 16]; //worst, all work items conflicts on the zero bank, waste 15/16 bandwidth.`  
+
+1. Zero copy on buffer creation. (Only avaliable in git master branch and Release\_v1.0 branch).
+
+  Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page  
+  aligned host pointer which also has multiple page size. Beignet  
+  will leverage userptr to create a buffer object by using that  
+  host buffer directly. If possible, you can also use CL\_MEM\_ALLOC\_HOST\_PTR  
+  flag to let the driver to allocate a userptr qualified buffer which could  
+  guarantee zero copy on the buffer.
+
+  Please be noted, this feature requires the kernel is newer than 3.16 and the libdrm version is newer than 2.4.57.
+
+1. Use float data type as much as possible.
+
+  The two ALUs of one EU could both handle float data,but only one of them could handle non-float type data.
+
+1. Avoid using long.
+
+  GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long support is still under development.
+
+1. Declare small constant buffer with content in the kernel if possible.
 
-2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.  
-   Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
+  For a small constant buffer, it's better to declare it in the kernel directly with "const \_\_constant". The compiler may optimize it if the buffer is defined inside kernel.
 
-3. Use float data type as much as possible.
+1. Avoid unnecessary synchronizations.
 
-4. Avoid using long. GEN7's performance for long integer is poor.
+  Both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime and barrier() in the kernel.
 
-5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.  
-   The compiler may optimize it if the buffer is defined inside kernel.
+1. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
 
-6. Avoid unnecessary synchronizations, both in the runtime and in the kernel.  For examples, clFinish and clWaitForEvents in runtime  
-   and barrier() in the kernel.
+1. Use fma()/mad() as much as possible.
 
-7. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
+1. Try to eliminate branching as much as possible.
 
-8. Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.
+  For example using min, max, clamp or select built-ins instead of if/else if possible.
-- 
1.8.3.2