[Beignet] [PATCH] Add optimization guide.

Tue Jun 24 07:59:29 PDT 2014

Some wording comments below:
On Tue, Jun 24, 2014 at 12:28:51AM +0800, Yang Rong wrote:
> Signed-off-by: Yang Rong <rong.r.yang at intel.com>
> ---
>  docs/optimization-guide.mdwn | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>  create mode 100644 docs/optimization-guide.mdwn
> 
> diff --git a/docs/optimization-guide.mdwn b/docs/optimization-guide.mdwn
> new file mode 100644
> index 0000000..4f620ab
> --- /dev/null
> +++ b/docs/optimization-guide.mdwn
> @@ -0,0 +1,27 @@
> +Optimization Guide
> +====================
> +
> +All the SIMD optimization principle also apply to Beignet optimization.  
> +Furthermore, there are some special tips for Beignet optimization.
> +
> +1. Choose the work group size multiple by 16 is recommendation. SLM also affect parallelism.
      It is recommended to choose a work group size that is a multiple of 16. Too much SLM usage may reduce parallelism at the work group level.
> +If kernel use large amount SLM, recommend use large  work group size.  The following is a recommendations work group size.
   If the kernel uses a large amount of SLM, it is better to choose a large work group size. Please refer to the following table for the
   recommended work group size at a given SLM usage.
> +| Amount of SLM | 0  | 4K | 8K  | 16K | 32K |  
> +| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +
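For readers who want to apply the table from host code, here is a minimal C sketch of the lookup. The function name `suggested_work_group_size` is hypothetical, and rounding SLM amounts that fall between two table columns up to the next column is my assumption; the table itself only gives the five listed points.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Beignet): maps the amount of SLM a
 * kernel uses, in bytes, to the work group size suggested by the table.
 * Amounts between two table columns are rounded up to the next column;
 * that rounding is an assumption, the table only lists exact points. */
size_t suggested_work_group_size(size_t slm_bytes)
{
    if (slm_bytes == 0)
        return 16;
    if (slm_bytes <= 4 * 1024)
        return 64;
    if (slm_bytes <= 8 * 1024)
        return 128;
    if (slm_bytes <= 16 * 1024)
        return 256;
    return 512; /* 32K column, also used for anything above 16K */
}
```

The result would typically be passed as the local work size when enqueueing the kernel, subject to whatever limits the device reports.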
> +2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.  
> +   Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
> +
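To make tip 2 concrete, here is a plain C sketch of the access pattern, not kernel code. It sums a byte buffer once with one load per byte and once with one 32-bit load per four bytes. Both function names are hypothetical. In an actual kernel the second form would roughly correspond to reading the buffer through a `__global uint *` (or a `uint4` pointer for DWORD4) and unpacking the bytes in registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference version: one 8-bit load per element. */
uint32_t sum_bytes_bytewise(const uint8_t *buf, size_t n_bytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n_bytes; i++)
        sum += buf[i];
    return sum;
}

/* DWORD version: one 32-bit load per four elements, then the four
 * bytes are extracted with shifts and masks. Only whole 4-byte groups
 * are processed, so n_bytes is assumed to be a multiple of 4. */
uint32_t sum_bytes_dword(const uint8_t *buf, size_t n_bytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 4 <= n_bytes; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, sizeof w); /* single 32-bit load */
        sum += (w & 0xffu) + ((w >> 8) & 0xffu)
             + ((w >> 16) & 0xffu) + (w >> 24);
    }
    return sum;
}
```

Because the sum does not depend on byte order, both versions return the same value on little-endian and big-endian hosts.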
> +3. GEN7's float performance is better than that of int, try to operate on float instead of using int if possible.
      Use float as much as possible.
> +
> +4. Avoid using long. GEN7's performance for long integer is poor.
> +
> +5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.  
> +   The compiler may optimize it if the buffer is defined inside kernel.
> +
> +6. Avoid unnecessary synchronizations, both in the runtime and in the kernel.  For examples, clFinish and clWaitForEvents in runtime  
> +   and barrier() in the kernel.
> +
> +7. Consider native version of math built-ins, such as native_sin, native_cos, if your kernel is not precision sensitive.
> +
> +8. Try to minimize branching. For example using min, max, clamp or select built-ins instead of if/else if possible.
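As an illustration of tip 8, here is a small C sketch that contrasts an if/else clamp with a min/max formulation. The helpers `min_f` and `max_f` are hypothetical stand-ins for the OpenCL `min` and `max` built-ins; in a kernel one would call the built-ins (or `clamp`) directly. Whether a given compiler actually emits branch-free code for either form is not guaranteed.

```c
/* Stand-ins for the OpenCL min()/max() built-ins (hypothetical names). */
static float min_f(float a, float b) { return a < b ? a : b; }
static float max_f(float a, float b) { return a > b ? a : b; }

/* Branchy form: explicit if/else chain. */
float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo)
        return lo;
    else if (x > hi)
        return hi;
    return x;
}

/* min/max form: same result as clamp_branchy for lo <= hi and
 * non-NaN inputs, written without an if/else chain. */
float clamp_minmax(float x, float lo, float hi)
{
    return min_f(max_f(x, lo), hi);
}
```

The two functions are meant to be interchangeable for ordinary inputs, so the min/max form can replace the if/else form without changing results.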
> -- 
> 1.8.3.2
> 