[Beignet] [PATCH] Update optimization tips.
Zhigang Gong
zhigang.gong at linux.intel.com
Sun Dec 14 17:18:20 PST 2014
Ping for review.
On Fri, Dec 05, 2014 at 04:15:03PM +0800, Zhigang Gong wrote:
> Signed-off-by: Zhigang Gong <zhigang.gong at intel.com>
> ---
> docs/optimization-guide.mdwn | 106 +++++++++++++++++++++++++++++++++++++------
> 1 file changed, 92 insertions(+), 14 deletions(-)
>
> diff --git a/docs/optimization-guide.mdwn b/docs/optimization-guide.mdwn
> index 8fb29a6..5f648fb 100644
> --- a/docs/optimization-guide.mdwn
> +++ b/docs/optimization-guide.mdwn
> @@ -1,28 +1,106 @@
> Optimization Guide
> ====================
>
> -All the SIMD optimization principle also apply to Beignet optimization.
> -Furthermore, there are some special tips for Beignet optimization.
> +All the SIMD optimization principles such as avoid branching and don't waste
> +SIMD lanes are also applied to Beignet optimization on Gen platform. Furthermore,
> +there are some special tips for Beignet optimization.
>
> 1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level.
> If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
> - with some SLM usage.
> + with some SLM usage.
> +
> | Amount of SLM | 0 | 4K | 8K | 16K | 32K |
> -| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +
> + Actually, a good method is to pass in a NULL local work size parameter to let the driver to determine the best work group size for you.
> +
> +1. Use shorter data type could get better performance. There are also some detail tips as below.
> + 1. Use uchar16/ushort8/uint4 as much as possible.
> + 1. If the data has to be DWORD(4 bytes) unaligned, it's better to use vload16(for char), vload8(for short) to load the data.
> + 1. Read/write scalar char/short will be very slow and may lead to be even worse performance than use DW type.
> +
> +1. Avoid too strided global/constant memory access.
> +
> + Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):
> + `uint x = data[get_global_id(0)]; //best, only read one cache line, no bandwidth waste`
> + `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines, not good, waste half of the bandwidth`
> + `uint x = data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste 15/16 bandwidth.`
> +
> +1. Avoid dynamic indexed private buffer if possible.
> + Currently, private buffer access in beignet backend is very slow. Many small private buffer could be optimized by the compiler.
> + But the following type of dynamic indexed private buffer could not be optimized:
> +
> +`
> + uint private_buffer[32];
> + for (i = 0; i < xid; i++) {
> + int dynamic_idx = src[xid];
> + private_buffer[dynamic_idx % 10] = src1[xid];
> + ...
> + }
> +`
> +
> + The following case is OK.
> +
> +`
> + ...
> + uint private_buffer[32];
> + for (i = 0; i < xid; i++) {
> + private_buffer[xid % 32] = src1[xid];
> + ...
> + }
> +`
> +
> +
> +1. Use SLM to reduce the memory bandwidth requirement if possible.
> +
> + On Gen platform, SLM is in GPU's L3 cache, if it could be used to
> + share data between work items, it could reduce the memory bandwidth
> + on the system memory bus. This will be a big win for many I/O intensity
> + kernels.
> +
> +1. Avoid SLM bank conflicts.
> +
> + SLM is banked at a DWORD granularity, totally 16 banks. Access on the same
> + bank with different addresses will lead to a conflicts. It should be avoided.
> + The worst case is:
> +
> + Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):
> + `uint x = data[get_global_id(0)]; //best, no bank conflicts, no bandwidth waste`
> + `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no bandwidth waste`
> + `uint x = data[get_global_id(0) * 2]; //bad, work item (id) and (id + 8) conflict to each other, waste half of the bandwidth`
> + `uint x = data[get_global_id(0) * 16]; //worst, all work items conflicts on the zero bank, waste 15/16 bandwidth.`
> +
> +1. Zero copy on buffer creation. (Only avaliable in git master branch and Release\_v1.0 branch).
> +
> + Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page
> + aligned host pointer which also has multiple page size. Beignet
> + will leverage userptr to create a buffer object by using that
> + host buffer directly. If possible, you can also use CL\_MEM\_ALLOC\_HOST\_PTR
> + flag to let the driver to allocate a userptr qualified buffer which could
> + guarantee zero copy on the buffer.
> +
> + Please be noted, this feature requires the kernel is newer than 3.16 and the libdrm version is newer than 2.4.57.
> +
> +1. Use float data type as much as possible.
> +
> + The two ALUs of one EU could both handle float data,but only one of them could handle non-float type data.
> +
> +1. Avoid using long.
> +
> + GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long support is still under development.
> +
> +1. Declare small constant buffer with content in the kernel if possible.
>
> -2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.
> - Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
> + For a small constant buffer, it's better to declare it in the kernel directly with "const \_\_constant". The compiler may optimize it if the buffer is defined inside kernel.
>
> -3. Use float data type as much as possible.
> +1. Avoid unnecessary synchronizations.
>
> -4. Avoid using long. GEN7's performance for long integer is poor.
> + Both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime and barrier() in the kernel.
>
> -5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.
> - The compiler may optimize it if the buffer is defined inside kernel.
> +1. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
>
> -6. Avoid unnecessary synchronizations, both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime
> - and barrier() in the kernel.
> +1. Use fma()/mad() as much as possible.
>
> -7. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
> +1. Try to eliminate branching as much as possible.
>
> -8. Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.
> + For example using min, max, clamp or select built-ins instead of if/else if possible.
> --
> 1.8.3.2
>
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
More information about the Beignet
mailing list