[Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Wed Aug 20 09:06:32 PDT 2014

Zhigang-

I tested an OpenCL kernel that does native multiplies and adds, as well as a version using the fmad method, and for both Beignet and VPG the native instructions performed better. The relative ratio of VPG to Beignet performance for a float16 kernel with a workgroup size of 64 and an 8MB workload is 46:1 (these sizes performed the best). I don't think the MAD instruction is the issue; rather, I suspect VPG is using all SIMD lanes while Beignet is "scalarizing" all the instructions.
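
For readers following the thread, the two variants being compared can be sketched in plain C++ as below. This is only an illustration under assumptions: the function names are hypothetical, the loop body is a guess at the general shape of such a flops test rather than the actual kernel, and scalar float stands in for the OpenCL float16 type.

```cpp
#include <cmath>

// Hypothetical stand-in for the "native" variant: a separate multiply and add
// in each step of a dependent chain.
float chainMulAdd(float x, float y, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    x = y * x + y;
    y = x * y + x;
  }
  return y;
}

// Hypothetical stand-in for the fused variant: the same chain written with an
// explicit fused multiply-add, the C++ analogue of a MAD-style instruction.
float chainFma(float x, float y, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    x = std::fma(y, x, y);
    y = std::fma(x, y, x);
  }
  return y;
}
```

Both functions compute the same chain; in general they may differ in the last bits because the fused form rounds once per step instead of twice.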

I did try the patch you suggested and did see a noticeable difference in performance. 

thanks
Tony

-----Original Message-----
From: Zhigang Gong [mailto:zhigang.gong at linux.intel.com] 
Sent: Tuesday, August 19, 2014 7:59 PM
To: Moore, Anthony W
Cc: Song, Ruiling; beignet at lists.freedesktop.org
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Tony,

On Tue, Aug 19, 2014 at 03:40:12PM +0000, Moore, Anthony W wrote:
> Ruiling-
> Thanks for your response. Yes, that makes sense. That problem did not occur to me since I saw the utests pass after I tried the change, but I understand that some tests could fail. Regarding the DWORD vs BYTE loading performance you mention, I was curious about it and can confirm that a 1 byte gather is much slower than a DWORD load.
> The performance testing I've been doing compares Beignet and Intel's closed source driver. We're seeing much better performance from Intel's driver on Haswell using a custom set of kernels available in OpenCV. I've been focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x longer, and I suspect that the 3 byte loads are contributing to the slowdown. I reduced this to a single load as an experiment and we came very close to Intel's driver. My suspicion is that this is the case for many kernels, which is why I'm trying to combine loads where possible. I'm wondering how difficult it would be to adapt the readByteAsDWord to extract multiple bytes by reading successive DWORDS or even doing an unaligned oblock read.
I will look into this issue. I hope we can get comparable performance even with unaligned reads.
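
The idea of extracting several bytes from successive aligned DWORD reads can be sketched in plain C++ as follows. This is a minimal illustration, not Beignet code: the helper name gatherBytesViaDWords is hypothetical, it assumes little-endian byte order within each DWORD, and it assumes the caller keeps the request inside the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: gather `count` (1..4) bytes starting at an arbitrary
// byte offset using only 4-byte-aligned DWORD reads. The bytes are returned
// packed into the low bits of the result, first byte in the lowest position.
std::uint32_t gatherBytesViaDWords(const std::vector<std::uint32_t>& dwords,
                                   std::size_t byteOffset, unsigned count) {
  assert(count >= 1 && count <= 4);
  const std::size_t index = byteOffset / 4;  // aligned DWORD holding the first byte
  const unsigned lane = static_cast<unsigned>(byteOffset % 4);

  // First aligned read.
  std::uint64_t window = dwords[index];
  // Second aligned read only when the request crosses a DWORD boundary.
  if (lane + count > 4)
    window |= static_cast<std::uint64_t>(dwords[index + 1]) << 32;

  // Shift out the bytes before byteOffset, then keep only `count` bytes.
  window >>= lane * 8;
  const std::uint64_t mask = (1ull << (count * 8)) - 1;
  return static_cast<std::uint32_t>(window & mask);
}
```

With the DWORDs 0x44332211 and 0x88776655, a 3-byte request at byte offset 3 spans both words and yields 0x665544. The cost is at most two aligned reads plus a shift and a mask, in place of up to four separate byte gathers.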

> Another experiment we did was to compare the available flops using a kernel found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver came very close to the theoretical GFLOPs on HSW and BYT with float16, but Beignet was much lower. Looking at the IR/asm it seems that it's not taking advantage of SIMD. Does the backend always split up vectors or only in certain situations?

I had a quick look at the benchmark. It's a very simple loop test.
I assumed some mad instructions would be used, but they are not. I found that the mad pattern matching is disabled by my patch under SIMD16 for some reason. But my recent experience shows that mad should be better even if it generates the same number of instructions.
You may try reverting the following patch to see whether it has some performance impact.
commit d73170df3508d18e250d0af118e3b7955401194f
Author: Zhigang Gong <zhigang.gong at intel.com>
Date:   Thu May 15 13:35:00 2014 +0800

    GBE: disable mad for some cases.

Another point is that beignet doesn't yet recognize a do/while loop as a structured basic block, so we still use the unstructured fashion to encode that block and introduce several additional instructions to maintain the software PCIPs. This should hurt performance, and we will continue to optimize beignet to close this gap. Before we get things done, could you help provide more details of the performance comparison (a relative ratio is good enough, no need for real scores) for different test data types, for example float/float2/float4, and for data set sizes of 512K/1M with a work group size of 256?

Thanks for your valuable feedback.
Zhigang.

> Thanks,
> Tony
> 
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, August 18, 2014 7:50 PM
> To: Moore, Anthony W; beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.
> 
> Hi Tony,
> 
> In short, it is not easy to handle merged 2, 3 or 4 byte reads/writes in the backend.
> Currently, if you only change the logic in llvm_loadstore_optimization.cpp to merge byte reads/writes, you may get wrong results if the starting address of the merged memory access is not 4-byte-aligned.
> The later steps will simply treat a 4 byte load as 1 int load (an int load always needs a 4-byte-aligned address).
> And on Gen7, int load is much better than byte load. So you will see 
> significant
> 
> See emitByteGather() in gen_insn_selection.cpp:
>
> if(valueNum > 1) {
>   // read 4 byte as 1 int and unpack it, here starting address must be 4-byte-aligned
> } else {
>   GBE_ASSERT(insn.getValueNum() == 1);
>   // read 1 int and extract actual byte using some logic-shift
>   // and you can see here it is not too easy to handle 2, 3 or 4 bytes read.
> }
> I am not sure if I explained it clearly.
> 
> Could you share more details about your test with me? Which OpenCV kernels or related performance tests in OpenCV are you using? Then I could do some performance testing.
> I am not sure whether you have hit something like vload4(int offset, uchar * p). The OpenCL spec does not ensure that the address 'p' is 4-byte-aligned.
> If it is a uchar4* read/write, things are different: the address is 4-byte-aligned, and the performance is much better than vload4 of uchar* in beignet.
> 
> Thanks!
> Ruiling
> 
> -----Original Message-----
> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf 
> Of Moore, Anthony W
> Sent: Monday, August 18, 2014 11:47 PM
> To: beignet at lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.
> 
> Hi,
> 
> For this patch http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 16-bit and saw some significant performance improvement with some of OpenCV's kernels.
> 
> +        // we only support DWORD data type merge
> +        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
> 
> Thanks!
> Tony
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet