[Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Zhigang Gong zhigang.gong at linux.intel.com
Tue Aug 19 19:59:11 PDT 2014


Tony,

On Tue, Aug 19, 2014 at 03:40:12PM +0000, Moore, Anthony W wrote:
> Ruiling-
> Thanks for your response. Yes, that makes sense. That problem did not occur to me since I saw the utests pass after I tried the change, but I understand that some tests could still fail. Regarding the DWORD vs. BYTE loading performance you mention, I was curious about it and can confirm that a 1-byte gather is much slower than a DWORD load.
> The performance testing I've been doing compares Beignet against Intel's closed-source driver. We're seeing much better performance from Intel's driver on Haswell using a custom set of kernels from OpenCV. I've been focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x longer, and I suspect that the 3-byte loads are contributing to the slowdown. I reduced this to a single load as an experiment and we came very close to Intel's driver. My suspicion is that this is the case for many kernels, which is why I'm trying to combine loads where possible. I'm wondering how difficult it would be to adapt readByteAsDWord to extract multiple bytes by reading successive DWORDs, or even to do an unaligned oblock read.
I will look into this issue. Hopefully we can get comparable performance even with an unaligned read.
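
For reference, here is a minimal sketch (hypothetical, not the actual OpenCV code) of the RGB2Gray access pattern you describe. Each of the three separate uchar loads per pixel goes through the byte-gather path, and because the per-pixel byte offset gid * 3 is not 4-byte-aligned in general, they cannot simply be merged into one DWORD read:

__kernel void rgb2gray(__global const uchar *src, __global uchar *dst)
{
    int gid  = (int)get_global_id(0);
    int base = gid * 3;   // byte offset of this pixel, not 4-byte-aligned in general

    // Three separate byte loads: each one ends up in the byte-gather path of the backend.
    uchar r = src[base + 0];
    uchar g = src[base + 1];
    uchar b = src[base + 2];

    // Integer approximation of the usual luma weights (77 + 151 + 28 = 256).
    dst[gid] = convert_uchar_sat((r * 77 + g * 151 + b * 28) >> 8);
}

Replacing those three byte loads with a single wider (possibly unaligned) read and extracting the bytes afterwards is the kind of change your single-load experiment approximates, and it is what I will look into.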

> Another experiment we did was to compare the available flops using a kernel found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver came very close to the theoretical GFLOPs on HSW and BYT with float16, but Beignet was much lower. Looking at the IR/asm it seems that it's not taking advantage of SIMD. Does the backend always split up vectors or only in certain situations?

I had a quick look at the benchmark. It's a very simple loop test.
I would expect some mad instructions to be used, but they are not. I found
that the mad pattern matching is disabled by my own patch under SIMD16 for
some reason. But my recent experience shows that mad should be better
even when it generates the same number of instructions.
You may try reverting the following patch to see whether it has any
performance impact:
commit d73170df3508d18e250d0af118e3b7955401194f
Author: Zhigang Gong <zhigang.gong at intel.com>
Date:   Thu May 15 13:35:00 2014 +0800

    GBE: disable mad for some cases.
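
The hot loop in that kind of FLOPS benchmark is typically a chain of multiply-adds. A hypothetical loop of that shape (placeholder names, not the bealto code itself) looks like:

__kernel void flops_test(__global const float16 *src, __global float16 *dst, int n)
{
    int gid = (int)get_global_id(0);
    float16 x = src[gid];
    float16 a = (float16)(1.0001f);
    float16 b = (float16)(0.0001f);
    for (int i = 0; i < n; i++)
        x = x * a + b;               // multiply-add: candidate for a single Gen mad
    dst[gid] = x;
}

With mad pattern matching enabled, each iteration should ideally be selected as a mad rather than a separate mul and add, so reverting the patch above is the quickest way to check whether that is the gap you are seeing.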

Another point is that beignet does not yet recognize a do/while loop as a structured basic block,
so we still encode that block in an unstructured fashion and introduce several additional
instructions to maintain the software PCIPs (see the sketch below). This hurts performance, and we
will continue to optimize beignet to close this gap. Before we get that done, could you
help by providing more details of the performance comparison (a relative ratio is good enough,
no need for real scores) for different data types, for example float/float2/float4,
and data set sizes of 512K/1M with a work group size of 256?
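
To illustrate the do/while point, a loop like the following (a hypothetical example, not from your kernels) currently takes the unstructured encoding path, including the extra instructions that maintain the software PCIPs:

__kernel void strided_sum(__global float *data, int n)
{
    int gid = (int)get_global_id(0);   // assumes gid < n
    float acc = 0.0f;
    int i = gid;
    // A do/while loop is not yet recognized as a structured block by beignet,
    // so it falls back to unstructured control flow.
    do {
        acc += data[i];
        i += (int)get_global_size(0);
    } while (i < n);
    data[gid] = acc;
}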

Thanks for your valuable feedback.
Zhigang.

> Thanks,
> Tony
> 
> -----Original Message-----
> From: Song, Ruiling 
> Sent: Monday, August 18, 2014 7:50 PM
> To: Moore, Anthony W; beignet at lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.
> 
> Hi Tony,
> 
> In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in the backend.
> Currently, if you only change the logic in llvm_loadstore_optimization.cpp to make byte reads/writes merge, you may get wrong results when the starting address of the merged memory access is not 4-byte-aligned.
> The later steps will simply treat a 4-byte load as one int load (an int load always needs a 4-byte-aligned address).
> And on Gen7, an int load is much better than a byte load, so you will see a significant performance difference.
> 
> See emitByteGather() in gen_insn_selection.cpp:
> 
> if (valueNum > 1) {
>   // read 4 bytes as 1 int and unpack it; the starting address here must be 4-byte-aligned
> } else {
>   GBE_ASSERT(insn.getValueNum() == 1);
>   // read 1 int and extract the actual byte using some logic-shift
>   // and you can see here it is not too easy to handle 2, 3 or 4 bytes read.
> }
> I am not sure if I have explained it clearly.
> 
> Could you share more details about your test? Which OpenCV kernels, or which related performance test in OpenCV? Then I could do some performance testing.
> I am not sure whether you are hitting something like vload4(int offset, uchar *p). The OpenCL spec does not guarantee that the address 'p' is 4-byte-aligned.
> If it is a uchar4* read/write, things are different: the address is 4-byte-aligned, and in beignet the performance is much better than vload4 on a uchar*.
> 
> Thanks!
> Ruiling
> 
> -----Original Message-----
> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of Moore, Anthony W
> Sent: Monday, August 18, 2014 11:47 PM
> To: beignet at lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.
> 
> Hi,
> 
> For this patch http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 16-bit types and saw a significant performance improvement with some of OpenCV's kernels.
> 
> +        // we only support DWORD data type merge
> +        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
> 
> Thanks!
> Tony
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet

