[Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Tue Aug 19 10:54:14 PDT 2014

Since we're discussing performance, do you know what it would take to expose the Timestamp register to an OpenCL kernel? It would let people profile sections of their code. The assembly seems like it would just be a MOV, but all of the LLVM logic is foreign to me.
Thanks

-----Original Message-----
From: Moore, Anthony W 
Sent: Tuesday, August 19, 2014 8:40 AM
To: Song, Ruiling; beignet at lists.freedesktop.org
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Ruiling-
Thanks for your response. Yes, that makes sense. That problem did not occur to me since I saw the utests pass after I tried the change, but I understand that some tests could fail. Regarding the DWORD vs BYTE loading performance you mention, I was curious about it and can confirm that a 1-byte gather is much slower than a DWORD load.
The performance testing I've been doing compares Beignet with Intel's closed source driver. We're seeing much better performance from Intel's driver on Haswell using a custom set of kernels available in OpenCV. I've been focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x longer, and I suspect that the 3 byte loads are contributing to the slowdown. I reduced this to a single load as an experiment and we came very close to Intel's driver. My suspicion is that this is the case for many kernels, which is why I'm trying to combine loads where possible. I'm wondering how difficult it would be to adapt readByteAsDWord to extract multiple bytes by reading successive DWORDs or even doing an unaligned oblock read.
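
A minimal host-side sketch of the "read successive DWORDs" idea, written in plain C++ rather than as backend code. It only models the arithmetic: the helper names loadAlignedDWord and loadBytesViaDWords are hypothetical, memory is assumed little-endian and padded to a multiple of 4 bytes, and nothing here reflects how Beignet's instruction selection is actually structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of an aligned 32-bit load (little-endian byte order).
// 'mem' is assumed to be padded to a multiple of 4 bytes.
static uint32_t loadAlignedDWord(const std::vector<uint8_t>& mem, size_t dwordIndex) {
  size_t b = dwordIndex * 4;
  assert(b + 3 < mem.size());
  return uint32_t(mem[b]) | uint32_t(mem[b + 1]) << 8 |
         uint32_t(mem[b + 2]) << 16 | uint32_t(mem[b + 3]) << 24;
}

// Read 'count' (1..4) bytes starting at an arbitrary byte address using at
// most two aligned DWORD loads. The bytes come back packed in the low bits
// of the result, first byte lowest.
static uint32_t loadBytesViaDWords(const std::vector<uint8_t>& mem, size_t addr, unsigned count) {
  assert(count >= 1 && count <= 4);
  size_t idx = addr / 4;
  unsigned shift = unsigned(addr % 4) * 8;
  uint64_t window = loadAlignedDWord(mem, idx);
  // The second DWORD is only needed when the range crosses a DWORD boundary.
  if ((addr % 4) + count > 4)
    window |= uint64_t(loadAlignedDWord(mem, idx + 1)) << 32;
  uint64_t mask = (1ull << (count * 8)) - 1;
  return uint32_t((window >> shift) & mask);
}
```

For an RGB pixel at byte address 2, loadBytesViaDWords(mem, 2, 3) touches two DWORDs and returns the three channel bytes packed into one value, which can then be unpacked with shifts instead of three separate byte gathers.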
Another experiment we did was to compare the available flops using a kernel found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver came very close to the theoretical GFLOPs on HSW and BYT with float16, but Beignet was much lower. Looking at the IR/asm it seems that it's not taking advantage of SIMD. Does the backend always split up vectors or only in certain situations?
Thanks,
Tony

-----Original Message-----
From: Song, Ruiling 
Sent: Monday, August 18, 2014 7:50 PM
To: Moore, Anthony W; beignet at lists.freedesktop.org
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Hi Tony,

In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in the backend.
Currently, if you only change the logic in llvm_loadstore_optimization.cpp to merge byte reads/writes, you may get wrong results if the starting address of the merged memory access is not 4-byte-aligned.
The later steps will simply treat a 4-byte load as 1 int load (an int load always needs a 4-byte-aligned address).
And on Gen7, int load is much better than byte load. So you will see significant 

See emitByteGather() in gen_insn_selection.cpp:
if(valueNum > 1) {
  // read 4 byte as 1 int and unpack it, here starting address must be 4-byte-aligned
} else {
  GBE_ASSERT(insn.getValueNum() == 1);
  // read 1 int and extract actual byte using some logic-shift
  // and you can see here it is not too easy to handle 2, 3 or 4 bytes read.
}
I am not sure if I explained it clearly.
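
The two branches described for emitByteGather() can be modelled on the host to show where the alignment problem comes from. This is only a sketch under stated assumptions: readDWordAligned, readByteViaShift and readFourBytesPacked are hypothetical names, memory is little-endian, and a misaligned int load is modelled by simply dropping the low two address bits, which is an assumption about how the misuse would show up rather than a statement about the hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of an int load that requires a 4-byte-aligned address.
// Assumption: a misaligned address is rounded down to the containing DWORD.
static uint32_t readDWordAligned(const std::vector<uint8_t>& mem, size_t addr) {
  size_t b = addr & ~size_t(3);
  assert(b + 3 < mem.size());
  return uint32_t(mem[b]) | uint32_t(mem[b + 1]) << 8 |
         uint32_t(mem[b + 2]) << 16 | uint32_t(mem[b + 3]) << 24;
}

// valueNum == 1 path: read the containing DWORD, then shift the wanted
// byte down. This works for any byte address.
static uint8_t readByteViaShift(const std::vector<uint8_t>& mem, size_t addr) {
  uint32_t dw = readDWordAligned(mem, addr);
  return uint8_t(dw >> ((addr & 3) * 8));
}

// valueNum > 1 path: read 4 bytes as one int and unpack it. This is only
// correct when 'addr' is 4-byte-aligned.
static std::array<uint8_t, 4> readFourBytesPacked(const std::vector<uint8_t>& mem, size_t addr) {
  uint32_t dw = readDWordAligned(mem, addr);
  return {uint8_t(dw), uint8_t(dw >> 8), uint8_t(dw >> 16), uint8_t(dw >> 24)};
}
```

With mem holding the bytes 0 through 7, readFourBytesPacked(mem, 4) yields 4, 5, 6, 7 as expected, while readFourBytesPacked(mem, 5) in this model yields the same four bytes instead of 5, 6, 7, 8. That is the kind of wrong result a merged byte access can produce when its starting address is not 4-byte-aligned.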

Could you share more details about your test? Which OpenCV kernels, or which performance tests in OpenCV, are you using? Then I could do some performance testing myself.
I am not sure if you have run into something like vload4(int offset, uchar * p)? The OpenCL spec does not guarantee that the address 'p' is 4-byte-aligned.
If it is a uchar4* read/write, things are different: the address is 4-byte-aligned, and the performance is much better than vload4 of uchar* in Beignet.

Thanks!
Ruiling

-----Original Message-----
From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of Moore, Anthony W
Sent: Monday, August 18, 2014 11:47 PM
To: beignet at lists.freedesktop.org
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Hi,

For this patch http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 16-bit and saw some significant performance improvement with some of OpenCV's kernels.

+        // we only support DWORD data type merge
+        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
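
As a rough illustration of what such a filter implies, here is a toy model of merging successive accesses, assuming a simple list of (offset, size) records. It is not the LLVM pass from the patch: the Access and Merged types, the mergeSuccessive name and the default maximum width of 4 are all hypothetical, and only the "DWORD-sized elements only" condition mirrors the quoted check.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Access { uint32_t offset; uint32_t size; };                    // one scalar load
struct Merged { uint32_t offset; uint32_t count; uint32_t elemSize; }; // one (possibly vector) load

// Merge runs of successive 4-byte accesses into groups of up to 'maxWidth'
// elements. Accesses of any other size are passed through unmerged, which
// mirrors the DWORD-only filter in the quoted patch lines.
static std::vector<Merged> mergeSuccessive(const std::vector<Access>& in, uint32_t maxWidth = 4) {
  assert(maxWidth >= 1);
  std::vector<Merged> out;
  for (const Access& a : in) {
    bool mergeable = (a.size == 4);
    if (mergeable && !out.empty()) {
      Merged& last = out.back();
      bool contiguous = last.offset + last.count * 4 == a.offset;
      if (last.elemSize == 4 && last.count < maxWidth && contiguous) {
        ++last.count;  // extend the current group instead of starting a new one
        continue;
      }
    }
    out.push_back({a.offset, 1u, a.size});
  }
  return out;
}
```

In this model, three DWORD loads at offsets 0, 4 and 8 collapse into one group of three, while three byte loads at offsets 12, 13 and 14 stay separate, which matches the behaviour being asked about for 8-bit data.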

Thanks!
Tony
_______________________________________________
Beignet mailing list
Beignet at lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet