Thanks for the response. Glad to hear there's a possibility for better performance! If you have similar data on Baytrail/Ivybridge, please share it with the list so we can compare against our results.

tony

On Wed Nov 26 2014 at 7:41:21 AM Zhigang Gong <zhigang.gong@gmail.com> wrote:

The load combination optimization really isn't very helpful for most
uint vector widths. You can easily disable it and try vload_bench_uint
with a global buffer; only the uint3 vector should see any gain. (Open
the file llvm_to_gen.cpp and comment out
passes.add(createLoadStoreOptimizationsPass());)
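(Concretely, the change is just commenting out that one pass registration
in llvm_to_gen.cpp, roughly like this; the surrounding code is omitted:)

    // passes.add(createLoadStoreOptimizationsPass());
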
And if you are using Haswell, then it seems there is some DC
configuration issue on your platform. It should not be that much slower
than the constant buffer for uint2 loads: with uint2 loads the cache
locality is very good and almost all data comes from the L3 cache, so
the bandwidth should be much higher than 11.7GB/s. I tested it on my
Haswell machine, and uint2 performance with a global buffer is more
than 80GB/s.

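(For anyone reproducing this outside the benchmark suite, here is a rough
sketch of what a uint2 global-buffer load test looks like; the kernel and
argument names are made up, this is not the actual vload_bench_uint source:)

    /* Each work-item does one vload2 from a __global uint buffer at a
     * configurable offset, which is the access pattern being measured. */
    __kernel void vload2_global(__global const uint *src,
                                __global uint2 *dst,
                                uint offset)
    {
        size_t gid = get_global_id(0);
        dst[gid] = vload2(gid, src + offset);
    }
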
On Wed, Nov 26, 2014 at 9:30 PM, Tony Moore <tonywmnix@gmail.com> wrote:
> Hello,
> I'm actually using Haswell for these experiments. I modified the
> benchmark_run app to use read-only and constant memory, and the biggest
> improvement was with small vectors of uint size. I'm guessing the loss
> in performance with larger vectors is because they are not being
> combined. Some sizes actually did worse. Logs attached.
>
> vload_bench_uint()     constant | global
>
> Vector size 2:
>   Offset 0 :          58.2GB/S | 11.7GB/S
>   Offset 1 :          59.6GB/S | 10.7GB/S
> Vector size 3:
>   Offset 0 :          34.3GB/S |  7.6GB/S
>   Offset 1 :          34.3GB/S |  7.6GB/S
>   Offset 2 :          34.3GB/S |  7.8GB/S
> Vector size 4:
>   Offset 0 :          28.1GB/S | 12.6GB/S
>   Offset 1 :          28.1GB/S | 10.4GB/S
>   Offset 2 :          28.1GB/S | 10.3GB/S
>   Offset 3 :          28.1GB/S | 10.2GB/S
>
> On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong <zhigang.gong@linux.intel.com>
> wrote:
>>
>> I guess Tony is using BayTrail, so the constant (RO) cache
>> is just half of the standard IvyBridge's. Actually the cache's
>> influence depends heavily on the access pattern, not so much
>> on the memory size.
>>
>> If the accesses always have a big stride, performance will not
>> be good even when using less than the constant cache size,
>> because the cache-to-memory mapping is not 1:1: two constant
>> buffers may still conflict on the same cache bank and cause
>> cache line replacement.
>>
>> If the access locality is good, then even if you use a very
>> large amount of constant memory, the miss rate will be
>> relatively low and the performance will be good.
>>
>> Another related issue: according to the OpenCL spec,
>>
>>     CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
>>     cl_ulong    Max size in bytes of a constant buffer
>>                 allocation. The minimum value is 64 KB for devices
>>                 that are not of type CL_DEVICE_TYPE_CUSTOM.
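>>
>> (As a side note, an application can query this limit at runtime; a
>> minimal host-side sketch, with error checking omitted:)
>>
>>     #include <CL/cl.h>
>>     #include <stdio.h>
>>
>>     int main(void)
>>     {
>>         cl_platform_id platform;
>>         cl_device_id device;
>>         cl_ulong max_const = 0;
>>
>>         clGetPlatformIDs(1, &platform, NULL);
>>         clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
>>         clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
>>                         sizeof(max_const), &max_const, NULL);
>>         printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = %lu bytes\n",
>>                (unsigned long)max_const);
>>         return 0;
>>     }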
>>
>> So there is a limit on total constant buffer usage. Beignet
>> currently sets it to 512KB, but that is not a hard limit on the
>> Gen platform, so we may consider raising it to a higher threshold.
>> Do you have any suggestions?
>>
>> On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
>> > I am not an expert on cache-related things; basically, the constant
>> > cache is part of the read-only cache that lives in L3.
>> > From the code in src/intel/intel_gpgpu.c, the logic below is for
>> > IvyBridge:
>> >
>> >     if (slmMode)
>> >         allocate a 64KB constant cache
>> >     else
>> >         allocate a 32KB constant cache
>> >
>> > I am not sure whether there is any big performance difference between
>> > staying below or going above the real constant cache size in L3.
>> > I simply picked a random number, 512KB, as the upper limit in the
>> > driver API.
>> > But it would be worth investigating how performance changes with the
>> > amount of constant memory used.
>> > If we use much more constant data than the constant cache allocated
>> > from L3, I think it will definitely cause the constant cache to swap
>> > data in and out frequently. Right?
>> > If you would like to contribute any performance test to beignet, or any
>> > other open source test suite, it would be really appreciated!
>> >
>> > Thanks!
>> > Ruiling
>> > From: Beignet [mailto:beignet-bounces@lists.freedesktop.org] On Behalf
>> > Of Tony Moore
>> > Sent: Wednesday, November 26, 2014 6:45 AM
>> > To: beignet@lists.freedesktop.org
>> > Subject: Re: [Beignet] Combine Loads from __constant space
>> >
>> > Another question I had about __constant: there seems to be no limit.
>> > I'm using __constant for every read-only parameter now, totalling
>> > 1500KB, and this test now runs in 32ms. So, is there a limit? Is this
>> > method reliable? Can the driver do this implicitly on all read-only
>> > buffers?
>> > thanks
>> >
>> > On Tue Nov 25 2014 at 2:11:26 PM Tony Moore <tonywmnix@gmail.com> wrote:
>> > Hello,
>> > I notice that reads are not being combined when I use __constant on a
>> > read-only kernel buffer. Is this something that can be improved?
>> >
>> > In my kernel there are many loads from a read-only data structure. When
>> > I use the __global specifier for the memory space, I see a total of 33
>> > send instructions and a runtime of 81ms. When I use the __constant
>> > specifier, I see 43 send instructions and a runtime of 40ms. I'm hoping
>> > that combining the loads could improve performance further.
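>> >
>> > (To illustrate the kind of change I mean, a reduced hypothetical
>> > example; the real kernel and its buffers are different:)
>> >
>> >     /* Table in __global memory: loads go through the normal data path. */
>> >     __kernel void sum_global(__global const uint *lut, __global uint *dst)
>> >     {
>> >         size_t gid = get_global_id(0);
>> >         dst[gid] = lut[gid] + lut[gid + 1] + lut[gid + 7];
>> >     }
>> >
>> >     /* Same kernel with the table declared __constant, so the loads can
>> >      * be served by the read-only/constant cache. */
>> >     __kernel void sum_constant(__constant uint *lut, __global uint *dst)
>> >     {
>> >         size_t gid = get_global_id(0);
>> >         dst[gid] = lut[gid] + lut[gid + 1] + lut[gid + 7];
>> >     }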
>> >
>> > thanks!
>> > tony
>>