<div>Hello,</div><div>I'm actually using Haswell for these experiments. I modified the benchmark_run app to use read-only and constant memory, and the biggest improvement was with small uint vectors. I'm guessing the drop in performance with larger vectors is because the loads are not being combined. Some sizes did do worse. Logs attached.</div><div><br></div>
<div>vload_bench_uint(), constant vs. global:</div>
<table>
<tr><th>Vector size</th><th>Offset</th><th>constant (GB/s)</th><th>global (GB/s)</th></tr>
<tr><td>2</td><td>0</td><td>58.2</td><td>11.7</td></tr>
<tr><td>2</td><td>1</td><td>59.6</td><td>10.7</td></tr>
<tr><td>3</td><td>0</td><td>34.3</td><td>7.6</td></tr>
<tr><td>3</td><td>1</td><td>34.3</td><td>7.6</td></tr>
<tr><td>3</td><td>2</td><td>34.3</td><td>7.8</td></tr>
<tr><td>4</td><td>0</td><td>28.1</td><td>12.6</td></tr>
<tr><td>4</td><td>1</td><td>28.1</td><td>10.4</td></tr>
<tr><td>4</td><td>2</td><td>28.1</td><td>10.3</td></tr>
<tr><td>4</td><td>3</td><td>28.1</td><td>10.2</td></tr>
</table>
<div> </div><br><div class="gmail_quote">On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong &lt;<a href="mailto:zhigang.gong@linux.intel.com">zhigang.gong@linux.intel.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I guess Tony is using BayTrail, so the constant cache (RO)<br>
is just half of the standard IvyBridge's. Actually, the cache<br>
influence depends heavily on the access pattern and not<br>
so much on the memory size.<br>
<br>
If the accesses always have a big stride, then the performance<br>
will not be good even if you use less than the constant cache size,<br>
because the mapping between cache and memory is not 1:1. Accesses may still<br>
cause cache replacement if two constant buffers conflict<br>
on the same cache bank.<br>
<br>
If the access locality is good, then even if you use a<br>
very large amount of constant memory, the miss rate will<br>
be relatively low, and the performance will be good.<br>
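<br>
The two cases above (small footprint with conflicting accesses versus large footprint with good locality) can be illustrated with a toy model. This is only a sketch under loud assumptions: a direct-mapped cache with 64-byte lines and 32KB capacity. The real Gen L3 organization is not described in this thread, and every name below is hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES  64u             /* assumed cache line size */
#define CACHE_BYTES (32u * 1024u)   /* assumed capacity; 32KB is the non-SLM IvyBridge figure quoted in this thread */
#define NUM_SLOTS   (CACHE_BYTES / LINE_BYTES)

/* Toy direct-mapped cache: each slot remembers which memory line it holds. */
typedef struct {
    long   tag[NUM_SLOTS];  /* line number held by each slot, -1 when empty */
    size_t accesses;
    size_t misses;
} toy_cache;

static void cache_reset(toy_cache *c)
{
    for (size_t i = 0; i < NUM_SLOTS; i++)
        c->tag[i] = -1;
    c->accesses = 0;
    c->misses = 0;
}

/* Record one read of the byte at `addr`; a tag mismatch counts as a miss. */
static void cache_read(toy_cache *c, unsigned long addr)
{
    long line = (long)(addr / LINE_BYTES);
    size_t slot = (size_t)line % NUM_SLOTS;

    c->accesses++;
    if (c->tag[slot] != line) {
        c->tag[slot] = line;
        c->misses++;
    }
}

static double cache_miss_rate(const toy_cache *c)
{
    assert(c->accesses > 0);
    return (double)c->misses / (double)c->accesses;
}

/* Small footprint, bad pattern: two 1KB buffers placed exactly
 * CACHE_BYTES apart and read alternately, so they fight over the
 * same slots although only 2KB of data is touched. */
static double conflicting_buffers_miss_rate(void)
{
    toy_cache c;
    cache_reset(&c);
    for (unsigned long off = 0; off < 1024; off += 4) {
        cache_read(&c, off);
        cache_read(&c, CACHE_BYTES + off);
    }
    return cache_miss_rate(&c);
}

/* Large footprint, good locality: stream 1MB sequentially in 4-byte
 * reads, so each 64-byte line misses once and then serves 15 hits. */
static double sequential_stream_miss_rate(void)
{
    toy_cache c;
    cache_reset(&c);
    for (unsigned long addr = 0; addr < 1024ul * 1024ul; addr += 4)
        cache_read(&c, addr);
    return cache_miss_rate(&c);
}
```

In this model the alternating 2KB pattern misses on every read, while the 1MB stream misses on only one read in sixteen, even though it is 32 times larger than the modelled cache. Real hardware with associativity and banking will behave less extremely, but the direction of the effect is the same.<br>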
<br>
Another related issue is, according to OpenCL spec:<br>
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE<br>
cl_ulong Max size in bytes of a constant buffer<br>
allocation. The minimum value is 64 KB for devices<br>
that are not of type CL_DEVICE_TYPE_CUSTOM.<br>
<br>
So there is a limit on the total constant buffer usage.<br>
Beignet currently sets it to 512KB, but this is not a hard limit<br>
on the Gen platform. We may consider raising it to a higher threshold.<br>
Do you have any suggestions?<br>
<br>
On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:<br>
> I am not an expert on cache-related things; basically, the constant cache is part of the read-only cache that lies in L3.<br>
> From the code in src/intel/intel_gpgpu.c, the logic below is for IvyBridge:<br>
> if (slmMode)<br>
> &nbsp;&nbsp;allocate 64KB constant cache<br>
> else<br>
> &nbsp;&nbsp;allocate 32KB constant cache<br>
><br>
> I am not sure whether there is any big performance difference between using less or more than the real constant cache size in L3.<br>
> I simply picked an arbitrary number, 512KB, as the upper limit in the driver API.<br>
> But it does deserve investigation how performance changes with the amount of constant memory used.<br>
> If we use much more constant data than the constant cache allocated from L3,<br>
> I think it will definitely cause constant cache data to be swapped in and out frequently. Right?<br>
> If you would like to contribute any performance test to beignet, or any other open source test suite, it would be really appreciated!<br>
><br>
> Thanks!<br>
> Ruiling<br>
> From: Beignet [mailto:<a href="mailto:beignet-bounces@lists.freedesktop.org" target="_blank">beignet-bounces@lists.freedesktop.org</a>] On Behalf Of Tony Moore<br>
> Sent: Wednesday, November 26, 2014 6:45 AM<br>
> To: <a href="mailto:beignet@lists.freedesktop.org" target="_blank">beignet@lists.freedesktop.org</a><br>
> Subject: Re: [Beignet] Combine Loads from __constant space<br>
><br>
> Another question I had about __constant: there seems to be no limit. I'm using __constant for every read-only parameter, now totalling 1500KB, and this test now runs in 32ms. So, is there a limit? Is this method reliable? Can the driver do this implicitly on all read-only buffers?<br>
> thanks<br>
><br>
> On Tue Nov 25 2014 at 2:11:26 PM Tony Moore &lt;<a href="mailto:tonywmnix@gmail.com" target="_blank">tonywmnix@gmail.com</a>&gt; wrote:<br>
> Hello,<br>
> I notice that reads are not being combined when I use __constant on a read-only kernel buffer. Is this something that can be improved?<br>
><br>
> In my kernel there are many loads from a read-only data structure. When I use the __global specifier for the memory space I see a total of 33 send instructions and a runtime of 81ms. When I use the __constant specifier, I see 43 send instructions and a runtime of 40ms. I'm hoping that combining the loads could improve performance further.<br>
><br>
> thanks!<br>
> tony<br>
<br>
<br>
</blockquote></div>