[Beignet] Combine Loads from __constant space

Zhigang Gong zhigang.gong at gmail.com
Wed Nov 26 06:41:21 PST 2014


The load-combination optimization is really not very helpful for most
uint vector widths.
You can easily disable this optimization and retry vload_bench_uint
with a global buffer;
it should only show a gain for the uint3 vector. (Open the file
llvm_to_gen.cpp and comment out the line
passes.add(createLoadStoreOptimizationsPass());)

And if you are using Haswell, then it seems there is some DC
configuration issue
on your platform: it should not be that much slower than the constant
buffer for the uint2 load.
With uint2 loads the cache locality is very good, and almost all data
comes from the L3 cache,
whose bandwidth should be much higher than 11.7GB/s. I tested it
on my Haswell
machine; uint2 performance with a global buffer is more than 80GB/s.

On Wed, Nov 26, 2014 at 9:30 PM, Tony Moore <tonywmnix at gmail.com> wrote:
> Hello,
> I'm actually using Haswell for these experiments. I modified the
> benchmark_run app to use read-only and constant memory, and the biggest
> improvement was with small vectors of uint size. I'm guessing the loss in
> performance with larger vectors is because the loads are not being combined.
> Some sizes did worse. Logs attached.
>
> constant                 | global
>
> vload_bench_uint()       | vload_bench_uint()
>   Vector size 2:         |   Vector size 2:
> Offset 0 : 58.2GB/S      | Offset 0 : 11.7GB/S
> Offset 1 : 59.6GB/S      | Offset 1 : 10.7GB/S
>   Vector size 3:         |   Vector size 3:
> Offset 0 : 34.3GB/S      | Offset 0 : 7.6GB/S
> Offset 1 : 34.3GB/S      | Offset 1 : 7.6GB/S
> Offset 2 : 34.3GB/S      | Offset 2 : 7.8GB/S
>   Vector size 4:         |   Vector size 4:
> Offset 0 : 28.1GB/S      | Offset 0 : 12.6GB/S
> Offset 1 : 28.1GB/S      | Offset 1 : 10.4GB/S
> Offset 2 : 28.1GB/S      | Offset 2 : 10.3GB/S
> Offset 3 : 28.1GB/S      | Offset 3 : 10.2GB/S
>
>
> On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong <zhigang.gong at linux.intel.com>
> wrote:
>>
>> I guess Tony is using BayTrail, so the constant (read-only)
>> cache is just half of the standard IvyBridge's. Actually the
>> cache's influence depends heavily on the access pattern, not
>> on the memory size.
>>
>> If the access pattern always has a big stride, performance
>> will not be good even when using less than the constant cache
>> size, because the cache-to-memory mapping is not 1:1: two
>> constant buffers that conflict on the same cache set can
>> still cause cache-line replacement.
>>
>> If the access locality is good, then even a very large
>> constant working set will have a relatively low miss rate,
>> and performance will be good.
>>
>> Another related issue is, according to OpenCL spec:
>> CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
>>   cl_ulong Max size in bytes of a constant buffer
>>   allocation. The minimum value is 64 KB for devices
>>   that are not of type CL_DEVICE_TYPE_CUSTOM.
>>
>> So there is a limit on the total constant buffer usage. Beignet
>> currently sets it to 512KB, but this is not a hard limit on the
>> Gen platform, so we may consider raising the threshold.
>> Do you have any suggestions?
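For reference, an application can query this per-device limit at runtime with
clGetDeviceInfo. A minimal host-side sketch (requires an OpenCL runtime and
assumes the first platform exposes a GPU device; error checking omitted):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong max_const = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    /* Per-allocation constant buffer limit; the spec guarantees >= 64KB
     * for non-CUSTOM devices. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const), &max_const, NULL);
    printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: %llu bytes\n",
           (unsigned long long)max_const);
    return 0;
}
```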
>>
>> On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
>> > I am not an expert on cache-related things; basically the constant cache
>> > is part of the read-only cache that lies in L3.
>> > From the code in src/intel/intel_gpgpu.c, the logic for IvyBridge is:
>> >
>> >   if (slmMode)
>> >     allocate 64KB constant cache
>> >   else
>> >     allocate 32KB constant cache
>> >
>> > I am not sure whether there is a big performance difference between
>> > using less than or more than the real constant cache size in L3.
>> > I simply wrote a randomly selected number, 512KB, as the upper limit in
>> > the driver API.
>> > But it does deserve investigation how performance changes with the
>> > amount of constant memory used.
>> > If we use more constant data than the constant cache allocated from L3,
>> > I think it will definitely cause the constant cache to swap data in
>> > and out frequently. Right?
>> > If you would like to contribute any performance test to beignet, or any
>> > other open source test suite, it would be really appreciated!
>> >
>> > Thanks!
>> > Ruiling
>> > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf
>> > Of Tony Moore
>> > Sent: Wednesday, November 26, 2014 6:45 AM
>> > To: beignet at lists.freedesktop.org
>> > Subject: Re: [Beignet] Combine Loads from __constant space
>> >
>> > Another question I had about __constant: there seems to be no limit.
>> > I'm now using __constant for every read-only parameter, totalling 1500KB,
>> > and this test now runs in 32ms. So, is there a limit? Is this method
>> > reliable? Could the driver do this implicitly for all read-only buffers?
>> > thanks
>> >
>> > On Tue Nov 25 2014 at 2:11:26 PM Tony Moore
>> > <tonywmnix at gmail.com> wrote:
>> > Hello,
>> > I notice that reads are not being combined when I use __constant on a
>> > read-only kernel buffer. Is this something that can be improved?
>> >
>> > In my kernel there are many loads from a read-only data structure. When
>> > I use the __global specifier for the memory space I see a total of 33 send
>> > instructions and a runtime of 81ms. When I use the __constant specifier, I
>> > see 43 send instructions and a runtime of 40ms. I'm hoping that combining
>> > the loads could improve performance further.
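To make the question concrete: the only source-level difference under
discussion is the address-space qualifier on the kernel argument. A minimal
hypothetical sketch in OpenCL C (the kernel name, table size, and access
pattern are illustrative, not Tony's actual code):

```c
/* Reads served through the general memory path. */
__kernel void lookup_global(__global const uint *table, __global uint *out) {
    size_t i = get_global_id(0);
    out[i] = table[i % 256];
}

/* Same reads, but routed through the read-only (constant) cache. */
__kernel void lookup_constant(__constant uint *table, __global uint *out) {
    size_t i = get_global_id(0);
    out[i] = table[i % 256];
}
```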
>> >
>> > thanks!
>> > tony
>>
>> > _______________________________________________
>> > Beignet mailing list
>> > Beignet at lists.freedesktop.org
>> > http://lists.freedesktop.org/mailman/listinfo/beignet
>>
>
>