[Beignet] Combine Loads from __constant space

Tony Moore tonywmnix at gmail.com
Wed Nov 26 07:08:55 PST 2014


Thanks for the response. Glad to hear there's a possibility for better
performance! If you have similar data on Baytrail/Ivybridge please share
with the list so we can compare against our results.
tony

On Wed Nov 26 2014 at 7:41:21 AM Zhigang Gong <zhigang.gong at gmail.com>
wrote:

> The load combination optimization is really not very helpful for most
> uint vectors. You can easily disable this optimization and try
> vload_bench_uint with a global buffer; it should only show a gain for
> the uint3 vector. (Open llvm_to_gen.cpp and comment out the line
> passes.add(createLoadStoreOptimizationsPass());)
>
> And if you are using Haswell, then it seems there are some DC
> configuration issues on your platform. It should not be so much slower
> than the constant buffer for the uint2 load. On the uint2 load the
> cache locality is very good and almost all data comes from the L3
> cache, which should be much faster than 11.7GB/s. I tested it on my
> Haswell machine; uint2 performance with a global buffer is more than
> 80GB/s.
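As a sanity check on figures like these, the GB/s numbers a vload benchmark reports are just bytes moved over elapsed time; a minimal sketch (function name and example figures are hypothetical, not taken from the benchmark source):

```python
def bandwidth_gbps(num_loads, vec_width, elem_bytes, seconds):
    """Effective read bandwidth in GB/s: total bytes read / elapsed time."""
    total_bytes = num_loads * vec_width * elem_bytes
    return total_bytes / seconds / 1e9

# e.g. 100M uint2 loads (2 * 4 bytes each) completing in 10 ms is ~80 GB/s
print(bandwidth_gbps(100_000_000, 2, 4, 0.01))
```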
>
> On Wed, Nov 26, 2014 at 9:30 PM, Tony Moore <tonywmnix at gmail.com> wrote:
> > Hello,
> > I'm actually using Haswell for these experiments. I modified the
> > benchmark_run app to use read-only and constant memory, and the biggest
> > improvement was with small uint vectors. I'm guessing the loss in
> > performance with larger vectors is because their loads are not being
> > combined. Some sizes did worse. Logs attached.
> >
> > constant                 | global
> >
> > vload_bench_uint()
> >   Vector size 2:
> >     Offset 0 : 58.2GB/S  | 11.7GB/S
> >     Offset 1 : 59.6GB/S  | 10.7GB/S
> >   Vector size 3:
> >     Offset 0 : 34.3GB/S  |  7.6GB/S
> >     Offset 1 : 34.3GB/S  |  7.6GB/S
> >     Offset 2 : 34.3GB/S  |  7.8GB/S
> >   Vector size 4:
> >     Offset 0 : 28.1GB/S  | 12.6GB/S
> >     Offset 1 : 28.1GB/S  | 10.4GB/S
> >     Offset 2 : 28.1GB/S  | 10.3GB/S
> >     Offset 3 : 28.1GB/S  | 10.2GB/S
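Reading the offset-0 rows of the table above as ratios makes the gap concrete (numbers copied from the measurements; this is just arithmetic, not part of the benchmark):

```python
# constant vs global bandwidth (GB/s) for offset 0, from the table above
rows = {"uint2": (58.2, 11.7), "uint3": (34.3, 7.6), "uint4": (28.1, 12.6)}
speedups = {vec: const_bw / glob_bw for vec, (const_bw, glob_bw) in rows.items()}
for vec, s in speedups.items():
    print(f"{vec}: __constant is {s:.1f}x faster than __global")
```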
> >
> >
> > On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong
> > <zhigang.gong at linux.intel.com> wrote:
> >>
> >> I guess Tony is using BayTrail, so the constant (RO) cache
> >> is just half of the standard IvyBridge's. Actually the cache
> >> influence is highly related to the access pattern and not
> >> so much to the memory size.
> >>
> >> If the access always has a big stride then the performance
> >> will not be good even when using less than the constant cache
> >> size, as the cache-to-memory mapping is not 1:1. Two constant
> >> buffers may still conflict on the same cache bank and cause
> >> cache replacement.
> >>
> >> If the access locality is good, then even if you use a
> >> very large amount of constant cache, the miss rate will
> >> be relatively low, and the performance will be good.
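The stride point can be illustrated with a toy direct-mapped cache model (a sketch only; the set count and line size are illustrative, not the real Gen L3 geometry): two buffers whose lines map to the same set evict each other even though the total footprint is far below the cache size, while sequential access misses only once per line.

```python
def miss_rate(addresses, num_sets, line_bytes=64):
    """Fraction of accesses that miss in a toy direct-mapped cache."""
    cache = {}  # set index -> cached tag
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        set_idx, tag = line % num_sets, line // num_sets
        if cache.get(set_idx) != tag:  # miss: fill the set with this line
            cache[set_idx] = tag
            misses += 1
    return misses / len(addresses)

SETS = 512  # 512 sets * 64B lines = 32KB, like the smaller constant cache
# Sequential 4-byte reads over 64KB: one miss per 64B line (rate 1/16)
sequential = [i * 4 for i in range(16384)]
# Alternating between two addresses exactly 32KB apart: same set, different
# tags, so every access evicts the other line despite a 128-byte footprint
conflicting = [(i % 2) * SETS * 64 for i in range(16384)]
print(miss_rate(sequential, SETS))   # 0.0625
print(miss_rate(conflicting, SETS))  # 1.0
```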
> >>
> >> Another related issue is, according to OpenCL spec:
> >> CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
> >>   cl_ulong Max size in bytes of a constant buffer
> >>   allocation. The minimum value is 64 KB for devices
> >>   that are not of type CL_DEVICE_TYPE_CUSTOM.
> >>
> >> So there is a limit on total constant buffer usage. Beignet
> >> currently sets it to 512KB, but this is not a hard limit on
> >> the Gen platform, so we may consider raising it to a higher
> >> threshold. Do you have any suggestions?
> >>
> >> On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
> >> > I am not an expert on cache-related things; basically the constant
> >> > cache is part of the read-only cache that lies in L3.
> >> > From the code in src/intel/intel_gpgpu.c, the logic below is for
> >> > IvyBridge:
> >> >
> >> > if (slmMode)
> >> >     allocate 64KB constant cache
> >> > else
> >> >     allocate 32KB constant cache
> >> >
> >> > I am not sure whether there is any big performance difference between
> >> > using less than or more than the real constant cache size in L3.
> >> > I simply wrote a randomly selected number, 512KB, as the upper limit
> >> > in the driver API.
> >> > But it does deserve investigation how performance changes with the
> >> > amount of constant memory used.
> >> > If we use more constant data than the constant cache allocated from
> >> > L3, I think it will definitely cause the constant cache to swap data
> >> > in and out frequently. Right?
> >> > If you would like to contribute any performance test to Beignet, or
> >> > any other open source test suite, it would be really appreciated!
> >> >
> >> > Thanks!
> >> > Ruiling
> >> > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org]
> >> > On Behalf Of Tony Moore
> >> > Sent: Wednesday, November 26, 2014 6:45 AM
> >> > To: beignet at lists.freedesktop.org
> >> > Subject: Re: [Beignet] Combine Loads from __constant space
> >> >
> >> > Another question I had about __constant: there seems to be no limit.
> >> > I'm using __constant for every read-only parameter now, totalling
> >> > 1500KB, and this test now runs in 32ms. So, is there a limit? Is this
> >> > method reliable? Can the driver do this implicitly on all read-only
> >> > buffers?
> >> > thanks
> >> >
> >> > On Tue Nov 25 2014 at 2:11:26 PM Tony Moore
> >> > <tonywmnix at gmail.com<mailto:tonywmnix at gmail.com>> wrote:
> >> > Hello,
> >> > I notice that loads are not being combined when I use __constant on
> >> > a read-only kernel buffer. Is this something that can be improved?
> >> >
> >> > In my kernel there are many loads from a read-only data structure.
> >> > When I use the __global specifier for the memory space I see a total
> >> > of 33 send instructions and a runtime of 81ms. When I use the
> >> > __constant specifier, I see 43 send instructions and a runtime of
> >> > 40ms. I'm hoping that combining the loads could improve performance
> >> > further.
> >> >
> >> > thanks!
> >> > tony
> >>
> >> > _______________________________________________
> >> > Beignet mailing list
> >> > Beignet at lists.freedesktop.org
> >> > http://lists.freedesktop.org/mailman/listinfo/beignet
> >>
> >
> >
>