[Beignet] Combine Loads from __constant space

Wed Nov 26 05:30:16 PST 2014

Hello,
I'm actually using Haswell for these experiments. I modified the
benchmark_run app to use read-only and constant memory, and the biggest
improvement was with small uint vectors. I'm guessing performance drops
off with larger vectors because their loads are not being combined.
Some sizes did do worse. Logs are attached.

vload_bench_uint()   | constant   | global
---------------------+------------+-----------
Vector size 2:
  Offset 0           | 58.2 GB/s  | 11.7 GB/s
  Offset 1           | 59.6 GB/s  | 10.7 GB/s
Vector size 3:
  Offset 0           | 34.3 GB/s  |  7.6 GB/s
  Offset 1           | 34.3 GB/s  |  7.6 GB/s
  Offset 2           | 34.3 GB/s  |  7.8 GB/s
Vector size 4:
  Offset 0           | 28.1 GB/s  | 12.6 GB/s
  Offset 1           | 28.1 GB/s  | 10.4 GB/s
  Offset 2           | 28.1 GB/s  | 10.3 GB/s
  Offset 3           | 28.1 GB/s  | 10.2 GB/s
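
To make the "biggest improvement with small vectors" observation concrete, here is a minimal sketch of the ratio between the two columns. The helper name `speedup` is hypothetical and only does the division; the inputs are the offset 0 figures from the table above.

```c
/* Hypothetical helper: ratio of __constant bandwidth to __global bandwidth.
 * Returns 0.0 when the global figure is not positive, so it never divides
 * by zero. Both arguments are in GB/s. */
static double speedup(double constant_gb_per_s, double global_gb_per_s)
{
    return global_gb_per_s > 0.0 ? constant_gb_per_s / global_gb_per_s : 0.0;
}
```

With the offset 0 numbers, `speedup(58.2, 11.7)`, `speedup(34.3, 7.6)` and `speedup(28.1, 12.6)` come out at roughly 5.0x, 4.5x and 2.2x for vector sizes 2, 3 and 4.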

On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong <zhigang.gong at linux.intel.com>
wrote:

> I guess Tony is using BayTrail, so the constant cache (RO)
> is just half the size of standard IvyBridge's. Actually the
> influence of the cache depends heavily on the access pattern
> and much less on the memory size.
>
> If the access always has a big stride, then performance
> will not be good even if you use less than the constant cache size,
> as the mapping between cache and memory is not 1:1. It may still
> cause cache replacement if two constant buffers conflict
> on the same cache bank.
>
> If the access locality is good, then even if you use a
> very large amount of constant memory, the miss rate will
> be relatively low, and the performance will be good.
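
The two cases above can be illustrated with a toy model. This is only a sketch under loud assumptions: a direct-mapped cache with 64-byte lines and 32 KB capacity, which is not the real Gen L3 organization, and a hypothetical function name `count_misses`. It counts misses for a run of 4-byte reads at a fixed stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative parameters only, NOT the real Gen L3 layout. */
#define LINE_BYTES 64u
#define NUM_LINES  512u   /* 512 lines * 64 bytes = 32 KB */

/* Count misses in a direct-mapped cache for `count` reads starting at
 * byte address 0 and advancing by `stride_bytes`, repeated `passes` times.
 * The cache starts empty on every call. */
static unsigned count_misses(size_t count, size_t stride_bytes, unsigned passes)
{
    static uint64_t tags[NUM_LINES];
    static unsigned char valid[NUM_LINES];
    unsigned misses = 0;

    for (size_t i = 0; i < NUM_LINES; i++)
        valid[i] = 0;

    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < count; i++) {
            uint64_t addr = (uint64_t)i * stride_bytes;
            uint64_t line = addr / LINE_BYTES;        /* which memory line */
            size_t slot = (size_t)(line % NUM_LINES); /* where it must live */

            if (!valid[slot] || tags[slot] != line) {
                misses++;            /* empty slot or conflicting line */
                valid[slot] = 1;
                tags[slot] = line;
            }
        }
    }
    return misses;
}
```

In this model, eight reads spaced 32 KB apart all land in the same slot, so two passes give `count_misses(8, 32768, 2) == 16`: every read misses even though only eight lines are touched. A sequential walk over 1 MB, `count_misses(262144, 4, 1)`, misses once per line (16384 misses out of 262144 reads), so the miss rate stays low despite a footprint far larger than the modeled cache.
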
>
> Another related issue is, according to the OpenCL spec:
> CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
>   cl_ulong Max size in bytes of a constant buffer
>   allocation. The minimum value is 64 KB for devices
>   that are not of type CL_DEVICE_TYPE_CUSTOM.
>
> So there is a limit on the total constant buffer usage.
> Beignet currently sets it to 512KB, but this is not a hard limit
> on the Gen platform. We may consider raising it to a higher threshold.
> Do you have any suggestions?
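
A minimal sketch of the budget check described above, assuming (as this message does) that the limit applies to the sum of all __constant arguments of a kernel. The function name `constant_args_fit` is hypothetical and is not part of Beignet or OpenCL; on a real device the limit would be read with clGetDeviceInfo and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the sizes of all __constant arguments
 * add up to at most limit_bytes, 0 otherwise. */
static int constant_args_fit(const size_t *sizes, size_t n, size_t limit_bytes)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        /* total never exceeds limit_bytes here, so the subtraction is safe
         * and the comparison cannot overflow. */
        if (sizes[i] > limit_bytes - total)
            return 0;
        total += sizes[i];
    }
    return 1;
}
```

With the 512KB value mentioned above, two 256KB buffers fit, while the 1500KB total reported later in this thread does not.
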
>
> On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
> > I am not an expert on cache-related things; basically, the constant
> > cache is part of the read-only cache that lies in L3.
> > From the code in src/intel/intel_gpgpu.c, the logic below is for IvyBridge:
> >     if (slmMode)
> >         allocate 64KB constant cache
> >     else
> >         allocate 32KB constant cache
> >
> > I am not sure whether there is any big performance difference between
> > using less than or more than the real constant cache size in L3.
> > I simply picked an arbitrary number, 512KB, as the upper limit in the
> > driver API.
> > But it does deserve investigation of how performance changes with the
> > amount of constant memory used.
> > If we use much more constant memory than the constant cache allocated
> > from L3, I think it will definitely cause constant cache data to be
> > swapped in and out frequently. Right?
> > If you would like to contribute any performance test to beignet, or any
> > other open source test suite, it would be really appreciated!
> >
> > Thanks!
> > Ruiling
> > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of Tony Moore
> > Sent: Wednesday, November 26, 2014 6:45 AM
> > To: beignet at lists.freedesktop.org
> > Subject: Re: [Beignet] Combine Loads from __constant space
> >
> > Another question I had about __constant: there seems to be no limit.
> > I'm using __constant for every read-only parameter, now totalling 1500KB,
> > and this test now runs in 32ms. So, is there a limit? Is this method
> > reliable? Can the driver do this implicitly on all read-only buffers?
> > thanks
> >
> > On Tue Nov 25 2014 at 2:11:26 PM Tony Moore <tonywmnix at gmail.com> wrote:
> > Hello,
> > I notice that reads are not being combined when I use __constant on a
> > read-only kernel buffer. Is this something that can be improved?
> >
> > In my kernel there are many loads from a read-only data structure. When
> > I use the __global specifier for the memory space I see a total of 33 send
> > instructions and a runtime of 81ms. When I use the __constant specifier, I
> > see 43 send instructions and a runtime of 40ms. I'm hoping that combining
> > the loads could improve performance further.
> >
> > thanks!
> > tony
>
> > _______________________________________________
> > Beignet mailing list
> > Beignet at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmark_run.log
Type: application/octet-stream
Size: 3910 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/beignet/attachments/20141126/13d5a8a6/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmark_run_const.log
Type: application/octet-stream
Size: 3907 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/beignet/attachments/20141126/13d5a8a6/attachment-0001.obj>