[Beignet] Combine Loads from __constant space

Tue Nov 25 19:51:57 PST 2014

I guess Tony is using BayTrail, so the constant cache (RO)
is just half the size of the one on standard IvyBridge.
Actually, the cache's influence depends much more on the
access pattern than on the amount of memory used.

If the accesses always have a big stride, the performance
will not be good even if you use less than the constant cache
size, because the mapping between cache and memory is not 1:1.
Cache replacement can still happen if two constant buffers
conflict on the same cache bank.
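
To illustrate the stride point, here is a minimal sketch in C of a toy
direct-mapped cache. The model, the function name count_misses and the
512 x 64-byte geometry are my own assumptions for illustration; the
real Gen L3 organisation is different and is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: a direct-mapped cache with num_lines lines of
 * line_bytes bytes each. This is NOT the real Gen L3 layout. */
#define MAX_LINES 1024

/* Replay `count` reads at byte offsets 0, stride, 2*stride, ... for
 * `passes` passes and return how many of them miss in the toy cache. */
size_t count_misses(size_t num_lines, size_t line_bytes,
                    size_t stride_bytes, size_t count, size_t passes)
{
    uint64_t tags[MAX_LINES];
    int valid[MAX_LINES] = {0};
    size_t misses = 0;

    assert(num_lines > 0 && num_lines <= MAX_LINES);
    assert(line_bytes > 0);

    for (size_t p = 0; p < passes; ++p) {
        for (size_t i = 0; i < count; ++i) {
            uint64_t addr = (uint64_t)i * stride_bytes;
            uint64_t line = addr / line_bytes;          /* memory line index */
            size_t   slot = (size_t)(line % num_lines); /* cache slot it maps to */
            uint64_t tag  = line / num_lines;

            if (!valid[slot] || tags[slot] != tag) {
                ++misses;            /* slot empty or holds another line */
                valid[slot] = 1;
                tags[slot] = tag;
            }
        }
    }
    return misses;
}
```

With a 32KB toy cache (512 lines of 64 bytes), count_misses(512, 64,
32768, 8, 2) reports 16 misses out of 16 reads, because all eight
addresses map to the same slot and keep evicting each other, although
only 512 bytes of data are touched. The same eight reads with a
64-byte stride, count_misses(512, 64, 64, 8, 2), miss only 8 times
(the first pass) and hit on the second pass.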

If the access locality is good, then even if you use a
very large amount of constant memory, the miss rate will
be relatively low, and the performance will be good.
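
A rough sketch of the good-locality case, again with a toy
direct-mapped cache in C. The model, the names sim_result and
sequential_scan, and the 64-byte line size are assumptions made only
for this illustration, not a description of the real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model, not the real Gen L3 organisation. */
#define SIM_MAX_LINES 1024

struct sim_result {
    size_t accesses;  /* total element reads issued */
    size_t misses;    /* reads that were not in the toy cache */
};

/* Read a buffer of buffer_bytes front to back, elem_bytes at a time,
 * through a direct-mapped cache of num_lines * line_bytes bytes. */
struct sim_result sequential_scan(size_t num_lines, size_t line_bytes,
                                  size_t buffer_bytes, size_t elem_bytes)
{
    uint64_t tags[SIM_MAX_LINES];
    int valid[SIM_MAX_LINES] = {0};
    struct sim_result r = {0, 0};

    assert(num_lines > 0 && num_lines <= SIM_MAX_LINES);
    assert(line_bytes > 0 && elem_bytes > 0);

    for (uint64_t addr = 0; addr + elem_bytes <= buffer_bytes;
         addr += elem_bytes) {
        uint64_t line = addr / line_bytes;
        size_t   slot = (size_t)(line % num_lines);
        uint64_t tag  = line / num_lines;

        ++r.accesses;
        if (!valid[slot] || tags[slot] != tag) {
            ++r.misses;          /* first touch of this memory line */
            valid[slot] = 1;
            tags[slot] = tag;
        }
    }
    return r;
}
```

For a 1500KB buffer read as 4-byte elements through a 32KB toy cache,
sequential_scan(512, 64, 1500 * 1024, 4) gives 384000 accesses and
24000 misses, i.e. one miss per 64-byte line (6.25%), even though the
buffer is far larger than the cache.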

Another related issue: according to the OpenCL spec,
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
  cl_ulong Max size in bytes of a constant buffer
  allocation. The minimum value is 64 KB for devices
  that are not of type CL_DEVICE_TYPE_CUSTOM.

So there is a limit on the total constant buffer usage.
Beignet currently sets it to 512KB, but this is not a hard limit
on the Gen platform. We may consider raising it to a higher threshold.
Do you have any suggestions?
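
For reference, an application could check its own usage against the
reported limit along these lines. This is only a sketch in plain C:
the helper name constant_usage_fits is made up, and it follows the
reading above that the limit applies to the total usage. The limit
value itself would come from clGetDeviceInfo() with
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, which is not called here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the buffers in sizes[0..n-1] fit
 * within limit_bytes in total, 0 otherwise. limit_bytes is assumed to be
 * the value the device reports for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. */
int constant_usage_fits(const uint64_t *sizes, size_t n,
                        uint64_t limit_bytes)
{
    uint64_t total = 0;

    assert(sizes != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        /* total never exceeds limit_bytes here, so the subtraction
         * cannot wrap and the check is overflow-safe. */
        if (sizes[i] > limit_bytes - total)
            return 0;
        total += sizes[i];
    }
    return 1;
}
```

With the current 512KB value, a single 1500KB buffer does not fit,
while four 64KB buffers do.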

On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
> I am not an expert on cache-related things; basically, the constant cache is part of the read-only cache, which lies in L3.
> From the code in src/intel/intel_gpgpu.c, the logic below is for IvyBridge:
> if (slmMode)
>     allocate 64KB constant cache
> else
>     allocate 32KB constant cache
> 
> I am not sure whether there is any big performance difference between using less than and more than the real constant cache size in L3.
> I simply picked an arbitrary number, 512KB, as the upper limit in the driver API.
> But it does deserve investigation how the performance changes with the amount of constant memory used.
> If we use much more constant data than the constant cache allocated from L3,
> I think it will definitely cause constant cache data to be swapped in and out frequently. Right?
> If you would like to contribute any performance test to beignet, or any other open source test suite, it would be really appreciated!
> 
> Thanks!
> Ruiling
> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of Tony Moore
> Sent: Wednesday, November 26, 2014 6:45 AM
> To: beignet at lists.freedesktop.org
> Subject: Re: [Beignet] Combine Loads from __constant space
> 
> Another question I had about __constant was that there seems to be no limit. I'm using __constant for every read-only parameter, now totalling 1500KB, and this test now runs in 32ms. So, is there a limit? Is this method reliable? Can the driver do this implicitly on all read-only buffers?
> thanks
> 
> On Tue Nov 25 2014 at 2:11:26 PM Tony Moore <tonywmnix at gmail.com> wrote:
> Hello,
> I notice that reads are not being combined when I use __constant on a read-only kernel buffer. Is this something that can be improved?
> 
> In my kernel there are many loads from a read-only data structure. When I use the __global specifier for the memory space I see a total of 33 send instructions and a runtime of 81ms. When I use the __constant specifier, I see 43 send instructions and a runtime of 40ms. I'm hoping that combining the loads could improve performance further.
> 
> thanks!
> tony

> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet