[Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

Mon Oct 30 06:44:42 UTC 2017

On Sun, Oct 29, 2017 at 11:17:11PM +0100, Chema Casanova wrote:
> On 29/10/17 19:55, Pohjolainen, Topi wrote:
> > On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
> >> We enable the use of 16-bit values in push constants
> >> modifying the assign_constant_locations function to work
> >> with 16-bit types.
> >>
> >> The API to access buffers in Vulkan use multiples of 4-byte for
> >> offsets and sizes. Current accountability of uniforms based on 4-byte
> >> slots will work for 16-bit values if they are allowed to use 32-bit
> >> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
> >> 2-byte elements will use 1 slot instead of 0.
> >>
> >> We aligns the 16-bit locations after assigning the 32-bit
> >> ones.
> >> ---
> >>  src/intel/compiler/brw_fs.cpp | 30 +++++++++++++++++++++++-------
> >>  1 file changed, 23 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> >> index a1d49a63be..8da16145dc 100644
> >> --- a/src/intel/compiler/brw_fs.cpp
> >> +++ b/src/intel/compiler/brw_fs.cpp
> >> @@ -1909,8 +1909,9 @@ set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
> >>     if (!contiguous) {
> >>        /* If bitsize doesn't match the target one, skip it */
> >>        if (*max_chunk_bitsize != target_bitsize) {
> >> -         /* FIXME: right now we only support 32 and 64-bit accesses */
> >> -         assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
> >> +         assert(*max_chunk_bitsize == 4 ||
> >> +                *max_chunk_bitsize == 8 ||
> >> +                *max_chunk_bitsize == 2);
> >>           *max_chunk_bitsize = 0;
> >>           *chunk_start = -1;
> >>           return;
> >> @@ -1987,8 +1988,9 @@ fs_visitor::assign_constant_locations()
> >>           int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
> > 
> > Did you test this with, for example, vec4?
> 
> CTS has 16bit scalar, vec2 (uint,sint), vec4 (float) and matrix tests
> for push constants for compute and graphics pipelines. For vec4 you can try:
> 
> dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant_16_to_32.vector_float
> 
> For push constant tests in general there are 42 tests, but vec3 aren't
> tested:
> 
> dEQP-VK.*16bit_storage.*push_constant.
> 
> 
> > I've been toying with a glsl
> > lowering pass changing mediump floats into float16. I was curious to know how
> > much is needed as you have addressed most of the things from NIR onwards.
> > Here I'm seeing offsets 0,2,4,6 which result into 0,0,1,1 when divided by
> > four. Don't we need something of this sort in addition?
> 
> If i remember correctly, tests were testing to use push constants with
> 64 16bit values, to use the minimum spec maximum available as
> max_push_constants_size that is 128 bytes. So at the end the generated
> intrinsic was:
> 
> vec4 16 ssa_4 = intrinsic load_uniform (ssa_3) () (0, 128) /* base=0 */
> /* range=128 */
> 
> As the calculus here is to calculate the number of location used, and
> taking into account that the Vulkan API restrictions for push constants
> that says that push constant ranges that say that offset must be
> multiple of 4 and size must be multiple of 4, maintain the use of
> 4-bytes slots was ok for supporting the feature. Our code changes just
> take the accountability in the number of 32-bits location needed, mainly
> changing the divisions by 4 using DIV_ROUND_UP( , 4) to calculate sizes.

I'm probably misunderstanding something. Let me ask a few clarifying questions.

I'm reading that the incoming 16-bit values are given in 32-bit slots, and for
the same reason we place them in the push/pull buffers in 32-bits slots. In
other words a vec4 would take 16-bytes and each component would 32-bits apart?

If that is the case, then don't we need to adjust the register offsets
somewhere the way I did in the fragment below? Otherwise the offsets will
point to locations in the register that are simply 16-bits apart?

> 
> > commit 1a6d2bf3302f6e4305e383da0f27712dc5c20a67
> > Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
> > Date:   Sun Oct 29 20:28:03 2017 +0200
> > 
> >     fix alignment of 16-bit uniforms on 32-bit slots
> > 
> > diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp
> > index 2f5443958a..586eb9d9ff 100644
> > --- a/src/intel/compiler/brw_fs_nir.cpp
> > +++ b/src/intel/compiler/brw_fs_nir.cpp
> > @@ -4007,7 +4007,10 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
> >           src.offset = const_offset->u32[0];
> >  
> >           for (unsigned j = 0; j < instr->num_components; j++) {
> > -            bld.MOV(offset(dest, bld, j), offset(src, bld, j));
> > +            const unsigned src_offset =
> > +              src.type == BRW_REGISTER_TYPE_HF ? 2 * j : j;
> > +
> > +            bld.MOV(offset(dest, bld, j), offset(src, bld, src_offset));
> > 
> > 
> > 
> > Then about the change of using 32-bit slots. This is now unconditional and
> > would require revisiting if we wanted to pack 16-bits tighter and possibly
> > increase the amount of uniforms that can be pushed. Similarly to Vulkan, in
> > GL the core stores uniforms as floats and I think we should keep it that way.
> 
> > I added support in the i965 backend to keep track of the types of the
> > uniforms and to convert 32-bit presentation to 16-bits on the fly in
> > gen6_constant_state.c::brw_param_value(). I don't like it that much but I had
> > to start from somewhere.
> 
> > My thinking is that we'd want to decouple the storage of the values and the
> > packing used in the compiler backend. Ideally keeping the mesa gl core and the
> > api working with full 32-bit floats but using tight 16-bit slots in the
> > push/pull constant buffers.
> > This requires quite a bit more changes as we have structured
> > param[]/pull_param[] to work with 32-bit slots.
> 
> At the beginning we though that we need to change to do that for VK
> 16-bit values on push constants, but in this case is client of the API
> the one that has control of how values are pushed. And it is the
> limitation related to offset multiples of 4 the one that simplifies the
> logic enabling the use of 32-bit slots.
> 
> But as you comment in the case of uniforms, it would make sense to have
> an accountability to a level of byte or two byte-slot. But I think that
> it will only improve the situation for 16-bit scalar uniforms. Current
> push model will use two 32-bit slots for two scalar 16-bit uniforms when
> we could use half of that space with a byte level accountability. But in
> the case of vec2 and vec4 current model is working fine. Also in the
> case of array-of 16-bit scalars the consequence is just losing half-slot
> in the case of a last odd element.

Such as GL, I'm seeing how Vulkan tells how the storage and manipulation of
push constants should be. But that doesn't dictate how driver backend
chooses to compile the shader and how to construct the push/pull buffers
given to the hardware.

In Anvil anv_cmd_buffer.c::anv_cmd_buffer_push_constants() allocates space
for the upload, consults the storage and can place the values in the
upload buffer independent of the storage. Such as I described for the
GL backend, see gen6_constant_state.c::brw_param_value().

Or am I missing something else?

> 
> > My current work can be found in:
> > 
> > git://people.freedesktop.org/~tpohjola/mesa 16_bit_gles
> 
> I've bisected the code because of a regression of the 16bit_storage
> push_constant test is caused at this commit.

Right, that may easily break things - all that I have is just enough to
run a few piglit tests in tests/spec/glsl-es-1.00/execution/ that I modified
to use uniforms/varyings/math and a couple of benchmarks. I'm trying to get
a general idea how much work there is on top of yours to support gles.

> 
> commit a8d43602174fa6062839c568690b06c6fab4f2d1 (HEAD, refs/bisect/bad)
> Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
> Date:   Thu Oct 26 20:16:42 2017 +0300
> 
>     i965: Add support for uploading 16-bit uniforms from 32-bit store
> 
>     At this point 16-bit uniforms still take full 32-bit slots in the
>     pull/push constant buffers and in shader deployment payload.
> 
>     Signed-off-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
> 
> 
> 
> You would need to include in your branch to test the 16bit_storage push
> constant tests the last commit of our series.
> 
> anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+

I haven't been running Vulkan with my tree - I only focused on gles shaders.
There are a few benchmarks which may benefit from 16-bits.