[Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

Mon Oct 30 17:06:16 UTC 2017

On Mon, Oct 30, 2017 at 05:10:53PM +0100, Chema Casanova wrote:
> El 30/10/17 a las 07:44, Pohjolainen, Topi escribió:
> > On Sun, Oct 29, 2017 at 11:17:11PM +0100, Chema Casanova wrote:
> >> On 29/10/17 19:55, Pohjolainen, Topi wrote:
> >>> On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
> >>>> We enable the use of 16-bit values in push constants
> >>>> modifying the assign_constant_locations function to work
> >>>> with 16-bit types.
> >>>>
> >>>> The API to access buffers in Vulkan use multiples of 4-byte for
> >>>> offsets and sizes. Current accountability of uniforms based on 4-byte
> >>>> slots will work for 16-bit values if they are allowed to use 32-bit
> >>>> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
> >>>> 2-byte elements will use 1 slot instead of 0.
> >>>>
> >>>> We aligns the 16-bit locations after assigning the 32-bit
> >>>> ones.
> >>>> ---
> >>>>  src/intel/compiler/brw_fs.cpp | 30 +++++++++++++++++++++++-------
> >>>>  1 file changed, 23 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> >>>> index a1d49a63be..8da16145dc 100644
> >>>> --- a/src/intel/compiler/brw_fs.cpp
> >>>> +++ b/src/intel/compiler/brw_fs.cpp
> >>>> @@ -1909,8 +1909,9 @@ set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
> >>>>     if (!contiguous) {
> >>>>        /* If bitsize doesn't match the target one, skip it */
> >>>>        if (*max_chunk_bitsize != target_bitsize) {
> >>>> -         /* FIXME: right now we only support 32 and 64-bit accesses */
> >>>> -         assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
> >>>> +         assert(*max_chunk_bitsize == 4 ||
> >>>> +                *max_chunk_bitsize == 8 ||
> >>>> +                *max_chunk_bitsize == 2);
> >>>>           *max_chunk_bitsize = 0;
> >>>>           *chunk_start = -1;
> >>>>           return;
> >>>> @@ -1987,8 +1988,9 @@ fs_visitor::assign_constant_locations()
> >>>>           int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
> >>> Did you test this with, for example, vec4?
> >> CTS has 16bit scalar, vec2 (uint,sint), vec4 (float) and matrix tests
> >> for push constants for compute and graphics pipelines. For vec4 you can try:
> >>
> >> dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant_16_to_32.vector_float
> >>
> >> For push constant tests in general there are 42 tests, but vec3 aren't
> >> tested:
> >>
> >> dEQP-VK.*16bit_storage.*push_constant.
> >>
> >>
> >>> I've been toying with a glsl
> >>> lowering pass changing mediump floats into float16. I was curious to know how
> >>> much is needed as you have addressed most of the things from NIR onwards.
> >>> Here I'm seeing offsets 0,2,4,6 which result into 0,0,1,1 when divided by
> >>> four. Don't we need something of this sort in addition?
> >> If i remember correctly, tests were testing to use push constants with
> >> 64 16bit values, to use the minimum spec maximum available as
> >> max_push_constants_size that is 128 bytes. So at the end the generated
> >> intrinsic was:
> >>
> >> vec4 16 ssa_4 = intrinsic load_uniform (ssa_3) () (0, 128) /* base=0 */
> >> /* range=128 */
> >>
> >> As the calculus here is to calculate the number of location used, and
> >> taking into account that the Vulkan API restrictions for push constants
> >> that says that push constant ranges that say that offset must be
> >> multiple of 4 and size must be multiple of 4, maintain the use of
> >> 4-bytes slots was ok for supporting the feature. Our code changes just
> >> take the accountability in the number of 32-bits location needed, mainly
> >> changing the divisions by 4 using DIV_ROUND_UP( , 4) to calculate sizes.
> > I'm probably misunderstanding something. Let me ask a few clarifying questions.
> >
> > I'm reading that the incoming 16-bit values are given in 32-bit slots, and for
> > the same reason we place them in the push/pull buffers in 32-bits slots. In
> > other words a vec4 would take 16-bytes and each component would 32-bits apart?
> 
> Probably I explained quite bad. A f16vec4 would use 8-bytes, and each
> component
> is going to be 16-bits apart. The 32-bit multiple offset only applies to
> the first
> element.
> 
> > If that is the case, then don't we need to adjust the register offsets
> > somewhere the way I did in the fragment below? Otherwise the offsets will
> > point to locations in the register that are simply 16-bits apart?
> 
> Yes components are 16-bits apart. Because of that we don't need anything
>  especial to adjust the offsets. At least we didn't needed for existing
> tests.

Ah, okay. I misunderstood, thanks for explaining. That sounds good. I need the
hack below for now. I'll come back to it once I have all other things needed
for the benchmarks addressed somehow.

> 
> >
> >>> commit 1a6d2bf3302f6e4305e383da0f27712dc5c20a67
> >>> Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
> >>> Date:   Sun Oct 29 20:28:03 2017 +0200
> >>>
> >>>     fix alignment of 16-bit uniforms on 32-bit slots
> >>>
> >>> diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp
> >>> index 2f5443958a..586eb9d9ff 100644
> >>> --- a/src/intel/compiler/brw_fs_nir.cpp
> >>> +++ b/src/intel/compiler/brw_fs_nir.cpp
> >>> @@ -4007,7 +4007,10 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
> >>>           src.offset = const_offset->u32[0];
> >>>  
> >>>           for (unsigned j = 0; j < instr->num_components; j++) {
> >>> -            bld.MOV(offset(dest, bld, j), offset(src, bld, j));
> >>> +            const unsigned src_offset =
> >>> +              src.type == BRW_REGISTER_TYPE_HF ? 2 * j : j;
> >>> +
> >>> +            bld.MOV(offset(dest, bld, j), offset(src, bld, src_offset));
> >>>
> >>>
> >>>
> >>> Then about the change of using 32-bit slots. This is now unconditional and
> >>> would require revisiting if we wanted to pack 16-bits tighter and possibly
> >>> increase the amount of uniforms that can be pushed. Similarly to Vulkan, in
> >>> GL the core stores uniforms as floats and I think we should keep it that way.
> >>> I added support in the i965 backend to keep track of the types of the
> >>> uniforms and to convert 32-bit presentation to 16-bits on the fly in
> >>> gen6_constant_state.c::brw_param_value(). I don't like it that much but I had
> >>> to start from somewhere.
> >>> My thinking is that we'd want to decouple the storage of the values and the
> >>> packing used in the compiler backend. Ideally keeping the mesa gl core and the
> >>> api working with full 32-bit floats but using tight 16-bit slots in the
> >>> push/pull constant buffers.
> >>> This requires quite a bit more changes as we have structured
> >>> param[]/pull_param[] to work with 32-bit slots.
> >> At the beginning we though that we need to change to do that for VK
> >> 16-bit values on push constants, but in this case is client of the API
> >> the one that has control of how values are pushed. And it is the
> >> limitation related to offset multiples of 4 the one that simplifies the
> >> logic enabling the use of 32-bit slots.
> >>
> >> But as you comment in the case of uniforms, it would make sense to have
> >> an accountability to a level of byte or two byte-slot. But I think that
> >> it will only improve the situation for 16-bit scalar uniforms. Current
> >> push model will use two 32-bit slots for two scalar 16-bit uniforms when
> >> we could use half of that space with a byte level accountability. But in
> >> the case of vec2 and vec4 current model is working fine. Also in the
> >> case of array-of 16-bit scalars the consequence is just losing half-slot
> >> in the case of a last odd element.
> > Such as GL, I'm seeing how Vulkan tells how the storage and manipulation of
> > push constants should be. But that doesn't dictate how driver backend
> > chooses to compile the shader and how to construct the push/pull buffers
> > given to the hardware.
> >
> > In Anvil anv_cmd_buffer.c::anv_cmd_buffer_push_constants() allocates space
> > for the upload, consults the storage and can place the values in the
> > upload buffer independent of the storage. Such as I described for the
> > GL backend, see gen6_constant_state.c::brw_param_value().
> >
> > Or am I missing something else?
> 
> I think we have the same picture here. The backend can decide where
> values are placed.
> 
> >
> >>> My current work can be found in:
> >>>
> >>> git://people.freedesktop.org/~tpohjola/mesa 16_bit_gles
> >> I've bisected the code because of a regression of the 16bit_storage
> >> push_constant test is caused at this commit.
> > Right, that may easily break things - all that I have is just enough to
> > run a few piglit tests in tests/spec/glsl-es-1.00/execution/ that I modified
> > to use uniforms/varyings/math and a couple of benchmarks. I'm trying to get
> > a general idea how much work there is on top of yours to support gles.
> >
> >> commit a8d43602174fa6062839c568690b06c6fab4f2d1 (HEAD, refs/bisect/bad)
> >> Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
> >> Date:   Thu Oct 26 20:16:42 2017 +0300
> >>
> >>     i965: Add support for uploading 16-bit uniforms from 32-bit store
> >>
> >>     At this point 16-bit uniforms still take full 32-bit slots in the
> >>     pull/push constant buffers and in shader deployment payload.
> >>
> >>     Signed-off-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
> >>
> >>
> >>
> >> You would need to include in your branch to test the 16bit_storage push
> >> constant tests the last commit of our series.
> >>
> >> anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+
> > I haven't been running Vulkan with my tree - I only focused on gles shaders.
> > There are a few benchmarks which may benefit from 16-bits.
> >
>