[Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

Mon Oct 30 16:10:53 UTC 2017

El 30/10/17 a las 07:44, Pohjolainen, Topi escribió:
> On Sun, Oct 29, 2017 at 11:17:11PM +0100, Chema Casanova wrote:
>> On 29/10/17 19:55, Pohjolainen, Topi wrote:
>>> On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
>>>> We enable the use of 16-bit values in push constants
>>>> modifying the assign_constant_locations function to work
>>>> with 16-bit types.
>>>>
>>>> The API to access buffers in Vulkan use multiples of 4-byte for
>>>> offsets and sizes. Current accountability of uniforms based on 4-byte
>>>> slots will work for 16-bit values if they are allowed to use 32-bit
>>>> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
>>>> 2-byte elements will use 1 slot instead of 0.
>>>>
>>>> We aligns the 16-bit locations after assigning the 32-bit
>>>> ones.
>>>> ---
>>>>  src/intel/compiler/brw_fs.cpp | 30 +++++++++++++++++++++++-------
>>>>  1 file changed, 23 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>> index a1d49a63be..8da16145dc 100644
>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>> @@ -1909,8 +1909,9 @@ set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
>>>>     if (!contiguous) {
>>>>        /* If bitsize doesn't match the target one, skip it */
>>>>        if (*max_chunk_bitsize != target_bitsize) {
>>>> -         /* FIXME: right now we only support 32 and 64-bit accesses */
>>>> -         assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
>>>> +         assert(*max_chunk_bitsize == 4 ||
>>>> +                *max_chunk_bitsize == 8 ||
>>>> +                *max_chunk_bitsize == 2);
>>>>           *max_chunk_bitsize = 0;
>>>>           *chunk_start = -1;
>>>>           return;
>>>> @@ -1987,8 +1988,9 @@ fs_visitor::assign_constant_locations()
>>>>           int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
>>> Did you test this with, for example, vec4?
>> CTS has 16bit scalar, vec2 (uint,sint), vec4 (float) and matrix tests
>> for push constants for compute and graphics pipelines. For vec4 you can try:
>>
>> dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant_16_to_32.vector_float
>>
>> For push constant tests in general there are 42 tests, but vec3 aren't
>> tested:
>>
>> dEQP-VK.*16bit_storage.*push_constant.
>>
>>
>>> I've been toying with a glsl
>>> lowering pass changing mediump floats into float16. I was curious to know how
>>> much is needed as you have addressed most of the things from NIR onwards.
>>> Here I'm seeing offsets 0,2,4,6 which result into 0,0,1,1 when divided by
>>> four. Don't we need something of this sort in addition?
>> If i remember correctly, tests were testing to use push constants with
>> 64 16bit values, to use the minimum spec maximum available as
>> max_push_constants_size that is 128 bytes. So at the end the generated
>> intrinsic was:
>>
>> vec4 16 ssa_4 = intrinsic load_uniform (ssa_3) () (0, 128) /* base=0 */
>> /* range=128 */
>>
>> As the calculus here is to calculate the number of location used, and
>> taking into account that the Vulkan API restrictions for push constants
>> that says that push constant ranges that say that offset must be
>> multiple of 4 and size must be multiple of 4, maintain the use of
>> 4-bytes slots was ok for supporting the feature. Our code changes just
>> take the accountability in the number of 32-bits location needed, mainly
>> changing the divisions by 4 using DIV_ROUND_UP( , 4) to calculate sizes.
> I'm probably misunderstanding something. Let me ask a few clarifying questions.
>
> I'm reading that the incoming 16-bit values are given in 32-bit slots, and for
> the same reason we place them in the push/pull buffers in 32-bits slots. In
> other words a vec4 would take 16-bytes and each component would 32-bits apart?

Probably I explained quite bad. A f16vec4 would use 8-bytes, and each
component
is going to be 16-bits apart. The 32-bit multiple offset only applies to
the first
element.

> If that is the case, then don't we need to adjust the register offsets
> somewhere the way I did in the fragment below? Otherwise the offsets will
> point to locations in the register that are simply 16-bits apart?

Yes components are 16-bits apart. Because of that we don't need anything
 especial to adjust the offsets. At least we didn't needed for existing
tests.

>
>>> commit 1a6d2bf3302f6e4305e383da0f27712dc5c20a67
>>> Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
>>> Date:   Sun Oct 29 20:28:03 2017 +0200
>>>
>>>     fix alignment of 16-bit uniforms on 32-bit slots
>>>
>>> diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp
>>> index 2f5443958a..586eb9d9ff 100644
>>> --- a/src/intel/compiler/brw_fs_nir.cpp
>>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>>> @@ -4007,7 +4007,10 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>           src.offset = const_offset->u32[0];
>>>  
>>>           for (unsigned j = 0; j < instr->num_components; j++) {
>>> -            bld.MOV(offset(dest, bld, j), offset(src, bld, j));
>>> +            const unsigned src_offset =
>>> +              src.type == BRW_REGISTER_TYPE_HF ? 2 * j : j;
>>> +
>>> +            bld.MOV(offset(dest, bld, j), offset(src, bld, src_offset));
>>>
>>>
>>>
>>> Then about the change of using 32-bit slots. This is now unconditional and
>>> would require revisiting if we wanted to pack 16-bits tighter and possibly
>>> increase the amount of uniforms that can be pushed. Similarly to Vulkan, in
>>> GL the core stores uniforms as floats and I think we should keep it that way.
>>> I added support in the i965 backend to keep track of the types of the
>>> uniforms and to convert 32-bit presentation to 16-bits on the fly in
>>> gen6_constant_state.c::brw_param_value(). I don't like it that much but I had
>>> to start from somewhere.
>>> My thinking is that we'd want to decouple the storage of the values and the
>>> packing used in the compiler backend. Ideally keeping the mesa gl core and the
>>> api working with full 32-bit floats but using tight 16-bit slots in the
>>> push/pull constant buffers.
>>> This requires quite a bit more changes as we have structured
>>> param[]/pull_param[] to work with 32-bit slots.
>> At the beginning we though that we need to change to do that for VK
>> 16-bit values on push constants, but in this case is client of the API
>> the one that has control of how values are pushed. And it is the
>> limitation related to offset multiples of 4 the one that simplifies the
>> logic enabling the use of 32-bit slots.
>>
>> But as you comment in the case of uniforms, it would make sense to have
>> an accountability to a level of byte or two byte-slot. But I think that
>> it will only improve the situation for 16-bit scalar uniforms. Current
>> push model will use two 32-bit slots for two scalar 16-bit uniforms when
>> we could use half of that space with a byte level accountability. But in
>> the case of vec2 and vec4 current model is working fine. Also in the
>> case of array-of 16-bit scalars the consequence is just losing half-slot
>> in the case of a last odd element.
> Such as GL, I'm seeing how Vulkan tells how the storage and manipulation of
> push constants should be. But that doesn't dictate how driver backend
> chooses to compile the shader and how to construct the push/pull buffers
> given to the hardware.
>
> In Anvil anv_cmd_buffer.c::anv_cmd_buffer_push_constants() allocates space
> for the upload, consults the storage and can place the values in the
> upload buffer independent of the storage. Such as I described for the
> GL backend, see gen6_constant_state.c::brw_param_value().
>
> Or am I missing something else?

I think we have the same picture here. The backend can decide where
values are placed.

>
>>> My current work can be found in:
>>>
>>> git://people.freedesktop.org/~tpohjola/mesa 16_bit_gles
>> I've bisected the code because of a regression of the 16bit_storage
>> push_constant test is caused at this commit.
> Right, that may easily break things - all that I have is just enough to
> run a few piglit tests in tests/spec/glsl-es-1.00/execution/ that I modified
> to use uniforms/varyings/math and a couple of benchmarks. I'm trying to get
> a general idea how much work there is on top of yours to support gles.
>
>> commit a8d43602174fa6062839c568690b06c6fab4f2d1 (HEAD, refs/bisect/bad)
>> Author: Topi Pohjolainen <topi.pohjolainen at intel.com>
>> Date:   Thu Oct 26 20:16:42 2017 +0300
>>
>>     i965: Add support for uploading 16-bit uniforms from 32-bit store
>>
>>     At this point 16-bit uniforms still take full 32-bit slots in the
>>     pull/push constant buffers and in shader deployment payload.
>>
>>     Signed-off-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
>>
>>
>>
>> You would need to include in your branch to test the 16bit_storage push
>> constant tests the last commit of our series.
>>
>> anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+
> I haven't been running Vulkan with my tree - I only focused on gles shaders.
> There are a few benchmarks which may benefit from 16-bits.
>