[Mesa-dev] intel: 8 and 16-bit booleans

Iago Toral itoral at igalia.com
Fri Jan 11 07:05:26 UTC 2019


On Thu, 2019-01-10 at 13:18 -0600, Jason Ekstrand wrote:
> Topi just asked me on IRC what I thought about handling 16-bit
> booleans on Intel hardware in the light of the 1-bit boolean stuff. 
> The current state of the driver is that we use
> nir_lower_bool_to_int32 pass to produce NIR that looks basically
> identical to the NIR we were getting in the back-end before.  This
> lets us kick the can down the road a bit but I alluded in the 1-bit
> boolean series to ideas of doing something more intel-specific. 
> Instead of answering on IRC, I thought I'd send a mesa-dev mail so
> that we can have a more universal discussion.
> 
> ## The problem:
> 
> On Intel hardware, comparison operations generate two results: a flag
> result which goes straight into the flag register and a destination
> result which goes into the GRF pointed to by the CMP instruction's
> destination.  The flag result can be thought of as either a 32-bit
> bitfield scalar (in the sense of one for all threads) or as a per-
> thread 1-bit value.  The GRF value is a per-thread value whose size
> matches that of the execution size of the instruction.  If you're
> comparing two 64-bit integers or floats, it produces a 64-bit value
> (though I believe the top 32 bits are garbage).  On a 32, 16, or 8-
> bit comparison, it produces a 32, 16, or 8-bit boolean respectively. 
> The only reason why D3D booleans have historically been a good match
> for our hardware is because we've historically only really cared
> about 32-bit values.  With 64-bit types, we could just do a
> conversion and write it off as "64-bit is expensive."  In the new
> world of 8 and 16-bit types, however, that doesn't make nearly as
> much sense.
> 
> ## Solutions:
> 
> The real question is what size we should make booleans in the back-
> end.  There are many different possible answers to this question but
> whatever happens, it should probably happen in NIR so that we can
> make choices while we're still in SSA.  I've considered a few
> different ideas on what we could do:
> 
>  1. Make everything 16-bit.  8-bit is clumsy because of the weird
> stride requirements but 32 and 64-bit can trivially be converted to
> 16-bit with a strided integer MOV.  For the few places where we need
> an actual 32-bit bool (b2f), a signed integer up-cast will do the
> trick.  For that matter, just using a mixed-size AND with W type for
> the bool and D type for the 0x3f800000 might do the trick and keep
> it one instruction.
> 
>  2. Use the "native" boolean size for all comparison operations and
> then, whenever we need to combine booleans via iand, bcsel, or a phi,
> you make the result the smallest of the sources and insert
> extract_u16 or extract_u32 to do a down-cast for the larger sources.
> (We want the extract opcodes so we get a strided MOV which the back-
> end can more easily eliminate.)
> 
>  3. Don't use comparison destinations at all and treat the flag as a
> 32 or 16-bit value (depending on dispatch width).  You can do a
> boolean AND by just ANDing flag results and you have to write into
> the flag at the end in order to use it.  This idea is a bit on the
> crazy side but it's interesting to think about.
> 
> If idea 1 actually works, it would reduce register pressure a decent
> bit which would be a very good thing.  However, I'm not sure how well
> we'll actually be able to optimize with it.

I have implemented 1) (I was planning to send a series for review after
we land the 8-bit and 16-bit series). I think it is working quite well
for me, but of course I only have the CTS tests to play with. Here are
some numbers:
VK_KHR_shader_float16_int8 branch:

                                      |   SIMD8   |  SIMD16  |
  ------------------------------------|-----------|----------|
  spirv_assembly.type.scalar.i8.*     |   19,725  |   2,044  |
  spirv_assembly.type.scalar.i16.*    |   35,504  |   3,650  |
  instruction.graphics.float16.*      |  305,129  |  29,760  |
  builtin.precision*.comparison.*     |           |   2,284  |

VK_KHR_shader_float16_int8 + 8-bit/16-bit booleans:

                                      |   SIMD8   |  SIMD16  |
  ------------------------------------|-----------|----------|
  spirv_assembly.type.scalar.i8.*     |   19,718  |   2,043  |
  spirv_assembly.type.scalar.i16.*    |   35,369  |   3,645  |
  instruction.graphics.float16.*      |  302,764  |  29,627  |
  builtin.precision*.comparison.*     |     -     |   2,144  |

I see benefits across the board. It is not a huge improvement, but
there is some. Getting 8-bit booleans to produce a lower instruction
count took a bit more work because the hardware doesn't support 8-bit
immediates, so the usual comparisons with 0 that we emit (specifically
for bcsel) would produce worse code, since we would need to emit a MOV
to a VGRF to handle the constant argument. I fixed this by using a
MOV.NZ to write the flag, and then I had to patch the CSE pass to work
on MOV instructions with a NULL destination, which should be safe. With
that, 8-bit booleans are about the same as 32-bit booleans, maybe 1 or
2 instructions fewer in a few shaders I was playing with, so I think it
is probably worth a try. I'd definitely do this for 16-bit booleans at
least.
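
To illustrate the MOV.NZ trick, this is roughly the shape of the bcsel
emission for an 8-bit boolean condition. Just a minimal sketch using the
i965 fs_builder helpers; the result/op[] names and the surrounding
nir_emit_alu() context are assumed, not the literal code in the branch:

   /* Set the flag with a conditional-mod MOV to the null register
    * instead of a CMP against an 8-bit immediate (which the hardware
    * doesn't support), then predicate the SEL on that flag.
    */
   fs_inst *mov = bld.MOV(bld.null_reg_d(),
                          retype(op[0], BRW_REGISTER_TYPE_B));
   mov->conditional_mod = BRW_CONDITIONAL_NZ;

   fs_inst *sel = bld.SEL(result, op[1], op[2]);
   sel->predicate = BRW_PREDICATE_NORMAL;
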
Another thing: the boolean lowering I have is not perfect. When we need
to produce canonical booleans, or when we have undef operands, I just
take an easy way out, but I think we could probably do something
smarter in some cases to reduce the number of conversions, which should
allow us to produce even better instruction counts.
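
As a concrete example of the kind of conversion involved: where a
canonical 32-bit boolean result is needed (b2f32, say), a 16-bit
boolean can be resolved with the mixed-size AND Jason mentions above,
so it stays a single instruction. A minimal sketch with the i965
fs_builder helpers (the result/op[0] names and surrounding context are
assumed):

   /* 0xffff:W sign-extends to 0xffffffff, so ANDing it with the bit
    * pattern of 1.0f gives 1.0f for true and 0.0f for false.
    */
   bld.AND(retype(result, BRW_REGISTER_TYPE_D),
           retype(op[0], BRW_REGISTER_TYPE_W),
           brw_imm_d(0x3f800000));
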
I've just uploaded a branch here if you want to look at the
implementation I have right now (patches at the tip of the branch, on
top of the float16/int8 implementation):

https://github.com/Igalia/mesa/tree/itoral/VK_KHR_shader_float16_int8_1bit_bool

> Regardless of what we do, we'll need some new NIR instructions.  I
> think that was more Topi's direct question.  I think the easiest
> thing to do would be to make 16 or 64-bit versions of the comparison
> instructions we have today.  We could make the binop_compare helper
> in nir_opcodes.py just generate versions of the opcodes at all the
> bit sizes and call it a day.

I added a bunch of opcodes in my branch, maybe that's enough? I guess
we could auto-generate some of those if we want.  I am not sure if we
would need more stuff for GLSL, but for SPIR-V that seems to be all
that I needed going by the existing CTS tests.
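
For what it's worth, the backend side of those opcodes is pretty
mechanical; something along these lines is what a 16-bit integer
comparison would look like in nir_emit_alu() (a rough sketch, assuming
a nir_op_ilt16 opcode name, not necessarily the literal code in the
branch):

   case nir_op_ilt16:
      /* A 16-bit compare can keep W types end to end, so the boolean
       * destination is naturally 16-bit and no widening MOV is needed.
       */
      bld.CMP(retype(result, BRW_REGISTER_TYPE_W),
              retype(op[0], BRW_REGISTER_TYPE_W),
              retype(op[1], BRW_REGISTER_TYPE_W),
              BRW_CONDITIONAL_L);
      break;
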
> In any case, there's my brain dump of ideas.  I hope some of it is
> useful.  I've tested none of it in practice.  Have fun!
> 
> --Jason
> 