[Mesa-dev] intel: 8 and 16-bit booleans

Fri Jan 11 09:35:10 UTC 2019

On Fri, 2019-01-11 at 08:05 +0100, Iago Toral wrote:
> On Thu, 2019-01-10 at 13:18 -0600, Jason Ekstrand wrote:
> > Topi just asked me on IRC what I thought about handling 16-bit
> > booleans on Intel hardware in the light of the 1-bit boolean
> > stuff.  The current state of the driver is that we use the
> > nir_lower_bool_to_int32 pass to produce NIR that looks basically
> > identical to the NIR we were getting in the back-end before.  This
> > lets us kick the can down the road a bit but I alluded in the 1-bit 
> > boolean series to ideas of doing something more intel-specific. 
> > Instead of answering on IRC, I thought I'd send a mesa-dev mail so
> > that we can have a more universal discussion.
> > 
> > ## The problem:
> > 
> > On Intel hardware, comparison operations generate two results: a
> > flag result which goes straight into the flag register and a
> > destination result which goes into the GRF pointed to by the CMP
> > instruction's destination.  The flag result can be thought of as
> > either a 32-bit bitfield scalar (in the sense of one for all
> > threads) or as a per-thread 1-bit value.  The GRF value is a per-
> > thread value whose size matches that of the execution size of the
> > instruction.  If you're comparing two 64-bit integers or floats, it
> > produces a 64-bit value (though I believe the top 32 bits are
> > garbage).  On a 32, 16, or 8-bit comparison, it produces a 32, 16,
> > or 8-bit boolean respectively.  The only reason why D3D booleans
> > have historically been a good match for our hardware is because
> > we've historically only really cared about 32-bit values.  With 64-
> > bit types, we could just do a conversion and write it off as "64-
> > bit is expensive."  In the new world of 8 and 16-bit types,
> > however, that doesn't make nearly as much sense.
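The two comparison results described above can be pictured with a toy model. This is only a sketch, not hardware or Mesa code: the function names are hypothetical, and it assumes the D3D convention (all ones for true, zero for false) for the destination value. The 64-bit case is left out because the message itself is unsure about the top 32 bits.

```python
def cmp_dest(a, b, bits):
    """Toy per-channel destination result of an equality compare.

    Assumption: true is all ones across `bits` bits, false is zero.
    """
    mask = (1 << bits) - 1
    return mask if a == b else 0


def cmp_flag(lhs, rhs):
    """Toy flag result: one bit per channel, packed into a single integer."""
    flag = 0
    for channel, (a, b) in enumerate(zip(lhs, rhs)):
        if a == b:
            flag |= 1 << channel
    return flag


# A 16-bit comparison yields a 16-bit boolean, an 8-bit one an 8-bit boolean.
print(hex(cmp_dest(3, 3, 16)), hex(cmp_dest(3, 3, 8)))
# Four channels, of which channels 0 and 2 compare equal.
print(bin(cmp_flag([1, 2, 3, 4], [1, 0, 3, 0])))
```

The point of the model is only that the width of the destination boolean follows the width of the operands, while the flag always carries one bit per channel.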
> > 
> > ## Solutions:
> > 
> > The real question is what size we should make booleans in the back-
> > end.  There are many different possible answers to this question
> > but whatever happens, it should probably happen in NIR so that we
> > can make choices while we're still in SSA.  I've considered a few
> > different ideas on what we could do:
> > 
> >  1. Make everything 16-bit.  8-bit is clumsy because of the weird
> > stride requirements but 32 and 64-bit can trivially be converted to
> > 16-bit with a strided integer MOV.  For the few places where we
> > need an actual 32-bit bool (b2f), a signed integer up-cast will do
> > the trick.  For that matter, just using a mixed-size AND with W
> > type for the bool and D type for the 0x3f800000 might do the trick
> > and keep it one instruction.
> > 
> >  2. Use the "native" boolean size for all comparison operations and
> > then, whenever we need to combine booleans via iand, bcsel, or a
> > phi, you make the result the smallest of the sources and insert
> > extract_u16 or extract_u32 to do a down-cast for the larger
> > sources. (We want the extract opcodes so we get a strided MOV which
> > the back-end can more easily eliminate.)
> > 
> >  3. Don't use comparison destinations at all and treat the flag as
> > a 32 or 16-bit value (depending on dispatch width).  You can do a
> > boolean AND by just ANDing flag results and you have to write into
> > the flag at the end in order to use it.  This idea is a bit on the
> > crazy side but it's interesting to think about.
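Idea 2 above can be sketched as follows. This is a toy model with hypothetical helper names, assuming booleans are all ones or zero at their native width; `extract_low` stands in for what extract_u16 or a similar down-cast would do on the low part of the value, and is not the NIR opcode itself.

```python
def extract_low(value, bits):
    """Keep only the low `bits` bits of a wider boolean (a down-cast)."""
    return value & ((1 << bits) - 1)


def combine_iand(a, a_bits, b, b_bits):
    """AND two booleans of possibly different widths.

    The result takes the smaller of the two source widths; the larger
    source is down-cast first.
    """
    bits = min(a_bits, b_bits)
    return extract_low(a, bits) & extract_low(b, bits), bits


# A 32-bit true combined with a 16-bit true gives a 16-bit true.
print(combine_iand(0xFFFFFFFF, 32, 0xFFFF, 16))
```

The same width rule would apply to bcsel and phi sources; only iand is modelled here to keep the sketch short.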
> > 
> > If idea 1 actually works, it would reduce register pressure a
> > decent bit which would be a very good thing.  However, I'm not sure
> > how well we'll actually be able to optimize with it.
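The b2f trick mentioned under idea 1 can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names, assuming a 16-bit boolean that is 0xFFFF for true and 0 for false. It relies on 0x3f800000 being the IEEE 754 single-precision bit pattern of 1.0.

```python
import struct

ONE_F32_BITS = 0x3F800000  # bit pattern of 1.0 as a 32-bit float


def sign_extend_16_to_32(value):
    """Signed integer up-cast from 16 to 32 bits, kept as an unsigned int."""
    value &= 0xFFFF
    if value & 0x8000:
        value |= 0xFFFF0000
    return value


def b2f_from_bool16(bool16):
    """Convert a 16-bit all-ones/zero boolean to 1.0 or 0.0.

    The up-cast boolean is ANDed with the bit pattern of 1.0 and the
    result is reinterpreted as a float.
    """
    bits = sign_extend_16_to_32(bool16) & ONE_F32_BITS
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(b2f_from_bool16(0xFFFF), b2f_from_bool16(0x0000))
```

Whether the hardware can fold the up-cast and the AND into one mixed-size instruction is the open question raised in the message; the sketch only shows that the arithmetic works out.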
> 
> I have 1) implemented (I was planning to send a series for review
> after we land the 8-bit and 16-bit series). I think it is
> working quite well for me, but of course I only have the CTS tests to
> play with. Here are some numbers:
> 
> VK_KHR_shader_float16_int8 branch:
> 
>                                     |        SIMD8    |        SIMD16     |
> ---------------------------------------------------------------------------
> spirv_assembly.type.scalar.i8.*     |      19,725     |        2,044      |
> spirv_assembly.type.scalar.i16.*    |      35,504     |        3,650      |
> instruction.graphics.float16.*      |     305,129     |       29,760      |
> builtin.precision*.comparison.*     |        -        |        2,284      |
> 
> VK_KHR_shader_float16_int8 + 8-bit/16-bit booleans:
> 
>                                     |        SIMD8    |        SIMD16     |
> ---------------------------------------------------------------------------
> spirv_assembly.type.scalar.i8.*     |      19,718     |        2,043      |
> spirv_assembly.type.scalar.i16.*    |      35,369     |        3,645      |
> instruction.graphics.float16.*      |     302,764     |       29,627      |
> builtin.precision*.comparison.*     |        -        |        2,144      |
> 
> I see benefits across the board. It is not a huge improvement, but
> there is some. Getting 8-bit booleans to produce a lower number of
> instructions took a bit more work because the hardware doesn't
> support 8-bit immediates, so the usual comparisons with 0 that we
> emit (specifically for bcsel) would produce worse code since we
> would need to emit a MOV to a VGRF to handle the constant argument. I
> fixed this by using a MOV.NZ to write the flag, and then I had to
> patch the CSE pass to work on MOV instructions with a NULL
> destination, which should be safe. With that it is about the same as
> 32-bit booleans, with maybe 1 or 2 instructions less in a few shaders
> I was playing with, so I think it is probably worth a try. I'd
> definitely do this for 16-bit booleans at least.
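To put the "not a huge improvement" remark in numbers, the relative savings between the two tables can be computed directly. The figures below are copied from the tables (presumably instruction counts, though the message does not label them); the variable and function names are only for illustration, and the missing SIMD8 entry for the comparison tests is kept as None.

```python
# (SIMD8, SIMD16) counts from the VK_KHR_shader_float16_int8 branch
baseline = {
    "spirv_assembly.type.scalar.i8.*": (19725, 2044),
    "spirv_assembly.type.scalar.i16.*": (35504, 3650),
    "instruction.graphics.float16.*": (305129, 29760),
    "builtin.precision*.comparison.*": (None, 2284),
}

# (SIMD8, SIMD16) counts with 8-bit/16-bit booleans on top
with_small_bools = {
    "spirv_assembly.type.scalar.i8.*": (19718, 2043),
    "spirv_assembly.type.scalar.i16.*": (35369, 3645),
    "instruction.graphics.float16.*": (302764, 29627),
    "builtin.precision*.comparison.*": (None, 2144),
}


def percent_saved(before, after):
    """Relative reduction in percent, or None when a value is missing."""
    if before is None or after is None:
        return None
    return 100.0 * (before - after) / before


savings = {
    group: tuple(
        percent_saved(before, after)
        for before, after in zip(baseline[group], with_small_bools[group])
    )
    for group in baseline
}

for group, (simd8, simd16) in savings.items():
    print(group, simd8, simd16)
```

Most groups come out well under one percent, while the comparison-heavy precision tests show the largest relative change.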

One more thing about 8-bit booleans. Due to the hardware restrictions
affecting Byte types, we end up having to align them all the time to
Word, so in practice I think they do not really bring an advantage and
emitting 16-bit booleans for them might be a better solution unless we
have reason to believe that Byte instructions have better ALU
throughput to compensate for the extra hassle.

> Another thing, the boolean lowering I have is not perfect. When we
> find the need to make canonical booleans, or when we have undef
> operands, I just take an easy way out, but I think we could probably
> do something smarter in some cases to reduce the number of conversions
> which should allow us to produce even better instruction counts.
> 
> I've just uploaded a branch here if you want to look at the
> implementation I have right now (patches at the tip of the
> branch, on top of the float16/int8 implementation):
> 
> https://github.com/Igalia/mesa/tree/itoral/VK_KHR_shader_float16_int8_1bit_bool
> 
> 
> > Regardless of what we do, we'll need some new NIR instructions.  I
> > think that was more Topi's direct question.  I think the easiest
> > thing to do would be to make 16 or 64-bit versions of the
> > comparison instructions we have today.  We could make the
> > binop_compare helper in nir_opcodes.py just generate versions of
> > the opcodes at all the bit sizes and call it a day.
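The suggestion of generating comparison opcodes at every bit size could look roughly like the sketch below. It is a standalone illustration only: the function, the opcode naming scheme and the table layout are hypothetical and are not the actual nir_opcodes.py interface or the existing binop_compare helper.

```python
# Hypothetical base comparisons: name -> expression on sources a and b.
COMPARISONS = {
    "flt": "a < b",
    "fge": "a >= b",
    "feq": "a == b",
    "ieq": "a == b",
}


def generate_compare_opcodes(comparisons, bool_bit_sizes):
    """Emit one opcode description per (comparison, boolean size) pair.

    Each generated opcode is named `<base><bits>` and records the size
    of the boolean it would produce.
    """
    opcodes = {}
    for name, expr in comparisons.items():
        for bits in bool_bit_sizes:
            opcodes[f"{name}{bits}"] = {
                "base": name,
                "dest_bits": bits,
                "expr": expr,
            }
    return opcodes


opcodes = generate_compare_opcodes(COMPARISONS, (16, 32))
print(sorted(opcodes))
```

Generating the variants from one table keeps the per-size opcodes in sync, which is the "call it a day" appeal of the approach described in the message.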
> 
> I added a bunch of opcodes in my branch, maybe that's enough? I guess
> we could auto-generate some of those if we want.  I am not sure if we
> would need more stuff for GLSL, but for SPIR-V that seems to be all
> that I needed going by the existing CTS tests.
> 
> > In any case, there's my brain dump of ideas.  I hope some of it is
> > useful.  I've tested none of it in practice.  Have fun!
> > 
> > --Jason
> > 
> 