[Mesa-dev] intel: 8 and 16-bit booleans

Jason Ekstrand jason at jlekstrand.net
Thu Jan 10 19:18:09 UTC 2019


Topi just asked me on IRC what I thought about handling 16-bit booleans on
Intel hardware in the light of the 1-bit boolean stuff.  The current state
of the driver is that we use the nir_lower_bool_to_int32 pass to produce NIR
that looks basically identical to the NIR we were getting in the back-end
before.  This lets us kick the can down the road a bit, but I alluded in the
1-bit boolean series to ideas of doing something more Intel-specific.
Instead of answering on IRC, I thought I'd send a mesa-dev mail so that we
can have a more universal discussion.

## The problem:

On Intel hardware, comparison operations generate two results: a flag
result which goes straight into the flag register and a destination result
which goes into the GRF pointed to by the CMP instruction's destination.
The flag result can be thought of as either a 32-bit bitfield scalar (in
the sense of one for all threads) or as a per-thread 1-bit value.  The GRF
value is a per-thread value whose size matches that of the execution size
of the instruction.  If you're comparing two 64-bit integers or floats, it
produces a 64-bit value (though I believe the top 32 bits are garbage).  On
a 32, 16, or 8-bit comparison, it produces a 32, 16, or 8-bit boolean
respectively.  The only reason why D3D booleans have historically been a
good match for our hardware is because we've historically only really cared
about 32-bit values.  With 64-bit types, we could just do a conversion and
write it off as "64-bit is expensive."  In the new world of 8 and 16-bit
types, however, that doesn't make nearly as much sense.

## Solutions:

The real question is what size we should make booleans in the back-end.
There are many different possible answers to this question but whatever
happens, it should probably happen in NIR so that we can make choices while
we're still in SSA.  I've considered a few different ideas on what we could
do:

 1. Make everything 16-bit.  8-bit is clumsy because of the weird stride
requirements but 32 and 64-bit can trivially be converted to 16-bit with a
strided integer MOV.  For the few places where we need an actual 32-bit
bool (b2f), a signed integer up-cast will do the trick.  For that matter,
just using a mixed-size AND with W type for the bool and D type for the
0x3f800000 constant might do the trick and keep it one instruction.

 2. Use the "native" boolean size for all comparison operations and then,
whenever we need to combine booleans via iand, bcsel, or a phi, make the
result the size of the smallest source and insert extract_u16 or
extract_u32 to down-cast the larger sources.  (We want the extract
opcodes so we get a strided MOV which the back-end can more easily
eliminate.)

 3. Don't use comparison destinations at all and treat the flag as a 32 or
16-bit value (depending on dispatch width).  You can do a boolean AND by
just ANDing flag results and you have to write into the flag at the end in
order to use it.  This idea is a bit on the crazy side but it's interesting
to think about.
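
Idea 2's down-cast can be sketched in scalar terms.  This is just a Python
model of the intended NIR semantics, not real lowering code; extract_u16
here mimics the NIR opcode of the same name, and ~0 is NIR's canonical
"true":

```python
def extract_u16(value, word):
    # Model of NIR's extract_u16: select 16-bit word `word` of `value`,
    # zero-extended into the result.
    return (value >> (16 * word)) & 0xFFFF

# Combine a 32-bit boolean with a 16-bit one: down-cast the larger source
# to the smallest size, then iand at 16 bits.
b32_true = 0xFFFFFFFF   # 32-bit "true" (~0)
b16_true = 0xFFFF       # 16-bit "true" (~0)
result = extract_u16(b32_true, 0) & b16_true
```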

If idea 1 actually works, it would reduce register pressure a decent bit
which would be a very good thing.  However, I'm not sure how well we'll
actually be able to optimize with it.
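
The arithmetic behind idea 1's single-instruction b2f can be checked in a
few lines.  This Python sketch models the hardware detail that a W-typed
source gets sign-extended when mixed with a D-typed one; 0x3f800000 is the
bit pattern of 1.0f:

```python
import struct

def b2f_via_mixed_and(bool16):
    # Model a mixed-size AND: the W-typed 16-bit boolean is sign-extended
    # to 32 bits, then ANDed with the D-typed constant 0x3f800000.
    sext = (bool16 | 0xFFFF0000) if (bool16 & 0x8000) else bool16
    bits = sext & 0x3f800000
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```

A true of ~0 (0xFFFF) sign-extends to 0xFFFFFFFF, so the AND leaves exactly
1.0f's bits; a false of 0 yields 0.0f.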

Regardless of what we do, we'll need some new NIR instructions.  I think
that was more Topi's direct question.  The easiest thing to do would
probably be to make 16 or 64-bit versions of the comparison instructions we
have today.  We could make the binop_compare helper in nir_opcodes.py just
generate versions of the opcodes at all the bit sizes and call it a day.
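
As a rough sketch of that last suggestion (this is not Mesa's actual code;
the helper name, signature, and registration stub below are guesses at the
style of nir_opcodes.py):

```python
# Hypothetical stand-in for nir_opcodes.py's opcode table and registration.
opcodes = {}

def opcode(name, dest_type, src_type, const_expr):
    # In Mesa this would build a real opcode description; here we just
    # record the pieces so the generation pattern is visible.
    opcodes[name] = (dest_type, src_type, const_expr)

def binop_compare_sized(name, src_type, const_expr, bool_sizes=(8, 16, 32)):
    # Emit one comparison opcode per boolean destination size, e.g. flt16.
    for size in bool_sizes:
        opcode("%s%d" % (name, size), "bool%d" % size, src_type, const_expr)

binop_compare_sized("flt", "float", "src0 < src1")
```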

In any case, there's my brain dump of ideas.  I hope some of it is useful.
I've tested none of it in practice.  Have fun!

--Jason