[Mesa-dev] intel: 8 and 16-bit booleans
Iago Toral
itoral at igalia.com
Fri Jan 11 07:05:26 UTC 2019
On Thu, 2019-01-10 at 13:18 -0600, Jason Ekstrand wrote:
> Topi just asked me on IRC what I thought about handling 16-bit
> booleans on Intel hardware in the light of the 1-bit boolean stuff.
> The current state of the driver is that we use
> nir_lower_bool_to_int32 pass to produce NIR that looks basically
> identical to the NIR we were getting in the back-end before. This
> lets us kick the can down the road a bit but I alluded in the 1-bit
> boolean series to ideas of doing something more intel-specific.
> Instead of answering on IRC, I thought I'd send a mesa-dev mail so
> that we can have a more universal discussion.
>
> ## The problem:
>
> On Intel hardware, comparison operations generate two results: a flag
> result which goes straight into the flag register and a destination
> result which goes into the GRF pointed to by the CMP instruction's
> destination. The flag result can be thought of as either a 32-bit
> bitfield scalar (in the sense of one for all threads) or as a per-
> thread 1-bit value. The GRF value is a per-thread value whose size
> matches that of the execution size of the instruction. If you're
> comparing two 64-bit integers or floats, it produces a 64-bit value
> (though I believe the top 32 bits are garbage). On a 32, 16, or 8-
> bit comparison, it produces a 32, 16, or 8-bit boolean respectively.
> The only reason why D3D booleans have historically been a good match
> for our hardware is because we've historically only really cared
> about 32-bit values. With 64-bit types, we could just do a
> conversion and write it off as "64-bit is expensive." In the new
> world of 8 and 16-bit types, however, that doesn't make nearly as
> much sense.
>
> ## Solutions:
>
> The real question is what size we should make booleans in the back-
> end. There are many different possible answers to this question but
> whatever happens, it should probably happen in NIR so that we can
> make choices while we're still in SSA. I've considered a few
> different ideas on what we could do:
>
> 1. Make everything 16-bit. 8-bit is clumsy because of the weird
> stride requirements but 32 and 64-bit can trivially be converted to
> 16-bit with a strided integer MOV. For the few places where we need
> an actual 32-bit bool (b2f), a signed integer up-cast will do the
> trick. For that matter, just using a mixed-size AND with W type for
> the bool and D type for the 0x3f800000 might do the trick and keep
> it one instruction.
>
> 2. Use the "native" boolean size for all comparison operations and
> then, whenever we need to combine booleans via iand, bcsel, or a phi,
> you make the result the smallest of the sources and insert
> extract_u16 or extract_u32 to do a down-cast for the larger sources.
> (We want the extract opcodes so we get a strided MOV which the back-
> end can more easily eliminate.)
>
> 3. Don't use comparison destinations at all and treat the flag as a
> 32 or 16-bit value (depending on dispatch width). You can do a
> boolean AND by just ANDing flag results and you have to write into
> the flag at the end in order to use it. This idea is a bit on the
> crazy side but it's interesting to think about.
>
> If idea 1 actually works, it would reduce register pressure a decent
> bit which would be a very good thing. However, I'm not sure how well
> we'll actually be able to optimize with it.
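The b2f trick from idea 1 can be sketched in a few lines of Python. This is only a bit-level model, not driver code: it assumes booleans are 0 or all-ones (the D3D convention mentioned above), and the helper names `bool32_to_bool16`, `sign_extend16` and `b2f` are made up for the illustration.

```python
import struct

def bool32_to_bool16(b32: int) -> int:
    """Model the strided MOV: keep only the low 16 bits of a 32-bit boolean."""
    return b32 & 0xFFFF

def sign_extend16(v: int) -> int:
    """Sign-extend a 16-bit value to 32 bits (returned as an unsigned int)."""
    v &= 0xFFFF
    return (v | 0xFFFF0000) if (v & 0x8000) else v

def b2f(b16: int) -> float:
    """AND the up-cast boolean with the bit pattern of 1.0f (0x3f800000)."""
    bits = sign_extend16(b16) & 0x3F800000
    # Reinterpret the 32-bit pattern as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

true16 = bool32_to_bool16(0xFFFFFFFF)   # 32-bit true narrowed to 16 bits
false16 = bool32_to_bool16(0x00000000)  # 32-bit false narrowed to 16 bits
print(b2f(true16), b2f(false16))
```

Because an all-ones 16-bit value sign-extends to an all-ones 32-bit value, the AND leaves exactly the bits of 1.0f for true and 0 for false.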
I have 1) implemented (I was planning to send a series for review
after we land the 8-bit and 16-bit series). I think it is working quite
well for me, but of course I only have the CTS tests to play with. Here
are some numbers:
VK_KHR_shader_float16_int8 branch:

                                 | SIMD8   | SIMD16 |
---------------------------------|---------|--------|
spirv_assembly.type.scalar.i8.*  | 19,725  | 2,044  |
spirv_assembly.type.scalar.i16.* | 35,504  | 3,650  |
instruction.graphics.float16.*   | 305,129 | 29,760 |
builtin.precision*.comparison.*  |         | 2,284  |

VK_KHR_shader_float16_int8 + 8-bit/16-bit booleans:

                                 | SIMD8   | SIMD16 |
---------------------------------|---------|--------|
spirv_assembly.type.scalar.i8.*  | 19,718  | 2,043  |
spirv_assembly.type.scalar.i16.* | 35,369  | 3,645  |
instruction.graphics.float16.*   | 302,764 | 29,627 |
builtin.precision*.comparison.*  | -       | 2,144  |
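For a sense of scale, the relative change between the two sets of numbers can be worked out directly. The snippet below only restates the figures from the tables; the dictionary layout and the `reduction_percent` helper are illustrative.

```python
# (baseline, with 8-bit/16-bit booleans), copied from the two tables.
counts = {
    "spirv_assembly.type.scalar.i8.*":  {"SIMD8": (19725, 19718), "SIMD16": (2044, 2043)},
    "spirv_assembly.type.scalar.i16.*": {"SIMD8": (35504, 35369), "SIMD16": (3650, 3645)},
    "instruction.graphics.float16.*":   {"SIMD8": (305129, 302764), "SIMD16": (29760, 29627)},
    # No SIMD8 figure is given for this group.
    "builtin.precision*.comparison.*":  {"SIMD16": (2284, 2144)},
}

def reduction_percent(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

reductions = {
    (test, width): reduction_percent(before, after)
    for test, widths in counts.items()
    for width, (before, after) in widths.items()
}

for (test, width), pct in reductions.items():
    print(f"{test:34s} {width:6s} {pct:5.2f}%")
```

Every listed group shrinks. Most change by well under 1%, while the builtin.precision*.comparison.* group drops by roughly 6%.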
I see benefits across the board. It is not a huge improvement, but
there is some. Getting 8-bit booleans to produce fewer instructions
took a bit more work because the hardware doesn't support 8-bit
immediates, so the usual comparisons with 0 that we emit (specifically
for bcsel) would produce worse code, since we would need to emit a MOV
to a VGRF to handle the constant argument. I fixed this by using a
MOV.NZ to write the flag, and then I had to patch the CSE pass to work
on MOV instructions with a NULL destination, which should be safe. With
that, it is about the same as 32-bit booleans, with maybe 1 or 2 fewer
instructions in a few shaders I was playing with, so I think it is
probably worth a try. I'd definitely do this for 16-bit booleans at
least.
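The reason a MOV.NZ can stand in for the comparison with 0 can be shown with a toy per-channel model. This is not back-end code: both function names are hypothetical, and the model only assumes that an NZ conditional modifier sets a channel's flag bit when the value is non-zero.

```python
def cmp_nz_with_zero(src):
    """Flag bits from comparing every channel against an explicit zero operand."""
    # The zero operand stands in for the constant that would need its own
    # MOV into a register when no 8-bit immediate is available.
    zero = [0] * len(src)
    return [int(a != b) for a, b in zip(src, zero)]

def mov_nz_null_dst(src):
    """Flag bits from a MOV with an NZ modifier; the moved value is discarded."""
    return [int(v != 0) for v in src]

# Eight channels holding 8-bit booleans (0x00 = false, 0xFF = true).
channels = [0x00, 0xFF, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0xFF]
print(mov_nz_null_dst(channels) == cmp_nz_with_zero(channels))
```

Both paths produce the same flag bits, but the second never needs the zero operand, which is where the extra MOV came from.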
Another thing, the boolean lowering I have is not perfect. When we find
the need to make canonical booleans, or when we have undef operands, I
just take an easy way out, but I think we could probably do something
smarter in some cases to reduce the number of conversions, which should
allow us to produce even better instruction counts.
I've just uploaded a branch here if you want to look at the
implementation of this I have right now (patches at the tip of the
branch, on top of the float16/int8 implementation):
https://github.com/Igalia/mesa/tree/itoral/VK_KHR_shader_float16_int8_1bit_bool
> Regardless of what we do, we'll need some new NIR instructions. I
> think that was more Topi's direct question. I think the easiest
> thing to do would be to make 16 or 64-bit versions of the comparison
> instructions we have today. We could make the binop_compare helper
> in nir_opcodes.py just generate versions of the opcodes at all the
> bit sizes and call it a day.
I added a bunch of opcodes in my branch, maybe that's enough? I guess
we could auto-generate some of those if we want. I am not sure if we
would need more stuff for GLSL, but for SPIR-V that seems to be all
that I needed going by the existing CTS tests.
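The auto-generation idea could look roughly like the sketch below. It is a standalone toy and does not use the real nir_opcodes.py helpers: the `opcodes` registry, the `binop_compare_sized` function, the sized names such as `flt16`, and the chosen list of bit sizes are all assumptions made for illustration.

```python
# Toy opcode registry; the real nir_opcodes.py has its own data structures.
opcodes = {}

def binop_compare_sized(name, src_type, bit_sizes=(8, 16, 32, 64)):
    """Register one comparison opcode per boolean destination size."""
    for bits in bit_sizes:
        sized_name = f"{name}{bits}"  # hypothetical naming scheme
        opcodes[sized_name] = {
            "dest_type": f"bool{bits}",
            "src_type": src_type,
        }

# A few of the existing NIR comparison opcodes, used here only as examples.
for name, src_type in (("flt", "float"), ("feq", "float"),
                       ("ilt", "int"), ("ieq", "int")):
    binop_compare_sized(name, src_type)

print(sorted(opcodes))
```

Generating the variants in a loop keeps the per-size opcodes consistent and avoids maintaining each one by hand.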
> In any case, there's my brain dump of ideas. I hope some of it is
> useful. I've tested none of it in practice. Have fun!
>
> --Jason
>