[Mesa-dev] [PATCH 2/6] glsl: Implement any(v) as any_nequal(v, false).

Tue Dec 1 11:28:20 PST 2015

On Tue, Dec 1, 2015 at 2:13 PM, Ian Romanick <idr at freedesktop.org> wrote:
> On 11/30/2015 04:41 PM, Ilia Mirkin wrote:
>> On Mon, Nov 30, 2015 at 7:15 PM, Matt Turner <mattst88 at gmail.com> wrote:
>>> On Mon, Nov 30, 2015 at 3:51 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>> On Mon, Nov 30, 2015 at 6:38 PM, Matt Turner <mattst88 at gmail.com> wrote:
>>>>> On Mon, Nov 30, 2015 at 3:34 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>>>> On Mon, Nov 30, 2015 at 6:32 PM, Matt Turner <mattst88 at gmail.com> wrote:
>>>>>>> ---
>>>>>>>  src/glsl/builtin_functions.cpp | 15 ++++++++++++++-
>>>>>>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/src/glsl/builtin_functions.cpp b/src/glsl/builtin_functions.cpp
>>>>>>> index 88f9a71..2f21ffc 100644
>>>>>>> --- a/src/glsl/builtin_functions.cpp
>>>>>>> +++ b/src/glsl/builtin_functions.cpp
>>>>>>> @@ -538,6 +538,7 @@ private:
>>>>>>>     ir_variable *in_var(const glsl_type *type, const char *name);
>>>>>>>     ir_variable *out_var(const glsl_type *type, const char *name);
>>>>>>>     ir_constant *imm(float f, unsigned vector_elements=1);
>>>>>>> +   ir_constant *imm(bool b, unsigned vector_elements=1);
>>>>>>>     ir_constant *imm(int i, unsigned vector_elements=1);
>>>>>>>     ir_constant *imm(unsigned u, unsigned vector_elements=1);
>>>>>>>     ir_constant *imm(double d, unsigned vector_elements=1);
>>>>>>> @@ -3000,6 +3001,12 @@ builtin_builder::out_var(const glsl_type *type, const char *name)
>>>>>>>  }
>>>>>>>
>>>>>>>  ir_constant *
>>>>>>> +builtin_builder::imm(bool b, unsigned vector_elements)
>>>>>>> +{
>>>>>>> +   return new(mem_ctx) ir_constant(b, vector_elements);
>>>>>>> +}
>>>>>>> +
>>>>>>> +ir_constant *
>>>>>>>  builtin_builder::imm(float f, unsigned vector_elements)
>>>>>>>  {
>>>>>>>     return new(mem_ctx) ir_constant(f, vector_elements);
>>>>>>> @@ -4358,7 +4365,13 @@ builtin_builder::_notEqual(builtin_available_predicate avail,
>>>>>>>  ir_function_signature *
>>>>>>>  builtin_builder::_any(const glsl_type *type)
>>>>>>>  {
>>>>>>> -   return unop(always_available, ir_unop_any, glsl_type::bool_type, type);
>>>>>>> +   ir_variable *v = in_var(type, "v");
>>>>>>> +   MAKE_SIG(glsl_type::bool_type, always_available, 1, v);
>>>>>>> +
>>>>>>> +   const unsigned vec_elem = v->type->vector_elements;
>>>>>>> +   body.emit(ret(expr(ir_binop_any_nequal, v, imm(false, vec_elem))));
>>>>>>
>>>>>> This results in an extra comparison generated by the st_glsl_to_tgsi
>>>>>> code, which can be annoying to get rid of... Why do you need to do
>>>>>> this?
>>>>>
>>>>> Couldn't we fix the st_glsl_to_tgsi to handle that better? I wouldn't
>>>>> be surprised to find such a thing in the wild.
>>>>
>>>> I guess it could special-case the exact IR generated there? The thing
>>>> is that it just OR's everything together (when you have native integer
>>>> support... which everything does outside of r300/nv30), which you can
>>>> only do in a few select cases. If you'd like to implement that, I
>>>> wouldn't object. As-is, it's a pessimization though.
>>>
>>> I'm not familiar with any of this code and I don't know how to test or
>>> debug it -- sorry.
>>>
>>> It should be trivial to optimize if someone cares. I believe all you
>>> need to do is skip emitting the SNE and instead copy src[0] to temp
>>> under these circumstances:
>>>
>>>    if (ir->operands[0]->type == glsl_type::bool_type &&
>>>        ir->operands[1]->as_constant() &&
>>>        ir->operands[1]->as_constant()->is_zero())
>>
>> As long as you can build the gallium swrast (and enable debug), you
>> can add ST_DEBUG=tgsi to get the generated TGSI printed out.
>
> Does the generated TGSI really matter?  We don't generally care if a
> change causes more NIR instructions to be generated, for example...
> unless it's a massive amount more.

Agreed.

>
> The real question is what actually gets executed on the GPUs.  If we
> produce more intermediate instructions, but the backends still generate
> the same (or perhaps even better) code, it should be okay.  Does any GPU
> actually emit worse code due to this change or the related changes later
> in the series?

I full agree with your comments. If there's a sampler shader, I can
easily test what nouveau does with it... I was going based on how I
thought it would work though based on my familiarity with the various
optimizations. I could have gotten it wrong though. The problem is
that the boolean-ness of the values gets lost, and these optimizations
become less safe. Or less detected.

I can guarantee that it'll generate an extra instruction nv30 since
that doesn't have a compiler, so it'll go from just "DP4" to "SNE;
DP4". But I don't know if I care *that* much.

So this is the test I came up with:

FRAG
DCL IN[0], GENERIC[0], PERSPECTIVE
DCL OUT[0], COLOR
DCL TEMP[0..4], LOCAL
IMM[0] UINT32 { 0, 0, 0, 0 }
0: MOV TEMP[0], IN[0]
1: USNE TEMP[0], TEMP[0], IMM[0]
2: OR TEMP[0].x, TEMP[0].xxxx, TEMP[0].yyyy
3: OR TEMP[0].x, TEMP[0].xxxx, TEMP[0].zzzz
4: OR TEMP[0].x, TEMP[0].xxxx, TEMP[0].wwww
5: MOV OUT[0], TEMP[0].xxxx
9: END

So let's see what happens on nv50 (targeted at a G84 to avoid
potential G80-related idiocy... don't think there would be any here,
but who knows):

NV50_PROG_DEBUG=1 ./src/gallium/drivers/nouveau/nouveau_compiler -a 84 slt.tgsi

which ends up in:

EMIT: mov u32 $r4 0x00000000 (8)
EMIT: set u32 $r1 ne $r1 $r4 (8)
EMIT: set u32 $r2 ne $r2 $r4 (8)
EMIT: set u32 $r3 ne $r3 $r4 (8)
EMIT: set u32 $r0 ne $r0 $r4 (8)
EMIT: or u32 $r1 $r1 $r2 (8)
EMIT: or u32 $r1 $r1 $r3 (8)
EMIT: or u32 $r3 $r1 $r0 (8)

Naturally, those set's all go away if the USNE does. With -a c0 (i.e.
fermi, which will be ~identical for everything later as well), we get:

  6: set u8 $p0 ne u32 $r1 $r63 (8)
  7: set or u8 $p0 ne u32 $r2 $r63 $p0 (8)
  8: set or u8 $p0 ne u32 $r3 $r63 $p0 (8)
  9: set or u32 $r3 ne $r0 $r63 $p0 (8)

[$r63 is always 0]

And without the USNE it just has the or's:

  6: or u32 $r1 $r1 $r2 (8)
  7: or u32 $r1 $r1 $r3 (8)
  8: or u32 $r3 $r1 $r0 (8)

[I'm ignoring the unrelated bits, like moving stuff around,
interpolation, etc]. If nothing else, this is one less instruction.

Basically the issue is that we don't have the information that the
input is a *clean* bool, like we would if there were type info, and so
we must "clean" it first. We could try to see if the input comes from
another set... and we sometimes do, but for a limited quantity of
scenarios, and I don't think this is one of them.

  -ilia