[Mesa-dev] [PATCH 07/10] i965: Replace brw_wm_* with dumping code into the fs_visitor.

Thu Sep 27 23:41:49 PDT 2012

On 09/27/2012 05:27 PM, Eric Anholt wrote:
> Kenneth Graunke <kenneth at whitecape.org> writes:
> 
>> On 09/22/2012 02:04 PM, Eric Anholt wrote:
>>> This makes a giant pile of code newly dead.  It also fixes TXB on newer
>>> chipsets, which has been totally broken (I now have a piglit test for that).
>>> It passes the same set of Ian's ARB_fragment_program tests.  It also improves
>>> high-settings ETQW performance by 3.2 +/- 1.9% (n=3), thanks to better
>>> optimization and having 8-wide along with 16-wide shaders.
>>> ---
>>>  src/mesa/drivers/dri/i965/Makefile.sources   |    1 +
>>>  src/mesa/drivers/dri/i965/brw_fs.cpp         |   36 +-
>>>  src/mesa/drivers/dri/i965/brw_fs.h           |   30 +-
>>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp    |   22 +-
>>>  src/mesa/drivers/dri/i965/brw_fs_fp.cpp      |  781 ++++++++++++++++++++++++++
>>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |    3 +-
>>>  src/mesa/drivers/dri/i965/brw_wm.c           |   58 +-
>>>  src/mesa/drivers/dri/i965/brw_wm_state.c     |   19 +-
>>>  src/mesa/drivers/dri/i965/gen6_wm_state.c    |    8 +-
>>>  src/mesa/drivers/dri/i965/gen7_wm_state.c    |    8 +-
>>>  10 files changed, 857 insertions(+), 109 deletions(-)
>>>  create mode 100644 src/mesa/drivers/dri/i965/brw_fs_fp.cpp
>>
>> I think the LIT code may be broken (comments inline), and one comment is
>> wrong.  Assuming you fix (or refute) those, then patches 1-8 are:
>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>>
>> I haven't read through 9 and 10 yet, but I plan to soon.
> 
>>> +void
>>> +fs_visitor::emit_fragment_program_code()
>>> +{
>>> +   setup_fp_regs();
>>> +
>>> +   fs_reg null = fs_reg(brw_null_reg());
>>> +
>>> +   /* Keep a reg with 0.0 around, for reuse use by emit_sop so that it can
>>
>> "Keep a reg with 1.0 around, for reuse by emit_fp_sop"
>>                  ^^^ (not 0.0)                 ^^ (function name)
> 
> fixed
> 
>>> +      case OPCODE_DP2:
>>> +      case OPCODE_DP3:
>>> +      case OPCODE_DP4:
>>> +      case OPCODE_DPH: {
>>> +         fs_reg mul = fs_reg(this, glsl_type::float_type);
>>> +         fs_reg acc = fs_reg(this, glsl_type::float_type);
>>> +         int count;
>>> +
>>> +         switch (fpi->Opcode) {
>>> +         case OPCODE_DP2: count = 2; break;
>>> +         case OPCODE_DP3: count = 3; break;
>>> +         case OPCODE_DP4: count = 4; break;
>>> +         case OPCODE_DPH: count = 3; break;
>>> +         default: assert(!"not reached"); count = 0; break;
>>> +         }
>>> +
>>> +         emit(BRW_OPCODE_MUL, acc,
>>> +              regoffset(src[0], 0), regoffset(src[1], 0));
>>> +         for (int i = 1; i < count; i++) {
>>> +            emit(BRW_OPCODE_MUL, mul,
>>> +                 regoffset(src[0], i), regoffset(src[1], i));
>>> +            emit(BRW_OPCODE_ADD, acc, acc, mul);
>>> +         }
>>
>> Future optimization: MAD would be nice here, but that can be done later.
> 
> Yeah, or even MACs.  This is a codegen quality regression from
> brw_wm_*.c.
> 
>>> +      case OPCODE_LIT:
>>> +         /* From the ARB_fragment_program spec:
>>> +          *
>>> +          *      tmp = VectorLoad(op0);
>>> +          *      if (tmp.x < 0) tmp.x = 0;
>>> +          *      if (tmp.y < 0) tmp.y = 0;
>>> +          *      if (tmp.w < -(128.0-epsilon)) tmp.w = -(128.0-epsilon);
>>> +          *      else if (tmp.w > 128-epsilon) tmp.w = 128-epsilon;
>>> +          *      result.x = 1.0;
>>> +          *      result.y = tmp.x;
>>> +          *      result.z = (tmp.x > 0) ? RoughApproxPower(tmp.y, tmp.w) : 0.0;
>>> +          *      result.w = 1.0;
>>> +          */
>>> +         if (fpi->DstReg.WriteMask & WRITEMASK_X)
>>> +            emit(BRW_OPCODE_MOV, regoffset(dst, 0), fs_reg(1.0f));
>>> +
>>> +         if (fpi->DstReg.WriteMask & WRITEMASK_YZ) {
>>> +            fs_inst *inst;
>>> +            inst = emit(BRW_OPCODE_CMP, null,
>>> +                        regoffset(src[0], 0), fs_reg(0.0f));
>>> +            inst->conditional_mod = BRW_CONDITIONAL_LE;
>>> +
>>> +            if (fpi->DstReg.WriteMask & WRITEMASK_Y) {
>>> +               emit(BRW_OPCODE_MOV, regoffset(dst, 1), regoffset(src[0], 0));
>>> +               inst = emit(BRW_OPCODE_MOV, regoffset(dst, 1), fs_reg(0.0f));
>>> +               inst->predicated = true;
>>> +            }
>>> +
>>> +            if (fpi->DstReg.WriteMask & WRITEMASK_Z) {
>>> +               emit_math(SHADER_OPCODE_POW, regoffset(dst, 2),
>>> +                         regoffset(src[0], 1), regoffset(src[0], 3));
>>> +
>>> +               inst = emit(BRW_OPCODE_MOV, regoffset(dst, 2), fs_reg(0.0f));
>>> +               inst->predicated = true;
>>
>> This looks broken...don't you need to handle clamping to (-128, 128)?
> 
> Ah, I lifted this code directly from the former backend.  I'll put in a
> note that I didn't change behavior.
> 
> I don't know of any use for the -128,128 clamping.  I think it's just a
> spec artifact from this extension being written for a particular
> hardware instruction set.  We should probably fix it for thoroughness,
> but I don't expect app behavior to change.  Do you want to see that done
> in this series?

Ah...you're right, the old backend didn't do it either.  It's definitely
best to match the old behavior in this patch since the goal is to
refactor it, not fix things.  I'd say for now, at least add a note.
It'd be good to fix it, but since it's the same and things appear to be
working just fine, I don't think it should block this series.

>>> +      case OPCODE_SCS:
>>> +         if (fpi->DstReg.WriteMask & WRITEMASK_X) {
>>> +            emit_math(SHADER_OPCODE_COS, regoffset(dst, 0),
>>> +                      regoffset(src[0], 0));
>>> +         }
>>> +
>>> +         if (fpi->DstReg.WriteMask & WRITEMASK_Y) {
>>> +            emit_math(SHADER_OPCODE_SIN, regoffset(dst, 1),
>>> +                      regoffset(src[0], 1));
>>> +         }
>>> +         break;
>>
>> Future optimization: we could use the actual SINCOS math instruction
>> when asking for WRITEMASK_XY.  But I don't know how common that is.
> 
> I haven't seen one yet.  If we did, given that GLSL doesn't have a
> sincos, we probably would want to do it as a peephole optimization.

Well, yeah, that would be nice for GLSL, but it'd also be entirely
trivial to generate the right code here :)  Doesn't need to be done now
though (especially since there's some other code that would need to be
updated for SINCOS).

> (I have seen sincos-applicable shaders on the vertex side, though.  Not
> well-written ones, just examples)