[Mesa-dev] [PATCH 5/8] i965: Implement FS backend for ARB_sample_shading

Thu Oct 24 22:56:56 CEST 2013

On Sat, Oct 19, 2013 at 3:05 PM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 14 October 2013 10:12, Anuj Phogat <anuj.phogat at gmail.com> wrote:
>>
>> Implement the FS backend for new builtins added by the extension:
>> in vec2 gl_SamplePosition
>> in int gl_SampleID
>> in int gl_NumSamples
>> out int gl_SampleMask[]
>
>
> There is a lot going on in this one patch, and it's getting hard to follow
> all the patch review that's going on.  If you wind up re-spinning this patch
> series, can you please break it into four separate patches, one to add
> support for each builtin?
>
Yes, I'll split it in to four patches.
>>
>>
>> Signed-off-by: Anuj Phogat <anuj.phogat at gmail.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.cpp         | 109
>> +++++++++++++++++++++++++++
>>  src/mesa/drivers/dri/i965/brw_fs.h           |   4 +
>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |  23 ++++++
>>  src/mesa/drivers/dri/i965/brw_wm.h           |   1 +
>>  4 files changed, 137 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index e5d6e4b..e4f7745 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -1115,6 +1115,109 @@
>> fs_visitor::emit_frontfacing_interpolation(ir_variable *ir)
>>     return reg;
>>  }
>>
>> +void
>> +fs_visitor::compute_sample_position(fs_reg dst, fs_reg int_sample_pos)
>> +{
>> +   int num_samples = ctx->DrawBuffer->Visual.samples;
>
>
> This isn't safe.  When compilation depends on a piece of GL state, you need
> to include the state in brw_wm_prog_key.  Otherwise the program won't get
> recompiled when the value changes.
>
> To avoid unnecessary recompiles, I'd recommend adding a boolean to
> brw_wm_prog_key which is:
> - true if ctx->Multisample.Enabled && num_samples != 0 && (shader accesses
> gl_SamplePosition)
> - false otherwise
>
yes, this make sense.
>>
>> +   assert(num_samples >= 0 && num_samples <= 8);
>> +
>> +   /* From arb_sample_shading specification:
>> +    * "When rendering to a non-multisample buffer, or if multisample
>> +    *  rasterization is disabled, gl_SamplePosition will always be
>> +    *  (0.5, 0.5).
>> +    */
>> +   if (!ctx->Multisample.Enabled || num_samples == 0) {
>> +      emit(BRW_OPCODE_MOV, dst, fs_reg(0.5f));
>
>
> It looks like you're using the old, more verbose style of emitting
> instructions.  Can we convert this (and the other instructions in this
> patch) to the more compact style:
>
> emit(MOV(dst, fs_reg(0.5f)));
>
ok.
>>
>> +   }
>> +   else {
>> +      /* For num_samples = {4, 8} */
>> +      emit(BRW_OPCODE_MOV, dst, int_sample_pos);
>> +      emit(BRW_OPCODE_MUL, dst, dst, fs_reg(1 / 16.0f));
>
>
> Like Ken, I was confused as to why we needed a separate MOV before the MUL.
> The assertion Ken recommended would have been a useful clue, but I'd prefer
> to just have explicit comments explaining why we need the MOV.  How about
> this:
>
> /* Convert int_sample_pos to floating point */
> emit(BRW_OPCODE_MOV, dst, int_sample_pos);
> /* Scale to the range [0, 1] */
> emit(BRW_OPCODE_MUL, dst, dst, fs_reg(1 / 16.0f));
>
>>
>> +   }
>> +}
>> +
>> +fs_reg *
>> +fs_visitor::emit_samplepos_interpolation(ir_variable *ir)
>> +{
>> +   assert(brw->gen >= 6);
>> +
>> +   this->current_annotation = "compute sample position";
>> +   fs_reg *reg = new(this->mem_ctx) fs_reg(this, ir->type);
>
>
> Since this code assigns to two consecutive registers, it relies on the fact
> that ir->type is vec2.  Just to make that explicit, can we add:
>
> assert(ir->type == glsl_type::vec2_type);
>
>>
>> +   fs_reg pos = *reg;
>> +   fs_reg int_sample_x = fs_reg(this, glsl_type::int_type);
>> +   fs_reg int_sample_y = fs_reg(this, glsl_type::int_type);
>> +
>> +   /* WM will be run in MSDISPMODE_PERSAMPLE. So, only SIMD8 mode will be
>> +    * enabled. The X, Y sample positions come in as bytes in  thread
>> payload.
>> +    * Sample IDs and sample positions remain same for all four slots in a
>> +    * subspan. So, read the positions using vstride=2, width=4,
>> hstride=0.
>
>
> I have similar concerns to Ken about why MSDISPMODE_PERSAMPLE implies that
> only SIMD8 mode will be enabled.  I'm assuming the two of you have
> adequately resolved that.
>
I was facing few problems with enabling 'SIMD16 only' dispatch. So, we agreed
to enable 'SIMD8 only' dispatch in this series. I'll do SIMD16 in a followup
patch. I'll update this comment with correct information.
>>
>> +    */
>> +   emit(BRW_OPCODE_AND, int_sample_x,
>> +        fs_reg(stride(retype(brw_vec1_grf(c->sample_pos_reg, 0),
>> +                             BRW_REGISTER_TYPE_D), 2, 4, 0)),
>> +        fs_reg(brw_imm_d(0xff)));
>
>
> If I understand correctly, this is creating the instruction
>
> AND(8) int_sample_x<1>:d sample_pos_reg<2;4,0>:d 0x000000ff:d
>
> I think this works, but it would be more straightforward to do this, which
> just selects the X coordinate bytes from the sample_pos_reg register:
>
> MOV(8) int_sample_x<1>:d sample_pos_reg<16;8,2>:b
>
> That would have the advantage that it doesn't rely on the fact that sample
> IDs and sample positions are the same for all four slots in a subspan.
>
> (Note: there are some weird restrictions on byte operations, and I can't
> remember all the details.  So you might need to try this in the simulator
> and tweak it to get it to work.)
>
yes, this looks better. I didn't think about using a byte operation. Thanks.
>>
>> +
>> +   /* Compute gl_SamplePosition.x */
>> +   compute_sample_position(pos, int_sample_x);
>> +   pos.reg_offset++;
>
>
> I think if we change this to:
>
> pos.reg_offset += dispatch_width / 8;
>
> then the code will work correctly for both SIMD8 and SIMD16 execution modes.
>
right.
>>
>> +
>> +   emit(BRW_OPCODE_SHR, int_sample_y,
>> +        fs_reg(stride(retype(brw_vec1_grf(c->sample_pos_reg, 0),
>> +                             BRW_REGISTER_TYPE_D), 2, 4, 0)),
>> +        fs_reg(8));
>> +   emit(BRW_OPCODE_AND, int_sample_y, int_sample_y,
>> fs_reg(brw_imm_d(0xff)));
>
>
> Similarly, these two instructions could be replaced with simply:
>
> MOV(8) int_sample_y<1>:d sample_pos_reg.1<16;8,2>:b
>
> Which just selects the Y coordintae bytes from the sample_pos_register.
>
>>
>> +
>> +   /* Compute gl_SamplePosition.y */
>> +   compute_sample_position(pos, int_sample_y);
>> +   return reg;
>> +}
>> +
>> +fs_reg *
>> +fs_visitor::emit_sampleid_interpolation(ir_variable *ir)
>> +{
>> +   assert(brw->gen >= 6);
>> +   bool multisampled_fbo = ctx->DrawBuffer->Visual.samples > 1;
>
>
> As I mentioned in my comments on compute_sample_position(), it's not safe to
> consult the context like this.  This information needs to go in the program
> key.
>
>>
>> +
>> +   this->current_annotation = "compute sample id";
>> +   fs_reg *reg = new(this->mem_ctx) fs_reg(this, ir->type);
>> +
>> +   if (multisampled_fbo && ctx->Multisample.Enabled) {
>
>
> This too.
>
>>
>> +      fs_reg t1 = fs_reg(this, glsl_type::int_type);
>> +      fs_reg t2 = fs_reg(this, glsl_type::int_type);
>> +      t2.type = BRW_REGISTER_TYPE_UW;
>> +
>> +      /* The WM will be run in MSDISPMODE_PERSAMPLE with num_samples = 8.
>> +       * Therefore, subspan 0 will represent sample N (where N is 0, 2, 4
>> +       * or 6), subspan 1 will represent sample 1, 3, 5 or 7.  We can
>> find
>> +       * the value of N by looking at R0.0 bits 7:6 ("Starting Sample
>> Pair
>> +       * Index (SSPI)") and multiplying by two (since samples are always
>> +       * delivered in pairs). That is, we compute 2*((R0.0 & 0xc0) >> 6)
>> +       * == (R0.0 & 0xc0) >> 5.
>> +       *
>> +       * Then we need to add N to the sequence (0, 0, 0, 0, 1, 1, 1, 1),
>> +       * which we compute by populating a temporary variable with the
>> +       * sequence (0, 1), and then reading from it using vstride=1,
>> +       * width=4, hstride=0.
>> +       * Same holds true for num_samples = 4.
>> +       */
>> +      emit(BRW_OPCODE_AND, t1,
>> +           fs_reg(retype(brw_vec1_grf(0, 0), BRW_REGISTER_TYPE_D)),
>> +           fs_reg(brw_imm_d(0xc0)));
>> +      emit(BRW_OPCODE_SHR, t1, t1, fs_reg(5));
>> +      emit(BRW_OPCODE_MOV, t2, fs_reg(brw_imm_v(0x11110000)));
>>
>> +      emit(BRW_OPCODE_ADD, *reg, t1, t2);
>
>
> I believe this works, but it's not what the comment says you're doing.  It
> looks like you borrowed the comment from blorp, but I think that what you've
> actually done is actually a lot simpler (and better) than what blorp does.
> Rather than populate a register with (0, 1) and then reading from it using
> vstride=1, width=4, hstride=0, you're just populating the register with (0,
> 0, 0, 0, 1, 1, 1, 1) directly.  Pleas update the comment accordingly.
>
Yes, last few lines are not applicable here. I'll update the comment.

> For SIMD16 mode, we'll need to do something slightly more complicated, since
> we'll need t2 (which will be a pair of registers in SIMD16 mode) to contain
> (0, 0, 0, 0, 1, 1, 1, 1) in the first register and (2, 2, 2, 2, 3, 3, 3, 3)
> in the second register.  I *think* you can do this with something like:
>
> push_force_uncompressed();
> emit(MOV(t2, fs_reg(brw_imm_v(0x11110000))));
> push_force_sechalf();
> t2.reg_offset++;
> emit(MOV(t2, fs_reg(brw_imm_v(0x33332222))));
> pop_force_sechalf();
> pop_force_uncompressed();
>
> But please double check in the simulator to make sure it does the right
> thing :)
>
This won't work as it is because t2 is a UW register and it's also a requirement
when working with immediate vectors. It looks like what you've done in blorp
is the right thing to do here. i.e. populate a register with (0, 1, 2, 3) and
then reading from it using vstride=1, width=4, hstride=0. I think I've to create
a special FS backend instruction to adjust the stride to (1, 4, 0).
>>
>> +   }
>> +   else {
>> +      /* As per GL_ARB_sample_shading specification:
>> +       * "When rendering to a non-multisample buffer, or if multisample
>> +       *  rasterization is disabled, gl_SampleID will always be zero."
>> +       */
>> +      emit(BRW_OPCODE_MOV, *reg, fs_reg(0));
>> +   }
>> +
>> +   return reg;
>> +}
>> +
>>  fs_reg
>>  fs_visitor::fix_math_operand(fs_reg src)
>>  {
>> @@ -2966,7 +3069,13 @@ fs_visitor::setup_payload_gen6()
>>           c->nr_payload_regs++;
>>        }
>>     }
>> +
>>     /* R31: MSAA position offsets. */
>> +   if (fp->Base.InputsRead & VARYING_BIT_SAMPLE_POS) {
>> +      c->sample_pos_reg = c->nr_payload_regs;
>> +      c->nr_payload_regs++;
>> +   }
>> +
>>     /* R32-: bary for 32-pixel. */
>>     /* R58-59: interp W for 32-pixel. */
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h
>> b/src/mesa/drivers/dri/i965/brw_fs.h
>> index c78f9ae..812e9c7 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -332,9 +332,12 @@ public:
>>                           glsl_interp_qualifier interpolation_mode,
>>                           bool is_centroid);
>>     fs_reg *emit_frontfacing_interpolation(ir_variable *ir);
>> +   fs_reg *emit_samplepos_interpolation(ir_variable *ir);
>> +   fs_reg *emit_sampleid_interpolation(ir_variable *ir);
>>     fs_reg *emit_general_interpolation(ir_variable *ir);
>>     void emit_interpolation_setup_gen4();
>>     void emit_interpolation_setup_gen6();
>> +   void compute_sample_position(fs_reg dst, fs_reg int_sample_pos);
>>     fs_reg rescale_texcoord(ir_texture *ir, fs_reg coordinate,
>>                             bool is_rect, int sampler, int texunit);
>>     fs_inst *emit_texture_gen4(ir_texture *ir, fs_reg dst, fs_reg
>> coordinate,
>> @@ -432,6 +435,7 @@ public:
>>
>>     struct hash_table *variable_ht;
>>     fs_reg frag_depth;
>> +   fs_reg sample_mask;
>>     fs_reg outputs[BRW_MAX_DRAW_BUFFERS];
>>     unsigned output_components[BRW_MAX_DRAW_BUFFERS];
>>     fs_reg dual_src_output;
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> index e659203..75ced1d 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> @@ -61,9 +61,14 @@ fs_visitor::visit(ir_variable *ir)
>>          reg = emit_fragcoord_interpolation(ir);
>>        } else if (!strcmp(ir->name, "gl_FrontFacing")) {
>>          reg = emit_frontfacing_interpolation(ir);
>> +      } else if (!strcmp(ir->name, "gl_SamplePosition")) {
>> +        reg = emit_samplepos_interpolation(ir);
>> +      } else if (!strcmp(ir->name, "gl_SampleID")) {
>> +        reg = emit_sampleid_interpolation(ir);
>>        } else {
>>          reg = emit_general_interpolation(ir);
>>        }
>> +
>>        assert(reg);
>>        hash_table_insert(this->variable_ht, reg, ir);
>>        return;
>> @@ -82,6 +87,8 @@ fs_visitor::visit(ir_variable *ir)
>>          }
>>        } else if (ir->location == FRAG_RESULT_DEPTH) {
>>          this->frag_depth = *reg;
>> +      } else if (ir->location == FRAG_RESULT_SAMPLE_MASK) {
>> +         this->sample_mask = *reg;
>>        } else {
>>          /* gl_FragData or a user-defined FS output */
>>          assert(ir->location >= FRAG_RESULT_DATA0 &&
>> @@ -2502,6 +2509,22 @@ fs_visitor::emit_fb_writes()
>>        pop_force_uncompressed();
>>     }
>>
>> +   if (fp->Base.OutputsWritten & BITFIELD64_BIT(FRAG_RESULT_SAMPLE_MASK))
>> {
>> +      this->current_annotation = "FB write oMask";
>> +      assert(this->sample_mask.file != BAD_FILE);
>> +      fs_reg reg = fs_reg(this, glsl_type::int_type);
>> +
>> +      /* Hand over gl_SampleMask. Only lower 16 bits are relevant. */
>> +      emit(AND(this->sample_mask, this->sample_mask, fs_reg(255)));
>> +      emit(SHL(reg, this->sample_mask, fs_reg(16)));
>> +
>> +      this->sample_mask.type = BRW_REGISTER_TYPE_UW;
>>
>> +      reg.type = BRW_REGISTER_TYPE_UW;
>> +      emit(OR(fs_reg(MRF, nr, BRW_REGISTER_TYPE_UW), this->sample_mask,
>> reg));
>
>
> I agree with Ken that this is going to do the wrong thing.  You are filling
> up all the data in the oMask with values, but you're not filling it with the
> right values.  You're delivering:
>
> sample_mask slot 0 -> oMask slot 0 and 1
> sample_mask slot 1 -> oMask slot 2 and 3
> sample_mask slot 2 -> oMask slot 4 and 5
> sample_mask slot 3 -> oMask slot 6 and 7
> and sample_mask slots 4-7 are getting delivered to oMask slots 8-15, which
> have no effect in SIMD8 mode.
>
> That's definitely not what we want.
>
Yes, I made a false assumption that all the slots will have a same oMask
value. But someone may want to use per fragment sample mask and that's
the purpose of putting it in a fragment shader.

> I believe that what we need can actually be done in a single instruction.
> For SIMD8:
>
> MOV(8) oMask<1>:uw sample_mask<16;8,2>:uw
>
> and for SIMD16:
>
> MOV(16) oMask<1>:uw sample_mask<16;8,2>:uw
>
> To get that to happen you'll need to create a new FS backend opcode, so that
> you can adjust the stride on sample_mask.
>
Thanks for explaining. I needed exactly this information.
>>
>> +
>> +      nr += 1;
>> +   }
>> +
>>     /* Reserve space for color. It'll be filled in per MRT below. */
>>     int color_mrf = nr;
>>     nr += 4 * reg_width;
>> diff --git a/src/mesa/drivers/dri/i965/brw_wm.h
>> b/src/mesa/drivers/dri/i965/brw_wm.h
>> index aa786de..fda74ab 100644
>> --- a/src/mesa/drivers/dri/i965/brw_wm.h
>> +++ b/src/mesa/drivers/dri/i965/brw_wm.h
>> @@ -83,6 +83,7 @@ struct brw_wm_compile {
>>     uint8_t source_w_reg;
>>     uint8_t aa_dest_stencil_reg;
>>     uint8_t dest_depth_reg;
>> +   uint8_t sample_pos_reg;
>>     uint8_t barycentric_coord_reg[BRW_WM_BARYCENTRIC_INTERP_MODE_COUNT];
>>     uint8_t nr_payload_regs;
>>     GLuint source_depth_to_render_target:1;
>> --
>> 1.8.1.4
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
>