[Mesa-dev] [PATCH 5/8] i965: Implement FS backend for ARB_sample_shading

Fri Oct 18 19:30:40 CEST 2013

On Tue, Oct 15, 2013 at 5:42 PM, Anuj Phogat <anuj.phogat at gmail.com> wrote:
> On Tue, Oct 15, 2013 at 1:49 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
>> On 10/14/2013 10:12 AM, Anuj Phogat wrote:
>>> Implement the FS backend for new builtins added by the extension:
>>> in vec2 gl_SamplePosition
>>> in int gl_SampleID
>>> in int gl_NumSamples
>>> out int gl_SampleMask[]
>>>
>>> Signed-off-by: Anuj Phogat <anuj.phogat at gmail.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_fs.cpp         | 109 +++++++++++++++++++++++++++
>>>  src/mesa/drivers/dri/i965/brw_fs.h           |   4 +
>>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |  23 ++++++
>>>  src/mesa/drivers/dri/i965/brw_wm.h           |   1 +
>>>  4 files changed, 137 insertions(+)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> index e5d6e4b..e4f7745 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> @@ -1115,6 +1115,109 @@ fs_visitor::emit_frontfacing_interpolation(ir_variable *ir)
>>>     return reg;
>>>  }
>>>
>>> +void
>>> +fs_visitor::compute_sample_position(fs_reg dst, fs_reg int_sample_pos)
>>> +{
>>> +   int num_samples = ctx->DrawBuffer->Visual.samples;
>>> +   assert(num_samples >= 0 && num_samples <= 8);
>>> +
>>> +   /* From arb_sample_shading specification:
>>
>> "From the ARB_sample_shading specification:"
>>
>> (when I search for extension names in code, I capitalize ARB/EXT/NV, so
>> this would make a case-sensitive search work)
>>
>>> +    * "When rendering to a non-multisample buffer, or if multisample
>>> +    *  rasterization is disabled, gl_SamplePosition will always be
>>> +    *  (0.5, 0.5).
>>> +    */
>>> +   if (!ctx->Multisample.Enabled || num_samples == 0) {
>>> +      emit(BRW_OPCODE_MOV, dst, fs_reg(0.5f));
>>> +   }
>>> +   else {
>>
>> } else {
>>
>>> +      /* For num_samples = {4, 8} */
>>> +      emit(BRW_OPCODE_MOV, dst, int_sample_pos);
>>> +      emit(BRW_OPCODE_MUL, dst, dst, fs_reg(1 / 16.0f));
>>
>> Ah, I see...the MOV here converts the sample position from <D> to <F>,
>> which is necessary for the next multiply.
>>
>> Adding an assertion would probably make this a little bit more obvious:
>> assert(dst.type == BRW_REGISTER_TYPE_F);
>>
> right.
>>> +   }
>>> +}
>>> +
>>> +fs_reg *
>>> +fs_visitor::emit_samplepos_interpolation(ir_variable *ir)
>>> +{
>>> +   assert(brw->gen >= 6);
>>> +
>>> +   this->current_annotation = "compute sample position";
>>> +   fs_reg *reg = new(this->mem_ctx) fs_reg(this, ir->type);
>>> +   fs_reg pos = *reg;
>>> +   fs_reg int_sample_x = fs_reg(this, glsl_type::int_type);
>>> +   fs_reg int_sample_y = fs_reg(this, glsl_type::int_type);
>>> +
>>> +   /* WM will be run in MSDISPMODE_PERSAMPLE. So, only SIMD8 mode will be
>>
>> Looking at the docs for 3DSTATE_PS, I believe that SIMD16 is allowed in
>> some cases...
>>
>> SNB: Must be SIMD8 only.
>> IVB:
>>  - 4x MSAA: Either SIMD8 or SIMD16 should work.
>>  - 8x MSAA: Must be SIMD8 only.
>>
>> Maybe I'm crazy, though...I haven't looked into it too closely...
>>
> Yes, you're right. I should have explained it here that I've chosen
> to use 'SIMD8 only' for 4x and 8x MSAA on both SNB and IVB.
> I decided to use SIMD8 over SIMD16 in case of 4x MSAA after
> reading following lines in IVB PRM vol 2, part 1, page 330:
> "Object Size: If the number of very small objects (e.g., covering 2
>  subspans or fewer) is  expected to comprise a significant portion
>  of the workload, supporting the 8-pixel dispatch  mode may be
>  advantageous Otherwise there could be a large number of 16-pixel
>  dispatches  with only 1 or 2 valid subspans, resulting in low
>  efficiency for those threads"
>
> Do you have other reasons to suggest using SIMD16 for 4x MSAA
> on IVB?
>>> +    * enabled. The X, Y sample positions come in as bytes in  thread payload.
>>> +    * Sample IDs and sample positions remain same for all four slots in a
>>
>> It would be great to have a documentation reference here...
>>
> I'll add the reference.
>>> +    * subspan. So, read the positions using vstride=2, width=4, hstride=0.
>>> +    */
>>> +   emit(BRW_OPCODE_AND, int_sample_x,
>>> +        fs_reg(stride(retype(brw_vec1_grf(c->sample_pos_reg, 0),
>>> +                             BRW_REGISTER_TYPE_D), 2, 4, 0)),
>>> +        fs_reg(brw_imm_d(0xff)));
>>> +
>>> +   /* Compute gl_SamplePosition.x */
>>> +   compute_sample_position(pos, int_sample_x);
>>> +   pos.reg_offset++;
>>> +
>>> +   emit(BRW_OPCODE_SHR, int_sample_y,
>>> +        fs_reg(stride(retype(brw_vec1_grf(c->sample_pos_reg, 0),
>>> +                             BRW_REGISTER_TYPE_D), 2, 4, 0)),
>>> +        fs_reg(8));
>>> +   emit(BRW_OPCODE_AND, int_sample_y, int_sample_y, fs_reg(brw_imm_d(0xff)));
>>> +
>>> +   /* Compute gl_SamplePosition.y */
>>> +   compute_sample_position(pos, int_sample_y);
>>> +   return reg;
>>> +}
>>> +
>>> +fs_reg *
>>> +fs_visitor::emit_sampleid_interpolation(ir_variable *ir)
>>
>> This...isn't exactly interpolation.  Maybe call it emit_sampleid_setup?
>>
> ok. I'll also change the name of emit_samplepos_interpolation() function.
>>> +{
>>> +   assert(brw->gen >= 6);
>>> +   bool multisampled_fbo = ctx->DrawBuffer->Visual.samples > 1;
>>> +
>>> +   this->current_annotation = "compute sample id";
>>> +   fs_reg *reg = new(this->mem_ctx) fs_reg(this, ir->type);
>>> +
>>> +   if (multisampled_fbo && ctx->Multisample.Enabled) {
>>> +      fs_reg t1 = fs_reg(this, glsl_type::int_type);
>>> +      fs_reg t2 = fs_reg(this, glsl_type::int_type);
>>> +      t2.type = BRW_REGISTER_TYPE_UW;
>>> +
>>> +      /* The WM will be run in MSDISPMODE_PERSAMPLE with num_samples = 8.
>>
>> With num_samples = 8?
>>
>> Also, could you please say "The PS will be run in MSDISPMODE_PERSAMPLE
>> mode"?  On Gen7+, the docs begin referring to the Pixel Shader
>> separately from the Windowizer/Masker, and technically, this is PS state
>> and the PS thread payload.  Minor nit :)
>>
> Language of the sentence is not clear. I'll change it to:
> "The PS will be run in MSDISPMODE_PERSAMPLE. With 8x multisampling
> ...."
>
>>> +       * Therefore, subspan 0 will represent sample N (where N is 0, 2, 4
>>> +       * or 6), subspan 1 will represent sample 1, 3, 5 or 7.  We can find
>>> +       * the value of N by looking at R0.0 bits 7:6 ("Starting Sample Pair
>>> +       * Index (SSPI)") and multiplying by two (since samples are always
>>> +       * delivered in pairs). That is, we compute 2*((R0.0 & 0xc0) >> 6)
>>> +       * == (R0.0 & 0xc0) >> 5.
>>> +       *
>>> +       * Then we need to add N to the sequence (0, 0, 0, 0, 1, 1, 1, 1),
>>> +       * which we compute by populating a temporary variable with the
>>> +       * sequence (0, 1), and then reading from it using vstride=1,
>>> +       * width=4, hstride=0.
>>> +       * Same holds true for num_samples = 4.
>>
>> It sounds like num_samples could actually be 4, so you might want to
>> adjust your comment...
>>
> I'll change it to:
> " These computations hold good for 4x multisampling as well."
>>> +       */
>>> +      emit(BRW_OPCODE_AND, t1,
>>> +           fs_reg(retype(brw_vec1_grf(0, 0), BRW_REGISTER_TYPE_D)),
>>> +           fs_reg(brw_imm_d(0xc0)));
>>> +      emit(BRW_OPCODE_SHR, t1, t1, fs_reg(5));
>>> +      emit(BRW_OPCODE_MOV, t2, fs_reg(brw_imm_v(0x11110000)));
>>> +      emit(BRW_OPCODE_ADD, *reg, t1, t2);
>>
>> This looks right to me.
>>
>>> +   }
>>> +   else {
>>
>> } else {
>>
>>> +      /* As per GL_ARB_sample_shading specification:
>>> +       * "When rendering to a non-multisample buffer, or if multisample
>>> +       *  rasterization is disabled, gl_SampleID will always be zero."
>>> +       */
>>> +      emit(BRW_OPCODE_MOV, *reg, fs_reg(0));
>>> +   }
>>> +
>>> +   return reg;
>>> +}
>>> +
>>>  fs_reg
>>>  fs_visitor::fix_math_operand(fs_reg src)
>>>  {
>>> @@ -2966,7 +3069,13 @@ fs_visitor::setup_payload_gen6()
>>>           c->nr_payload_regs++;
>>>        }
>>>     }
>>> +
>>>     /* R31: MSAA position offsets. */
>>> +   if (fp->Base.InputsRead & VARYING_BIT_SAMPLE_POS) {
>>> +      c->sample_pos_reg = c->nr_payload_regs;
>>> +      c->nr_payload_regs++;
>>> +   }
>>> +
>>>     /* R32-: bary for 32-pixel. */
>>>     /* R58-59: interp W for 32-pixel. */
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>>> index c78f9ae..812e9c7 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>>> @@ -332,9 +332,12 @@ public:
>>>                           glsl_interp_qualifier interpolation_mode,
>>>                           bool is_centroid);
>>>     fs_reg *emit_frontfacing_interpolation(ir_variable *ir);
>>> +   fs_reg *emit_samplepos_interpolation(ir_variable *ir);
>>> +   fs_reg *emit_sampleid_interpolation(ir_variable *ir);
>>>     fs_reg *emit_general_interpolation(ir_variable *ir);
>>>     void emit_interpolation_setup_gen4();
>>>     void emit_interpolation_setup_gen6();
>>> +   void compute_sample_position(fs_reg dst, fs_reg int_sample_pos);
>>>     fs_reg rescale_texcoord(ir_texture *ir, fs_reg coordinate,
>>>                             bool is_rect, int sampler, int texunit);
>>>     fs_inst *emit_texture_gen4(ir_texture *ir, fs_reg dst, fs_reg coordinate,
>>> @@ -432,6 +435,7 @@ public:
>>>
>>>     struct hash_table *variable_ht;
>>>     fs_reg frag_depth;
>>> +   fs_reg sample_mask;
>>>     fs_reg outputs[BRW_MAX_DRAW_BUFFERS];
>>>     unsigned output_components[BRW_MAX_DRAW_BUFFERS];
>>>     fs_reg dual_src_output;
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> index e659203..75ced1d 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> @@ -61,9 +61,14 @@ fs_visitor::visit(ir_variable *ir)
>>>        reg = emit_fragcoord_interpolation(ir);
>>>        } else if (!strcmp(ir->name, "gl_FrontFacing")) {
>>>        reg = emit_frontfacing_interpolation(ir);
>>> +      } else if (!strcmp(ir->name, "gl_SamplePosition")) {
>>> +      reg = emit_samplepos_interpolation(ir);
>>> +      } else if (!strcmp(ir->name, "gl_SampleID")) {
>>> +      reg = emit_sampleid_interpolation(ir);
>>>        } else {
>>>        reg = emit_general_interpolation(ir);
>>>        }
>>> +
>>>        assert(reg);
>>>        hash_table_insert(this->variable_ht, reg, ir);
>>>        return;
>>> @@ -82,6 +87,8 @@ fs_visitor::visit(ir_variable *ir)
>>>        }
>>>        } else if (ir->location == FRAG_RESULT_DEPTH) {
>>>        this->frag_depth = *reg;
>>> +      } else if (ir->location == FRAG_RESULT_SAMPLE_MASK) {
>>> +         this->sample_mask = *reg;
>>>        } else {
>>>        /* gl_FragData or a user-defined FS output */
>>>        assert(ir->location >= FRAG_RESULT_DATA0 &&
>>> @@ -2502,6 +2509,22 @@ fs_visitor::emit_fb_writes()
>>>        pop_force_uncompressed();
>>>     }
>>>
>>> +   if (fp->Base.OutputsWritten & BITFIELD64_BIT(FRAG_RESULT_SAMPLE_MASK)) {
>>> +      this->current_annotation = "FB write oMask";
>>> +      assert(this->sample_mask.file != BAD_FILE);
>>> +      fs_reg reg = fs_reg(this, glsl_type::int_type);
>>> +
>>> +      /* Hand over gl_SampleMask. Only lower 16 bits are relevant. */
>>> +      emit(AND(this->sample_mask, this->sample_mask, fs_reg(255)));
>>
>> If the lower 16 bits are relevant, then why are you masking off all but
>> the lower 8 bits?  Today, we only support 8x MSAA, but we may want 16x
>> for future hardware.
>>
>> I think I'd prefer to see fs_reg(0xffff), unless there's some reason why
>> that doesn't work.
>>
> Yes, I should use 0xffff here. Thanks for catching it.
>
>>> +      emit(SHL(reg, this->sample_mask, fs_reg(16)));
>>
>> I'm having a hard time understanding why you need to shift by 16 here.
>>
>> My understanding is that gl_SampleMask[1] is single 32-bit integer,
>> where bit N == 0 means that sample N is uncovered.  The oMask payload
>> has the same meaning, but is only a 16-bit integer (since we don't
>> support 32x MSAA).
>>
>> Since we run in SIMD8 mode, sample_mask actually contains 8 integers (or
>> 16 for SIMD16) - one for each pixel.  oMask also contains 16 "slots"
>> (one per pixel - slots 8-15 are unused for SIMD8).
>>
>> So sample_mask looks like:
>>        7       6       5       4       3       2       1       0
>> hhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxx
>>
>> Masking off the top 16-bits gives:
>> 0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx
>>
>> The oMask payload is packed, and looks like:
>> ________________________________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>>
>> The SHL should produce:
>> xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000
>>
> We enable both SIMD8 and SIMD16 mode with gl_SampleMask[] because
> it doesn't require MSDISPMODE_PERSAMPLE.
>
> for SIMD8
> sample_mask looks like:
>         7       6       5       4       3       2       1       0
>  hhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxxhhhhxxxx
>
>  Masking off the top 16-bits gives:
>  0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx
>
>  SHL will produce:
>  xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000xxxx0000
>
> OR will produce:
> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> Relevant data for oMask paylaod is:
> ________________________________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> rest of the data is ignored.
> Let me know if I'm making some crazy mistake here.
>
Ken, did this help you to understand what I'm trying to do here?

>>> +
>>> +      this->sample_mask.type = BRW_REGISTER_TYPE_UW;
>>> +      reg.type = BRW_REGISTER_TYPE_UW;
>>> +      emit(OR(fs_reg(MRF, nr, BRW_REGISTER_TYPE_UW), this->sample_mask, reg));
>>
>> ...which you then OR into an uninitialized MRF (AFAICT), and...I don't
>> think this is going to work.
> This seems to work with the piglit test case (builtin-gl-sample-mask.cpp)
> which I developed for gl_SampleMask. I also verified the register data.
>
>>
>> I'm pretty sure you want MOV, not OR, and you want some sort of special
>> stride on the sample_mask source register.

I know we can specify stride if we have a brw_reg :-
fs_reg (stride(brw_vec1_grf(0, 0), 2, 4, 0));

But I could not  find a reliable way to use stride if we have a fs_reg.
That's why I used OR to get the desired result.
>>> +      nr += 1;
>>> +   }
>>> +
>>>     /* Reserve space for color. It'll be filled in per MRT below. */
>>>     int color_mrf = nr;
>>>     nr += 4 * reg_width;
>>> diff --git a/src/mesa/drivers/dri/i965/brw_wm.h b/src/mesa/drivers/dri/i965/brw_wm.h
>>> index aa786de..fda74ab 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_wm.h
>>> +++ b/src/mesa/drivers/dri/i965/brw_wm.h
>>> @@ -83,6 +83,7 @@ struct brw_wm_compile {
>>>     uint8_t source_w_reg;
>>>     uint8_t aa_dest_stencil_reg;
>>>     uint8_t dest_depth_reg;
>>> +   uint8_t sample_pos_reg;
>>>     uint8_t barycentric_coord_reg[BRW_WM_BARYCENTRIC_INTERP_MODE_COUNT];
>>>     uint8_t nr_payload_regs;
>>>     GLuint source_depth_to_render_target:1;
>>>
>>