[Mesa-dev] [PATCH V3 1/2] i965/blorp: Add bilinear filtering of samples for multisample scaled blits

Paul Berry stereotype441 at gmail.com
Tue Jun 25 10:27:08 PDT 2013


On 19 June 2013 19:45, Anuj Phogat <anuj.phogat at gmail.com> wrote:

> Current implementation of ext_framebuffer_multisample_blit_scaled in
> i965/blorp uses nearest filtering for multisample scaled blits. Using
> nearest filtering produces blocky artifacts and negates the benefits
> of MSAA. That is the reason why extension was not enabled on i965.
>
> This patch implements the bilinear filtering of samples in blorp engine.
> Images generated with this patch are free from blocky artifacts and show
> big improvement in visual quality.
>
> Observed no piglit and gles3 regressions.
>
> V3:
> - Algorithm used for filtering assumes a rectangular grid of samples
>   roughly corresponding to sample locations.
> - Test the boundary conditions on the edges of texture.
>
> Signed-off-by: Anuj Phogat <anuj.phogat at gmail.com>
>

Thanks for all your effort on this, Anuj.  I think we have an algorithm
that's going to get us good quality.  I have a number of minor comments
below, but I think the basic approach is good.


> ---
>  src/mesa/drivers/dri/i965/brw_blorp.h        |  11 ++
>  src/mesa/drivers/dri/i965/brw_blorp_blit.cpp | 257
> +++++++++++++++++++++++++--
>  2 files changed, 258 insertions(+), 10 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.h
> b/src/mesa/drivers/dri/i965/brw_blorp.h
> index ffc27cc..0a15b89 100644
> --- a/src/mesa/drivers/dri/i965/brw_blorp.h
> +++ b/src/mesa/drivers/dri/i965/brw_blorp.h
> @@ -319,6 +319,17 @@ struct brw_blorp_blit_prog_key
>      * than one sample per pixel.
>      */
>     bool persample_msaa_dispatch;
> +
> +   /* True for scaled blitting. */
> +   bool blit_scaled;
> +
> +   /* Source rectangle dimensions. Used to test boundary conditions in
> shader
> +    * program.
> +    */
> +   float src_x0;
> +   float src_y0;
> +   float src_x1;
> +   float src_y1;
>

Two comments about this:

(1) Rather than clamp to the boundary of the source rectangle, I have a
minor preference for clamping to the boundary of the source image.  This is
what I've observed in nVidia's blitter, and it's what's implemented in
Mesa's meta path for scaled blits.  Also, it would have the benefit of
simplifying some of the logic below, since the lower clamp will always be 0.

(2) We can't put these numbers in the program key, since they're going to
change frequently, and changing the program key causes the program to be
recompiled.  They need to go in push constants, like we do for the
destination rectangle coordinates and the multipliers and offsets.


>
>  };
>
>  class brw_blorp_blit_params : public brw_blorp_params
> diff --git a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
> b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
> index 8694128..e99c9df 100644
> --- a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
> @@ -622,7 +622,8 @@ private:
>     void kill_if_outside_dst_rect();
>     void translate_dst_to_src();
>     void single_to_blend();
> -   void manual_blend(unsigned num_samples);
> +   void manual_blend_average(unsigned num_samples);
> +   void manual_blend_linear(unsigned num_samples);
>

Minor nit pick: can we call this function manual_blend_bilinear()?
"linear" is such a generic term that it's hard to guess what it does, but
"bilinear" makes it obvious.


>     void sample(struct brw_reg dst);
>     void texel_fetch(struct brw_reg dst);
>     void mcs_fetch();
> @@ -676,6 +677,16 @@ private:
>      */
>     struct brw_reg y_coords[2];
>
> +   /* X, Y coordinates of the pixel from which we need to fetch the
> specific
> +    *  sample. These are used for multisample scaled blitting.
> +    */
> +   struct brw_reg x_sample_coords;
> +   struct brw_reg y_sample_coords;
> +
> +   /* Store the values to use to interpolate in x and y directions */
> +   struct brw_reg x_lerp;
> +   struct brw_reg y_lerp;
> +
>

I had trouble guessing what the "lerp" variables are used for.  How about
something like this instead?

/* Fractional parts of the x and y coordinates, used as bilinear
interpolation coefficients */
struct brw_reg x_frac;
struct brw_reg y_frac;


>     /* Which element of x_coords and y_coords is currently in use.
>      */
>     int xy_coord_index;
> @@ -814,15 +825,17 @@ brw_blorp_blit_program::compile(struct brw_context
> *brw,
>      * that we want to texture from.  Exception: if we are blending, then
> S is
>      * irrelevant, because we are going to fetch all samples.
>      */
> -   if (key->blend) {
> +   if (key->blend && !key->blit_scaled) {
>        if (brw->intel.gen == 6) {
>           /* Gen6 hardware an automatically blend using the SAMPLE message
> */
>           single_to_blend();
>           sample(texture_data[0]);
>        } else {
>           /* Gen7+ hardware doesn't automaticaly blend. */
> -         manual_blend(key->src_samples);
> +         manual_blend_average(key->src_samples);
>        }
> +   } else if(key->blend && key->blit_scaled) {
> +      manual_blend_linear(key->src_samples);
>     } else {
>        /* We aren't blending, which means we just want to fetch a single
> sample
>         * from the source surface.  The address that we want to fetch from
> is
> @@ -913,6 +926,18 @@ brw_blorp_blit_program::alloc_regs()
>           = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
>        reg += 2;
>     }
> +
> +   if (key->blit_scaled && key->blend) {
> +      this->x_sample_coords = brw_vec8_grf(reg, 0);
> +      reg += 2;
> +      this->y_sample_coords = brw_vec8_grf(reg, 0);
> +      reg += 2;
> +      this->x_lerp = brw_vec8_grf(reg, 0);
> +      reg += 2;
> +      this->y_lerp = brw_vec8_grf(reg, 0);
> +      reg += 2;
> +   }
> +
>     this->xy_coord_index = 0;
>     this->sample_index
>        = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
> @@ -1368,11 +1393,82 @@ brw_blorp_blit_program::translate_dst_to_src()
>     brw_MUL(&func, Y_f, Yp_f, y_transform.multiplier);
>     brw_ADD(&func, X_f, X_f, x_transform.offset);
>     brw_ADD(&func, Y_f, Y_f, y_transform.offset);
> -   /* Round the float coordinates down to nearest integer by moving to
> -    * UD registers.
> -    */
> -   brw_MOV(&func, Xp, X_f);
> -   brw_MOV(&func, Yp, Y_f);
> +   if (key->blit_scaled && key->blend) {
> +      float x_scale = 2.0;
> +      float y_scale = key->src_samples / 2.0;
>

It would be nice to have a comment above x_scale and y_scale explaining
that they are the scale factor between the pixel grid and the grid of
samples that we're texturing from.

Also, I'd recommend moving them to someplace more global (either the class
definition or the program key) so that they can be accessed from
manual_blend_linear().


> +      /* Translate coordinates to lay out the samples in a rectangular
>  grid
> +       * roughly corresponding to sample locations.
>
+       */
> +      brw_ADD(&func, X_f, X_f, brw_imm_f(-0.25));
> +      brw_ADD(&func, Y_f, Y_f, brw_imm_f(-1.0 / key->src_samples));
> +      brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
> +      brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));
>

The additive constants -0.25 and -1.0/key->src_samples are a bit difficult
to understand.  I believe this is equivalent, and would be easier to follow:

/* Translate coordinates to lay out the samples in a rectangular grid
 * roughly corresponding to sample locations.
 */
brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));

/* Adjust coordinates so that integers represent pixel centers rather
 * than pixel edges.
 */
brw_ADD(&func, X_f, X_f, brw_imm_f(-0.5));
brw_ADD(&func, Y_f, Y_f, brw_imm_f(-0.5));


> +
> +      /* Test boundary conditions to properly handle the sampling of
> texels on
> +       * texture edges.
> +       */
> +      brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
> +              X_f, brw_imm_f(2.0  * key->src_x0));
> +      brw_IF(&func, BRW_EXECUTE_16);
> +      {
> +         /* Left Edge in X */
> +         brw_MOV(&func, x_lerp, brw_imm_f(1.0));
> +      }
> +      brw_ELSE(&func);
> +      {
> +         brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE,
> +                 X_f, brw_imm_f(2.0 * (key->src_x1 - 0.50)));
> +         brw_IF(&func, BRW_EXECUTE_16);
> +         {
> +            /* Right Edge in X */
> +            brw_MOV(&func, x_lerp, brw_imm_f(0.0));
> +         }
> +         brw_ELSE(&func);
> +         {
> +            /* Store the fractional part to be used as color interpolator
> */
> +            brw_FRC(&func, x_lerp, X_f);
> +         }
> +         brw_ENDIF(&func);
> +      }
> +      brw_ENDIF(&func);
>
+
> +      brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
> +              Y_f, brw_imm_f((y_scale) * key->src_y0));
> +      brw_IF(&func, BRW_EXECUTE_16);
> +      {
> +         /* Bottom Edge in Y */
> +         brw_MOV(&func, y_lerp, brw_imm_f(1.0));
> +      }
> +      brw_ELSE(&func);
> +      {
> +         brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE, Y_f,
> +                 brw_imm_f((y_scale) * (key->src_y1 - 1 / y_scale)));
> +         brw_IF(&func, BRW_EXECUTE_16);
> +         {
> +            /* Top Edge in Y */
> +            brw_MOV(&func, y_lerp, brw_imm_f(0.0));
> +         }
> +         brw_ELSE(&func);
> +         {
> +            /* Store the fractional part to be used as color interpolator
> */
> +            brw_FRC(&func, y_lerp, Y_f);
> +         }
> +         brw_ENDIF(&func);
> +      }
> +      brw_ENDIF(&func);
>

A few thoughts on the above conditional logic:

(1) I think the math would be easier to follow if the following expressions
were changed to equivalent forms:

"2.0 * key->src_x0" => "x_scale * key->src_x0"
"2.0 * (key->src_x1 - 0.50)" => "x_scale * key->src_x1 - 1.0"
"(y_scale) * (key->src_y1 - 1 / y_scale)" => "y_scale * key->src_y1 - 1.0"

This will also help in the future when we want to support hardware with 16x
MSAA, because in that case we'll want x_scale = y_scale = 4.

(2) Effectively what this code is doing is clamping the x and y coordinates
to the boundary of the source rectangle.  But since it's accomplishing the
clamping by adjusting x_lerp to 0.0 or 1.0, can only clamp things that
aren't too far outside the source rectangle to begin with (i.e. no more
than 1.0 outside the boundary).  I *think* that will work, but it seems
like an unnecessary risk (and an unnecessary source of confusion).  Why not
just clamp X_f to the range [x_scale * key->src_x0, x_scale * key->src_x1 -
1.0] (and similar for Y_f), and then afterward compute x_lerp = FRC(X_f)?
That way there will be no doubt that the coordinate is properly in range.

(3) You can probably save a lot of instructions if you use conditional
moves to do the clamping, e.g.

brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, X_f,
brw_imm_f(x_scale * key->src_x0));
brw_MOV(&func, X_f, brw_imm_f(x_scale * key->src_x0));
brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
...etc.


> +
> +     /* Round the float coordinates down to nearest integer */
> +     brw_RNDD(&func, Xp_f, X_f);
> +     brw_RNDD(&func, Yp_f, Y_f);
> +     brw_MUL(&func, X_f, Xp_f, brw_imm_f(0.5));
> +     brw_MUL(&func, Y_f, Yp_f, brw_imm_f(1 / y_scale));
> +   } else {
> +      /* Round the float coordinates down to nearest integer by moving to
> +       * UD registers.
> +       */
> +      brw_MOV(&func, Xp, X_f);
> +      brw_MOV(&func, Yp, Y_f);
> +   }
>     SWAP_XY_AND_XPYP();
>     brw_set_compression_control(&func, BRW_COMPRESSION_NONE);
>  }
> @@ -1418,7 +1514,7 @@ inline int count_trailing_one_bits(unsigned value)
>
>
>  void
> -brw_blorp_blit_program::manual_blend(unsigned num_samples)
> +brw_blorp_blit_program::manual_blend_average(unsigned num_samples)
>  {
>     if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>        mcs_fetch();
> @@ -1523,6 +1619,135 @@ brw_blorp_blit_program::manual_blend(unsigned
> num_samples)
>        brw_ENDIF(&func);
>  }
>
> +void
> +brw_blorp_blit_program::manual_blend_linear(unsigned num_samples)
> +{
> +   if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
> +      mcs_fetch();
>

This won't work.  The MCS value we fetch has to match up with the pixel
that we're sampling from.  Since this function samples from different
pixels in each iteration of the "for" loop below, the call to mcs_fetch()
needs to go inside the loop, and it needs to happen after storing the
coordinates in X and Y.


> +
> +   /* We do this computation by performing the following operations:
> +    *
> +    * In case of 4x, 8x MSAA:
> +    * - Compute the pixel coordinates and sample numbers (a, b, c, d)
> +    *   which are later used for interpolation
> +    * - linearly interpolate samples a and b in X
> +    * - linearly interpolate samples c and d in X
> +    * - linearly interpolate the results of last two operations in Y
> +    *
> +    *   result = lrp(lrp(a + b) + lrp(c + d))
> +    */
> +   struct brw_reg Xp_f = retype(Xp, BRW_REGISTER_TYPE_F);
> +   struct brw_reg Yp_f = retype(Yp, BRW_REGISTER_TYPE_F);
> +   struct brw_reg t1_f = retype(t1, BRW_REGISTER_TYPE_F);
> +   struct brw_reg t2_f = retype(t2, BRW_REGISTER_TYPE_F);
> +
> +   for (unsigned i = 0; i < 4; ++i) {
> +      assert(i < ARRAY_SIZE(texture_data));
> +      s_is_zero = false;
> +
> +      /* Compute pixel coordinates */
> +      brw_ADD(&func, vec16(x_sample_coords), Xp_f,
> +              brw_imm_f((float)(i & 0x1) * 0.5));
> +      brw_ADD(&func, vec16(y_sample_coords), Yp_f,
> +              brw_imm_f((float)(i & 0x2) * (1.0 / num_samples)));
> +      brw_MOV(&func, vec16(X), x_sample_coords);
> +      brw_MOV(&func, vec16(Y), y_sample_coords);
> +
> +      /* Compute sample index. In case of 4x MSAA, sample index is
> +       * equal to sample number.
> +       */
>

This comment should explain what you mean by "sample index" vs "sample
number".


> +      brw_FRC(&func, vec16(t1_f), x_sample_coords);
> +      brw_FRC(&func, vec16(t2_f), y_sample_coords);
> +      brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(2.0));
> +      brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(num_samples));
>

How about this instead?  It's a little more future proof and it clarifies
that what we're doing is computing our position within the grid of samples
corresponding to a single pixel:

brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(x_scale));
brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(x_scale * y_scale));


> +      brw_ADD(&func, vec16(t1_f), t1_f, t2_f);
> +      brw_MOV(&func, vec16(S), t1_f);
> +
> +      if (num_samples == 8) {
> +         /* Map the sample index to a sample number */
> +         brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
> +                 S, brw_imm_d(4));
> +         brw_IF(&func, BRW_EXECUTE_16);
> +         {
> +            brw_MOV(&func, vec16(t2), brw_imm_d(5));
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(1));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(2));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(2));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(4));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(3));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(6));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +         }
> +         brw_ELSE(&func);
> +         {
> +            brw_MOV(&func, vec16(t2), brw_imm_d(0));
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(5));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(3));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(6));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(7));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +            brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
> +                    S, brw_imm_d(7));
> +            brw_MOV(&func, vec16(t2), brw_imm_d(1));
> +            brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> +         }
> +         brw_ENDIF(&func);
> +         brw_MOV(&func, vec16(S), t2);
>

This seems like a lot of work to accomplish what is effectively a lookup
table.  If this winds up becoming a performance bottleneck, you might want
to consider passing the table in via a push constant, and using indirect
addressing to convert sample index to sample number.


> +      }
> +      texel_fetch(texture_data[i]);
> +
> +      if (i == 0 && key->tex_layout == INTEL_MSAA_LAYOUT_CMS) {
> +         /* The Ivy Bridge PRM, Vol4 Part1 p27 (Multisample Control
> Surface)
> +          * suggests an optimization:
> +          *
> +          *     "A simple optimization with probable large return in
> +          *     performance is to compare the MCS value to zero
> (indicating
> +          *     all samples are on sample slice 0), and sample only from
> +          *     sample slice 0 using ld2dss if MCS is zero."
> +          *
> +          * Note that in the case where the MCS value is zero, sampling
> from
> +          * sample slice 0 using ld2dss and sampling from sample 0 using
> +          * ld2dms are equivalent (since all samples are on sample slice
> 0).
> +          * Since we have already sampled from sample 0, all we need to
> do is
> +          * skip the remaining fetches and averaging if MCS is zero.
> +          */
> +         brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_NZ,
> +                 mcs_data, brw_imm_ud(0));
> +         brw_IF(&func, BRW_EXECUTE_16);
> +      }
>

We can't do this optimization for scaled multisample blits, because it
relies on the assumption that all the samples we're blending together
belong to the same pixel.  I'd recommend just pulling this code out.


> +   }
> +
> +#define SAMPLE(x, y) offset(texture_data[x], y)
> +   brw_set_access_mode(&func, BRW_ALIGN_16);
> +   for (int index = 3; index > 0; ) {
> +      for (int k = 0; k < 8; ++k)
> +         brw_LRP(&func,
> +                 vec8(SAMPLE(index - 1, k)),
> +                 offset(x_lerp, k & 1),
> +                 SAMPLE(index, k),
> +                 SAMPLE(index - 1, k));
> +      index -= 2;
> +   }
> +   for (int k = 0; k < 8; ++k)
> +      brw_LRP(&func,
> +              vec8(SAMPLE(0, k)),
> +              offset(y_lerp, k & 1),
> +              vec8(SAMPLE(2, k)),
> +              vec8(SAMPLE(0, k)));
> +   brw_set_access_mode(&func, BRW_ALIGN_1);
>

I'm confused why we need loops from 0 to 7 here.  It looks like you're
trying to interpolate each component of the SIMD8 register separately.
That shouldn't be necessary.


> +#undef SAMPLE
> +   if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
> +      brw_ENDIF(&func);
> +}
> +
>  /**
>   * Emit code to look up a value in the texture using the SAMPLE message
> (which
>   * does blending of MSAA surfaces).
> @@ -1535,7 +1760,8 @@ brw_blorp_blit_program::sample(struct brw_reg dst)
>        SAMPLER_MESSAGE_ARG_V_FLOAT
>     };
>
> -   texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args,
> ARRAY_SIZE(args));
> +   texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args,
> +                  ARRAY_SIZE(args));
>  }
>
>  /**
> @@ -1872,6 +2098,17 @@ brw_blorp_blit_params::brw_blorp_blit_params(struct
> brw_context *brw,
>        wm_prog_key.persample_msaa_dispatch = true;
>     }
>
> +   /* Scaled blitting or not. */
> +   wm_prog_key.blit_scaled =
> +      ((dst_x1 - dst_x0) == (src_x1 - src_x0) &&
> +       (dst_y1 - dst_y0) == (src_y1 - src_y0)) ? false : true;
> +
> +   /* Source rectangle dimensions */
> +   wm_prog_key.src_x0 = src_x0;
> +   wm_prog_key.src_y0 = src_y0;
> +   wm_prog_key.src_x1 = src_x1;
> +   wm_prog_key.src_y1 = src_y1;
> +
>     /* The render path must be configured to use the same number of
> samples as
>      * the destination buffer.
>      */
> --
> 1.8.1.4
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130625/2419d7eb/attachment-0001.html>


More information about the mesa-dev mailing list