[Mesa-dev] [PATCH V3 1/2] i965/blorp: Add bilinear filtering of samples for multisample scaled blits
Anuj Phogat
anuj.phogat at gmail.com
Wed Jun 26 14:57:32 PDT 2013
On Tue, Jun 25, 2013 at 10:27 AM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 19 June 2013 19:45, Anuj Phogat <anuj.phogat at gmail.com> wrote:
>>
>> Current implementation of ext_framebuffer_multisample_blit_scaled in
>> i965/blorp uses nearest filtering for multisample scaled blits. Using
>> nearest filtering produces blocky artifacts and negates the benefits
>> of MSAA. That is the reason why extension was not enabled on i965.
>>
>> This patch implements the bilinear filtering of samples in blorp engine.
>> Images generated with this patch are free from blocky artifacts and show
>> big improvement in visual quality.
>>
>> Observed no piglit and gles3 regressions.
>>
>> V3:
>> - Algorithm used for filtering assumes a rectangular grid of samples
>> roughly corresponding to sample locations.
>> - Test the boundary conditions on the edges of texture.
>>
>> Signed-off-by: Anuj Phogat <anuj.phogat at gmail.com>
>
>
> Thanks for all your effort on this, Anuj. I think we have an algorithm
> that's going to get us good quality. I have a number of minor comments
> below, but I think the basic approach is good.
>
Thanks for the detailed review Paul. All of your suggestions are very useful
to improve the patch.
>>
>> ---
>> src/mesa/drivers/dri/i965/brw_blorp.h | 11 ++
>> src/mesa/drivers/dri/i965/brw_blorp_blit.cpp | 257
>> +++++++++++++++++++++++++--
>> 2 files changed, 258 insertions(+), 10 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.h
>> b/src/mesa/drivers/dri/i965/brw_blorp.h
>> index ffc27cc..0a15b89 100644
>> --- a/src/mesa/drivers/dri/i965/brw_blorp.h
>> +++ b/src/mesa/drivers/dri/i965/brw_blorp.h
>> @@ -319,6 +319,17 @@ struct brw_blorp_blit_prog_key
>> * than one sample per pixel.
>> */
>> bool persample_msaa_dispatch;
>> +
>> + /* True for scaled blitting. */
>> + bool blit_scaled;
>> +
>> + /* Source rectangle dimensions. Used to test boundary conditions in
>> shader
>> + * program.
>> + */
>> + float src_x0;
>> + float src_y0;
>> + float src_x1;
>> + float src_y1;
>
>
> Two comments about this:
>
> (1) Rather than clamp to the boundary of the source rectangle, I have a
> minor preference for clamping to the boundary of the source image. This is
> what I've observed in nVidia's blitter, and it's what's implemented in
> Mesa's meta path for scaled blits. Also, it would have the benefit of
> simplifying some of the logic below, since the lower clamp will always be 0.
>
> (2) We can't put these numbers in the program key, since they're going to
> change frequently, and changing the program key causes the program to be
> recompiled. They need to go in push constants, like we do for the
> destination rectangle coordinates and the multipliers and offsets.
>
I agree.
>>
>> };
>>
>> class brw_blorp_blit_params : public brw_blorp_params
>> diff --git a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> index 8694128..e99c9df 100644
>> --- a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> @@ -622,7 +622,8 @@ private:
>> void kill_if_outside_dst_rect();
>> void translate_dst_to_src();
>> void single_to_blend();
>> - void manual_blend(unsigned num_samples);
>> + void manual_blend_average(unsigned num_samples);
>> + void manual_blend_linear(unsigned num_samples);
>
>
> Minor nit pick: can we call this function manual_blend_bilinear()? "linear"
> is such a generic term that it's hard to guess what it does, but "bilinear"
> makes it obvious.
>
>>
>> void sample(struct brw_reg dst);
>> void texel_fetch(struct brw_reg dst);
>> void mcs_fetch();
>> @@ -676,6 +677,16 @@ private:
>> */
>> struct brw_reg y_coords[2];
>>
>> + /* X, Y coordinates of the pixel from which we need to fetch the
>> specific
>> + * sample. These are used for multisample scaled blitting.
>> + */
>> + struct brw_reg x_sample_coords;
>> + struct brw_reg y_sample_coords;
>> +
>> + /* Store the values to use to interpolate in x and y directions */
>> + struct brw_reg x_lerp;
>> + struct brw_reg y_lerp;
>> +
>
>
> I had trouble guessing what the "lerp" variables are used for. How about
> something like this instead?
>
> /* Fractional parts of the x and y coordinates, used as bilinear
> interpolation coefficients */
> struct brw_reg x_frac;
> struct brw_reg y_frac;
>
I'll change the variable names.
>>
>> /* Which element of x_coords and y_coords is currently in use.
>> */
>> int xy_coord_index;
>> @@ -814,15 +825,17 @@ brw_blorp_blit_program::compile(struct brw_context
>> *brw,
>> * that we want to texture from. Exception: if we are blending, then
>> S is
>> * irrelevant, because we are going to fetch all samples.
>> */
>> - if (key->blend) {
>> + if (key->blend && !key->blit_scaled) {
>> if (brw->intel.gen == 6) {
>> /* Gen6 hardware an automatically blend using the SAMPLE message
>> */
>> single_to_blend();
>> sample(texture_data[0]);
>> } else {
>> /* Gen7+ hardware doesn't automaticaly blend. */
>> - manual_blend(key->src_samples);
>> + manual_blend_average(key->src_samples);
>> }
>> + } else if(key->blend && key->blit_scaled) {
>> + manual_blend_linear(key->src_samples);
>> } else {
>> /* We aren't blending, which means we just want to fetch a single
>> sample
>> * from the source surface. The address that we want to fetch from
>> is
>> @@ -913,6 +926,18 @@ brw_blorp_blit_program::alloc_regs()
>> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
>> reg += 2;
>> }
>> +
>> + if (key->blit_scaled && key->blend) {
>> + this->x_sample_coords = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->y_sample_coords = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->x_lerp = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->y_lerp = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + }
>> +
>> this->xy_coord_index = 0;
>> this->sample_index
>> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
>> @@ -1368,11 +1393,82 @@ brw_blorp_blit_program::translate_dst_to_src()
>> brw_MUL(&func, Y_f, Yp_f, y_transform.multiplier);
>> brw_ADD(&func, X_f, X_f, x_transform.offset);
>> brw_ADD(&func, Y_f, Y_f, y_transform.offset);
>> - /* Round the float coordinates down to nearest integer by moving to
>> - * UD registers.
>> - */
>> - brw_MOV(&func, Xp, X_f);
>> - brw_MOV(&func, Yp, Y_f);
>> + if (key->blit_scaled && key->blend) {
>> + float x_scale = 2.0;
>> + float y_scale = key->src_samples / 2.0;
>
>
> It would be nice to have a comment above x_scale and y_scale explaining that
> they are the scale factor between the pixel grid and the grid of samples
> that we're texturing from.
>
> Also, I'd recommend moving them to someplace more global (either the class
> definition or the program key) so that they can be accessed from
> manual_blend_linear().
>
yes. I'll do.
>>
>> + /* Translate coordinates to lay out the samples in a rectangular
>> grid
>> + * roughly corresponding to sample locations.
>>
>> + */
>> + brw_ADD(&func, X_f, X_f, brw_imm_f(-0.25));
>> + brw_ADD(&func, Y_f, Y_f, brw_imm_f(-1.0 / key->src_samples));
>> + brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
>> + brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));
>
>
> The additive constants -0.25 and -1.0/key->src_samples are a bit difficult
> to understand. I believe this is equivalent, and would be easier to follow:
>
> /* Translate coordinates to lay out the samples in a rectangular grid
> * roughly corresponding to sample locations.
> */
> brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
> brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));
>
> /* Adjust coordinates so that integers represent pixel centers rather
> * than pixel edges.
> */
> brw_ADD(&func, X_f, X_f, brw_imm_f(-0.5));
> brw_ADD(&func, Y_f, Y_f, brw_imm_f(-0.5));
>
Yes, this looks better.
>>
>> +
>> + /* Test boundary conditions to properly handle the sampling of
>> texels on
>> + * texture edges.
>> + */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + X_f, brw_imm_f(2.0 * key->src_x0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Left Edge in X */
>> + brw_MOV(&func, x_lerp, brw_imm_f(1.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE,
>> + X_f, brw_imm_f(2.0 * (key->src_x1 - 0.50)));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Right Edge in X */
>> + brw_MOV(&func, x_lerp, brw_imm_f(0.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + /* Store the fractional part to be used as color interpolator
>> */
>> + brw_FRC(&func, x_lerp, X_f);
>> + }
>> + brw_ENDIF(&func);
>> + }
>> + brw_ENDIF(&func);
>>
>> +
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + Y_f, brw_imm_f((y_scale) * key->src_y0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Bottom Edge in Y */
>> + brw_MOV(&func, y_lerp, brw_imm_f(1.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE, Y_f,
>> + brw_imm_f((y_scale) * (key->src_y1 - 1 / y_scale)));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Top Edge in Y */
>> + brw_MOV(&func, y_lerp, brw_imm_f(0.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + /* Store the fractional part to be used as color interpolator
>> */
>> + brw_FRC(&func, y_lerp, Y_f);
>> + }
>> + brw_ENDIF(&func);
>> + }
>> + brw_ENDIF(&func);
>
>
> A few thoughts on the above conditional logic:
>
> (1) I think the math would be easier to follow if the following expressions
> were changed to equivalent forms:
>
> "2.0 * key->src_x0" => "x_scale * key->src_x0"
> "2.0 * (key->src_x1 - 0.50)" => "x_scale * key->src_x1 - 1.0"
> "(y_scale) * (key->src_y1 - 1 / y_scale)" => "y_scale * key->src_y1 - 1.0"
>
> This will also help in the future when we want to support hardware with 16x
> MSAA, because in that case we'll want x_scale = y_scale = 4.
>
> (2) Effectively what this code is doing is clamping the x and y coordinates
> to the boundary of the source rectangle. But since it's accomplishing the
> clamping by adjusting x_lerp to 0.0 or 1.0, can only clamp things that
> aren't too far outside the source rectangle to begin with (i.e. no more than
> 1.0 outside the boundary). I *think* that will work, but it seems like an
> unnecessary risk (and an unnecessary source of confusion). Why not just
> clamp X_f to the range [x_scale * key->src_x0, x_scale * key->src_x1 - 1.0]
> (and similar for Y_f), and then afterward compute x_lerp = FRC(X_f)? That
> way there will be no doubt that the coordinate is properly in range.
>
> (3) You can probably save a lot of instructions if you use conditional moves
> to do the clamping, e.g.
>
> brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, X_f,
> brw_imm_f(x_scale * key->src_x0));
> brw_MOV(&func, X_f, brw_imm_f(x_scale * key->src_x0));
> brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> ...etc.
>
yeah, this will save a lot of instructions.
>>
>> +
>> + /* Round the float coordinates down to nearest integer */
>> + brw_RNDD(&func, Xp_f, X_f);
>> + brw_RNDD(&func, Yp_f, Y_f);
>> + brw_MUL(&func, X_f, Xp_f, brw_imm_f(0.5));
>> + brw_MUL(&func, Y_f, Yp_f, brw_imm_f(1 / y_scale));
>> + } else {
>> + /* Round the float coordinates down to nearest integer by moving to
>> + * UD registers.
>> + */
>> + brw_MOV(&func, Xp, X_f);
>> + brw_MOV(&func, Yp, Y_f);
>> + }
>> SWAP_XY_AND_XPYP();
>> brw_set_compression_control(&func, BRW_COMPRESSION_NONE);
>> }
>> @@ -1418,7 +1514,7 @@ inline int count_trailing_one_bits(unsigned value)
>>
>>
>> void
>> -brw_blorp_blit_program::manual_blend(unsigned num_samples)
>> +brw_blorp_blit_program::manual_blend_average(unsigned num_samples)
>> {
>> if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> mcs_fetch();
>> @@ -1523,6 +1619,135 @@ brw_blorp_blit_program::manual_blend(unsigned
>> num_samples)
>> brw_ENDIF(&func);
>> }
>>
>> +void
>> +brw_blorp_blit_program::manual_blend_linear(unsigned num_samples)
>> +{
>> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> + mcs_fetch();
>
>
> This won't work. The MCS value we fetch has to match up with the pixel that
> we're sampling from. Since this function samples from different pixels in
> each iteration of the "for" loop below, the call to mcs_fetch() needs to go
> inside the loop, and it needs to happen after storing the coordinates in X
> and Y.
>
>>
>> +
>> + /* We do this computation by performing the following operations:
>> + *
>> + * In case of 4x, 8x MSAA:
>> + * - Compute the pixel coordinates and sample numbers (a, b, c, d)
>> + * which are later used for interpolation
>> + * - linearly interpolate samples a and b in X
>> + * - linearly interpolate samples c and d in X
>> + * - linearly interpolate the results of last two operations in Y
>> + *
>> + * result = lrp(lrp(a + b) + lrp(c + d))
>> + */
>> + struct brw_reg Xp_f = retype(Xp, BRW_REGISTER_TYPE_F);
>> + struct brw_reg Yp_f = retype(Yp, BRW_REGISTER_TYPE_F);
>> + struct brw_reg t1_f = retype(t1, BRW_REGISTER_TYPE_F);
>> + struct brw_reg t2_f = retype(t2, BRW_REGISTER_TYPE_F);
>> +
>> + for (unsigned i = 0; i < 4; ++i) {
>> + assert(i < ARRAY_SIZE(texture_data));
>> + s_is_zero = false;
>> +
>> + /* Compute pixel coordinates */
>> + brw_ADD(&func, vec16(x_sample_coords), Xp_f,
>> + brw_imm_f((float)(i & 0x1) * 0.5));
>> + brw_ADD(&func, vec16(y_sample_coords), Yp_f,
>> + brw_imm_f((float)(i & 0x2) * (1.0 / num_samples)));
>> + brw_MOV(&func, vec16(X), x_sample_coords);
>> + brw_MOV(&func, vec16(Y), y_sample_coords);
>> +
>> + /* Compute sample index. In case of 4x MSAA, sample index is
>> + * equal to sample number.
>> + */
>
>
> This comment should explain what you mean by "sample index" vs "sample
> number".
>
>>
>> + brw_FRC(&func, vec16(t1_f), x_sample_coords);
>> + brw_FRC(&func, vec16(t2_f), y_sample_coords);
>> + brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(2.0));
>> + brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(num_samples));
>
>
> How about this instead? It's a little more future proof and it clarifies
> that what we're doing is computing our position within the grid of samples
> corresponding to a single pixel:
>
> brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(x_scale));
> brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(x_scale * y_scale));
>
>>
>> + brw_ADD(&func, vec16(t1_f), t1_f, t2_f);
>> + brw_MOV(&func, vec16(S), t1_f);
>> +
>> + if (num_samples == 8) {
>> + /* Map the sample index to a sample number */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + S, brw_imm_d(4));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + brw_MOV(&func, vec16(t2), brw_imm_d(5));
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(1));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(2));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(2));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(4));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(3));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(6));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_MOV(&func, vec16(t2), brw_imm_d(0));
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(5));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(3));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(6));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(7));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(7));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(1));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + }
>> + brw_ENDIF(&func);
>> + brw_MOV(&func, vec16(S), t2);
>
>
> This seems like a lot of work to accomplish what is effectively a lookup
> table. If this winds up becoming a performance bottleneck, you might want
> to consider passing the table in via a push constant, and using indirect
> addressing to convert sample index to sample number.
>
Yes, that would be helpful. Is it acceptable to do this optimization as a
follow up patch?
>>
>> + }
>> + texel_fetch(texture_data[i]);
>> +
>> + if (i == 0 && key->tex_layout == INTEL_MSAA_LAYOUT_CMS) {
>> + /* The Ivy Bridge PRM, Vol4 Part1 p27 (Multisample Control
>> Surface)
>> + * suggests an optimization:
>> + *
>> + * "A simple optimization with probable large return in
>> + * performance is to compare the MCS value to zero
>> (indicating
>> + * all samples are on sample slice 0), and sample only from
>> + * sample slice 0 using ld2dss if MCS is zero."
>> + *
>> + * Note that in the case where the MCS value is zero, sampling
>> from
>> + * sample slice 0 using ld2dss and sampling from sample 0 using
>> + * ld2dms are equivalent (since all samples are on sample slice
>> 0).
>> + * Since we have already sampled from sample 0, all we need to
>> do is
>> + * skip the remaining fetches and averaging if MCS is zero.
>> + */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_NZ,
>> + mcs_data, brw_imm_ud(0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + }
>
>
> We can't do this optimization for scaled multisample blits, because it
> relies on the assumption that all the samples we're blending together belong
> to the same pixel. I'd recommend just pulling this code out.
>
Makes sense. I'll remove this code.
>>
>> + }
>> +
>> +#define SAMPLE(x, y) offset(texture_data[x], y)
>> + brw_set_access_mode(&func, BRW_ALIGN_16);
>> + for (int index = 3; index > 0; ) {
>> + for (int k = 0; k < 8; ++k)
>> + brw_LRP(&func,
>> + vec8(SAMPLE(index - 1, k)),
>> + offset(x_lerp, k & 1),
>> + SAMPLE(index, k),
>> + SAMPLE(index - 1, k));
>> + index -= 2;
>> + }
>> + for (int k = 0; k < 8; ++k)
>> + brw_LRP(&func,
>> + vec8(SAMPLE(0, k)),
>> + offset(y_lerp, k & 1),
>> + vec8(SAMPLE(2, k)),
>> + vec8(SAMPLE(0, k)));
>> + brw_set_access_mode(&func, BRW_ALIGN_1);
>
>
> I'm confused why we need loops from 0 to 7 here. It looks like you're
> trying to interpolate each component of the SIMD8 register separately. That
> shouldn't be necessary.
>
Output of each SIMD16 sampler instruction (send) is written to 8 grf registers.
So, I'm looping from 0 to 7 to interpolate all these 8 registers. If we lower
the loop counter, we end up with color data 'not-interpolated' in few color
channels. Does this explain your concern?
>>
>> +#undef SAMPLE
>> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> + brw_ENDIF(&func);
>> +}
>> +
>> /**
>> * Emit code to look up a value in the texture using the SAMPLE message
>> (which
>> * does blending of MSAA surfaces).
>> @@ -1535,7 +1760,8 @@ brw_blorp_blit_program::sample(struct brw_reg dst)
>> SAMPLER_MESSAGE_ARG_V_FLOAT
>> };
>>
>> - texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args,
>> ARRAY_SIZE(args));
>> + texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args,
>> + ARRAY_SIZE(args));
>> }
>>
>> /**
>> @@ -1872,6 +2098,17 @@ brw_blorp_blit_params::brw_blorp_blit_params(struct
>> brw_context *brw,
>> wm_prog_key.persample_msaa_dispatch = true;
>> }
>>
>> + /* Scaled blitting or not. */
>> + wm_prog_key.blit_scaled =
>> + ((dst_x1 - dst_x0) == (src_x1 - src_x0) &&
>> + (dst_y1 - dst_y0) == (src_y1 - src_y0)) ? false : true;
>> +
>> + /* Source rectangle dimensions */
>> + wm_prog_key.src_x0 = src_x0;
>> + wm_prog_key.src_y0 = src_y0;
>> + wm_prog_key.src_x1 = src_x1;
>> + wm_prog_key.src_y1 = src_y1;
>> +
>> /* The render path must be configured to use the same number of
>> samples as
>> * the destination buffer.
>> */
>> --
>> 1.8.1.4
>>
>
More information about the mesa-dev
mailing list