[Mesa-dev] [PATCH V3 1/2] i965/blorp: Add bilinear filtering of samples for multisample scaled blits
Anuj Phogat
anuj.phogat at gmail.com
Wed Jun 26 19:41:19 PDT 2013
On Tue, Jun 25, 2013 at 10:27 AM, Paul Berry <stereotype441 at gmail.com> wrote:
>
> On 19 June 2013 19:45, Anuj Phogat <anuj.phogat at gmail.com> wrote:
>>
>> Current implementation of ext_framebuffer_multisample_blit_scaled in
>> i965/blorp uses nearest filtering for multisample scaled blits. Using
>> nearest filtering produces blocky artifacts and negates the benefits
>> of MSAA. That is the reason why extension was not enabled on i965.
>>
>> This patch implements the bilinear filtering of samples in blorp engine.
>> Images generated with this patch are free from blocky artifacts and show
>> big improvement in visual quality.
>>
>> Observed no piglit and gles3 regressions.
>>
>> V3:
>> - Algorithm used for filtering assumes a rectangular grid of samples
>> roughly corresponding to sample locations.
>> - Test the boundary conditions on the edges of texture.
>>
>> Signed-off-by: Anuj Phogat <anuj.phogat at gmail.com>
>
>
> Thanks for all your effort on this, Anuj. I think we have an algorithm that's going to get us good quality. I have a number of minor comments below, but I think the basic approach is good.
>
>>
>> ---
>> src/mesa/drivers/dri/i965/brw_blorp.h | 11 ++
>> src/mesa/drivers/dri/i965/brw_blorp_blit.cpp | 257 +++++++++++++++++++++++++--
>> 2 files changed, 258 insertions(+), 10 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.h b/src/mesa/drivers/dri/i965/brw_blorp.h
>> index ffc27cc..0a15b89 100644
>> --- a/src/mesa/drivers/dri/i965/brw_blorp.h
>> +++ b/src/mesa/drivers/dri/i965/brw_blorp.h
>> @@ -319,6 +319,17 @@ struct brw_blorp_blit_prog_key
>> * than one sample per pixel.
>> */
>> bool persample_msaa_dispatch;
>> +
>> + /* True for scaled blitting. */
>> + bool blit_scaled;
>> +
>> + /* Source rectangle dimensions. Used to test boundary conditions in shader
>> + * program.
>> + */
>> + float src_x0;
>> + float src_y0;
>> + float src_x1;
>> + float src_y1;
>
>
> Two comments about this:
>
> (1) Rather than clamp to the boundary of the source rectangle, I have a minor preference for clamping to the boundary of the source image. This is what I've observed in nVidia's blitter, and it's what's implemented in Mesa's meta path for scaled blits. Also, it would have the benefit of simplifying some of the logic below, since the lower clamp will always be 0.
>
> (2) We can't put these numbers in the program key, since they're going to change frequently, and changing the program key causes the program to be recompiled. They need to go in push constants, like we do for the destination rectangle coordinates and the multipliers and offsets.
>
>
>>
>> };
>>
>> class brw_blorp_blit_params : public brw_blorp_params
>> diff --git a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> index 8694128..e99c9df 100644
>> --- a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp
>> @@ -622,7 +622,8 @@ private:
>> void kill_if_outside_dst_rect();
>> void translate_dst_to_src();
>> void single_to_blend();
>> - void manual_blend(unsigned num_samples);
>> + void manual_blend_average(unsigned num_samples);
>> + void manual_blend_linear(unsigned num_samples);
>
>
> Minor nit pick: can we call this function manual_blend_bilinear()? "linear" is such a generic term that it's hard to guess what it does, but "bilinear" makes it obvious.
>
>>
>> void sample(struct brw_reg dst);
>> void texel_fetch(struct brw_reg dst);
>> void mcs_fetch();
>> @@ -676,6 +677,16 @@ private:
>> */
>> struct brw_reg y_coords[2];
>>
>> + /* X, Y coordinates of the pixel from which we need to fetch the specific
>> + * sample. These are used for multisample scaled blitting.
>> + */
>> + struct brw_reg x_sample_coords;
>> + struct brw_reg y_sample_coords;
>> +
>> + /* Store the values to use to interpolate in x and y directions */
>> + struct brw_reg x_lerp;
>> + struct brw_reg y_lerp;
>> +
>
>
> I had trouble guessing what the "lerp" variables are used for. How about something like this instead?
>
> /* Fractional parts of the x and y coordinates, used as bilinear interpolation coefficients */
> struct brw_reg x_frac;
> struct brw_reg y_frac;
>
>>
>> /* Which element of x_coords and y_coords is currently in use.
>> */
>> int xy_coord_index;
>> @@ -814,15 +825,17 @@ brw_blorp_blit_program::compile(struct brw_context *brw,
>> * that we want to texture from. Exception: if we are blending, then S is
>> * irrelevant, because we are going to fetch all samples.
>> */
>> - if (key->blend) {
>> + if (key->blend && !key->blit_scaled) {
>> if (brw->intel.gen == 6) {
>> /* Gen6 hardware an automatically blend using the SAMPLE message */
>> single_to_blend();
>> sample(texture_data[0]);
>> } else {
>> /* Gen7+ hardware doesn't automaticaly blend. */
>> - manual_blend(key->src_samples);
>> + manual_blend_average(key->src_samples);
>> }
>> + } else if(key->blend && key->blit_scaled) {
>> + manual_blend_linear(key->src_samples);
>> } else {
>> /* We aren't blending, which means we just want to fetch a single sample
>> * from the source surface. The address that we want to fetch from is
>> @@ -913,6 +926,18 @@ brw_blorp_blit_program::alloc_regs()
>> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
>> reg += 2;
>> }
>> +
>> + if (key->blit_scaled && key->blend) {
>> + this->x_sample_coords = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->y_sample_coords = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->x_lerp = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + this->y_lerp = brw_vec8_grf(reg, 0);
>> + reg += 2;
>> + }
>> +
>> this->xy_coord_index = 0;
>> this->sample_index
>> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD);
>> @@ -1368,11 +1393,82 @@ brw_blorp_blit_program::translate_dst_to_src()
>> brw_MUL(&func, Y_f, Yp_f, y_transform.multiplier);
>> brw_ADD(&func, X_f, X_f, x_transform.offset);
>> brw_ADD(&func, Y_f, Y_f, y_transform.offset);
>> - /* Round the float coordinates down to nearest integer by moving to
>> - * UD registers.
>> - */
>> - brw_MOV(&func, Xp, X_f);
>> - brw_MOV(&func, Yp, Y_f);
>> + if (key->blit_scaled && key->blend) {
>> + float x_scale = 2.0;
>> + float y_scale = key->src_samples / 2.0;
>
>
> It would be nice to have a comment above x_scale and y_scale explaining that they are the scale factor between the pixel grid and the grid of samples that we're texturing from.
>
> Also, I'd recommend moving them to someplace more global (either the class definition or the program key) so that they can be accessed from manual_blend_linear().
>
>>
>> + /* Translate coordinates to lay out the samples in a rectangular grid
>> + * roughly corresponding to sample locations.
>>
>> + */
>> + brw_ADD(&func, X_f, X_f, brw_imm_f(-0.25));
>> + brw_ADD(&func, Y_f, Y_f, brw_imm_f(-1.0 / key->src_samples));
>> + brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
>> + brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));
>
>
> The additive constants -0.25 and -1.0/key->src_samples are a bit difficult to understand. I believe this is equivalent, and would be easier to follow:
>
> /* Translate coordinates to lay out the samples in a rectangular grid
> * roughly corresponding to sample locations.
> */
> brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale));
> brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale));
>
> /* Adjust coordinates so that integers represent pixel centers rather
> * than pixel edges.
> */
> brw_ADD(&func, X_f, X_f, brw_imm_f(-0.5));
> brw_ADD(&func, Y_f, Y_f, brw_imm_f(-0.5));
>
>>
>> +
>> + /* Test boundary conditions to properly handle the sampling of texels on
>> + * texture edges.
>> + */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + X_f, brw_imm_f(2.0 * key->src_x0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Left Edge in X */
>> + brw_MOV(&func, x_lerp, brw_imm_f(1.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE,
>> + X_f, brw_imm_f(2.0 * (key->src_x1 - 0.50)));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Right Edge in X */
>> + brw_MOV(&func, x_lerp, brw_imm_f(0.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + /* Store the fractional part to be used as color interpolator */
>> + brw_FRC(&func, x_lerp, X_f);
>> + }
>> + brw_ENDIF(&func);
>> + }
>> + brw_ENDIF(&func);
>>
>> +
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + Y_f, brw_imm_f((y_scale) * key->src_y0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Bottom Edge in Y */
>> + brw_MOV(&func, y_lerp, brw_imm_f(1.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE, Y_f,
>> + brw_imm_f((y_scale) * (key->src_y1 - 1 / y_scale)));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + /* Top Edge in Y */
>> + brw_MOV(&func, y_lerp, brw_imm_f(0.0));
>> + }
>> + brw_ELSE(&func);
>> + {
>> + /* Store the fractional part to be used as color interpolator */
>> + brw_FRC(&func, y_lerp, Y_f);
>> + }
>> + brw_ENDIF(&func);
>> + }
>> + brw_ENDIF(&func);
>
>
> A few thoughts on the above conditional logic:
>
> (1) I think the math would be easier to follow if the following expressions were changed to equivalent forms:
>
> "2.0 * key->src_x0" => "x_scale * key->src_x0"
> "2.0 * (key->src_x1 - 0.50)" => "x_scale * key->src_x1 - 1.0"
> "(y_scale) * (key->src_y1 - 1 / y_scale)" => "y_scale * key->src_y1 - 1.0"
>
> This will also help in the future when we want to support hardware with 16x MSAA, because in that case we'll want x_scale = y_scale = 4.
>
> (2) Effectively what this code is doing is clamping the x and y coordinates to the boundary of the source rectangle. But since it's accomplishing the clamping by adjusting x_lerp to 0.0 or 1.0, can only clamp things that aren't too far outside the source rectangle to begin with (i.e. no more than 1.0 outside the boundary). I *think* that will work, but it seems like an unnecessary risk (and an unnecessary source of confusion). Why not just clamp X_f to the range [x_scale * key->src_x0, x_scale * key->src_x1 - 1.0] (and similar for Y_f), and then afterward compute x_lerp = FRC(X_f)? That way there will be no doubt that the coordinate is properly in range.
>
> (3) You can probably save a lot of instructions if you use conditional moves to do the clamping, e.g.
>
> brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, X_f, brw_imm_f(x_scale * key->src_x0));
> brw_MOV(&func, X_f, brw_imm_f(x_scale * key->src_x0));
> brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
> ...etc.
>
>>
>> +
>> + /* Round the float coordinates down to nearest integer */
>> + brw_RNDD(&func, Xp_f, X_f);
>> + brw_RNDD(&func, Yp_f, Y_f);
>> + brw_MUL(&func, X_f, Xp_f, brw_imm_f(0.5));
>> + brw_MUL(&func, Y_f, Yp_f, brw_imm_f(1 / y_scale));
>> + } else {
>> + /* Round the float coordinates down to nearest integer by moving to
>> + * UD registers.
>> + */
>> + brw_MOV(&func, Xp, X_f);
>> + brw_MOV(&func, Yp, Y_f);
>> + }
>> SWAP_XY_AND_XPYP();
>> brw_set_compression_control(&func, BRW_COMPRESSION_NONE);
>> }
>> @@ -1418,7 +1514,7 @@ inline int count_trailing_one_bits(unsigned value)
>>
>>
>> void
>> -brw_blorp_blit_program::manual_blend(unsigned num_samples)
>> +brw_blorp_blit_program::manual_blend_average(unsigned num_samples)
>> {
>> if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> mcs_fetch();
>> @@ -1523,6 +1619,135 @@ brw_blorp_blit_program::manual_blend(unsigned num_samples)
>> brw_ENDIF(&func);
>> }
>>
>> +void
>> +brw_blorp_blit_program::manual_blend_linear(unsigned num_samples)
>> +{
>> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> + mcs_fetch();
>
>
> This won't work. The MCS value we fetch has to match up with the pixel that we're sampling from. Since this function samples from different pixels in each iteration of the "for" loop below, the call to mcs_fetch() needs to go inside the loop, and it needs to happen after storing the coordinates in X and Y.
>
I think MCS value fetch will not be required anymore as we are anyway
getting rid of optimization which compares mcs value to zero.
>>
>> +
>> + /* We do this computation by performing the following operations:
>> + *
>> + * In case of 4x, 8x MSAA:
>> + * - Compute the pixel coordinates and sample numbers (a, b, c, d)
>> + * which are later used for interpolation
>> + * - linearly interpolate samples a and b in X
>> + * - linearly interpolate samples c and d in X
>> + * - linearly interpolate the results of last two operations in Y
>> + *
>> + * result = lrp(lrp(a + b) + lrp(c + d))
>> + */
>> + struct brw_reg Xp_f = retype(Xp, BRW_REGISTER_TYPE_F);
>> + struct brw_reg Yp_f = retype(Yp, BRW_REGISTER_TYPE_F);
>> + struct brw_reg t1_f = retype(t1, BRW_REGISTER_TYPE_F);
>> + struct brw_reg t2_f = retype(t2, BRW_REGISTER_TYPE_F);
>> +
>> + for (unsigned i = 0; i < 4; ++i) {
>> + assert(i < ARRAY_SIZE(texture_data));
>> + s_is_zero = false;
>> +
>> + /* Compute pixel coordinates */
>> + brw_ADD(&func, vec16(x_sample_coords), Xp_f,
>> + brw_imm_f((float)(i & 0x1) * 0.5));
>> + brw_ADD(&func, vec16(y_sample_coords), Yp_f,
>> + brw_imm_f((float)(i & 0x2) * (1.0 / num_samples)));
>> + brw_MOV(&func, vec16(X), x_sample_coords);
>> + brw_MOV(&func, vec16(Y), y_sample_coords);
>> +
>> + /* Compute sample index. In case of 4x MSAA, sample index is
>> + * equal to sample number.
>> + */
>
>
> This comment should explain what you mean by "sample index" vs "sample number".
>
>>
>> + brw_FRC(&func, vec16(t1_f), x_sample_coords);
>> + brw_FRC(&func, vec16(t2_f), y_sample_coords);
>> + brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(2.0));
>> + brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(num_samples));
>
>
> How about this instead? It's a little more future proof and it clarifies that what we're doing is computing our position within the grid of samples corresponding to a single pixel:
>
> brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(x_scale));
> brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(x_scale * y_scale));
>
>>
>> + brw_ADD(&func, vec16(t1_f), t1_f, t2_f);
>> + brw_MOV(&func, vec16(S), t1_f);
>> +
>> + if (num_samples == 8) {
>> + /* Map the sample index to a sample number */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L,
>> + S, brw_imm_d(4));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + {
>> + brw_MOV(&func, vec16(t2), brw_imm_d(5));
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(1));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(2));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(2));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(4));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(3));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(6));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + }
>> + brw_ELSE(&func);
>> + {
>> + brw_MOV(&func, vec16(t2), brw_imm_d(0));
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(5));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(3));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(6));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(7));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ,
>> + S, brw_imm_d(7));
>> + brw_MOV(&func, vec16(t2), brw_imm_d(1));
>> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE);
>> + }
>> + brw_ENDIF(&func);
>> + brw_MOV(&func, vec16(S), t2);
>
>
> This seems like a lot of work to accomplish what is effectively a lookup table. If this winds up becoming a performance bottleneck, you might want to consider passing the table in via a push constant, and using indirect addressing to convert sample index to sample number.
>
>>
>> + }
>> + texel_fetch(texture_data[i]);
>> +
>> + if (i == 0 && key->tex_layout == INTEL_MSAA_LAYOUT_CMS) {
>> + /* The Ivy Bridge PRM, Vol4 Part1 p27 (Multisample Control Surface)
>> + * suggests an optimization:
>> + *
>> + * "A simple optimization with probable large return in
>> + * performance is to compare the MCS value to zero (indicating
>> + * all samples are on sample slice 0), and sample only from
>> + * sample slice 0 using ld2dss if MCS is zero."
>> + *
>> + * Note that in the case where the MCS value is zero, sampling from
>> + * sample slice 0 using ld2dss and sampling from sample 0 using
>> + * ld2dms are equivalent (since all samples are on sample slice 0).
>> + * Since we have already sampled from sample 0, all we need to do is
>> + * skip the remaining fetches and averaging if MCS is zero.
>> + */
>> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_NZ,
>> + mcs_data, brw_imm_ud(0));
>> + brw_IF(&func, BRW_EXECUTE_16);
>> + }
>
>
> We can't do this optimization for scaled multisample blits, because it relies on the assumption that all the samples we're blending together belong to the same pixel. I'd recommend just pulling this code out.
>
>>
>> + }
>> +
>> +#define SAMPLE(x, y) offset(texture_data[x], y)
>> + brw_set_access_mode(&func, BRW_ALIGN_16);
>> + for (int index = 3; index > 0; ) {
>> + for (int k = 0; k < 8; ++k)
>> + brw_LRP(&func,
>> + vec8(SAMPLE(index - 1, k)),
>> + offset(x_lerp, k & 1),
>> + SAMPLE(index, k),
>> + SAMPLE(index - 1, k));
>> + index -= 2;
>> + }
>> + for (int k = 0; k < 8; ++k)
>> + brw_LRP(&func,
>> + vec8(SAMPLE(0, k)),
>> + offset(y_lerp, k & 1),
>> + vec8(SAMPLE(2, k)),
>> + vec8(SAMPLE(0, k)));
>> + brw_set_access_mode(&func, BRW_ALIGN_1);
>
>
> I'm confused why we need loops from 0 to 7 here. It looks like you're trying to interpolate each component of the SIMD8 register separately. That shouldn't be necessary.
>
>>
>> +#undef SAMPLE
>> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS)
>> + brw_ENDIF(&func);
>> +}
>> +
>> /**
>> * Emit code to look up a value in the texture using the SAMPLE message (which
>> * does blending of MSAA surfaces).
>> @@ -1535,7 +1760,8 @@ brw_blorp_blit_program::sample(struct brw_reg dst)
>> SAMPLER_MESSAGE_ARG_V_FLOAT
>> };
>>
>> - texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args, ARRAY_SIZE(args));
>> + texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args,
>> + ARRAY_SIZE(args));
>> }
>>
>> /**
>> @@ -1872,6 +2098,17 @@ brw_blorp_blit_params::brw_blorp_blit_params(struct brw_context *brw,
>> wm_prog_key.persample_msaa_dispatch = true;
>> }
>>
>> + /* Scaled blitting or not. */
>> + wm_prog_key.blit_scaled =
>> + ((dst_x1 - dst_x0) == (src_x1 - src_x0) &&
>> + (dst_y1 - dst_y0) == (src_y1 - src_y0)) ? false : true;
>> +
>> + /* Source rectangle dimensions */
>> + wm_prog_key.src_x0 = src_x0;
>> + wm_prog_key.src_y0 = src_y0;
>> + wm_prog_key.src_x1 = src_x1;
>> + wm_prog_key.src_y1 = src_y1;
>> +
>> /* The render path must be configured to use the same number of samples as
>> * the destination buffer.
>> */
>> --
>> 1.8.1.4
>>
>
More information about the mesa-dev
mailing list