[Mesa-dev] [PATCH 4/4] i965/fs: Implement pass to lower instructions of unsupported SIMD width.

Francisco Jerez currojerez at riseup.net
Wed Jul 22 00:31:46 PDT 2015


Jason Ekstrand <jason at jlekstrand.net> writes:

> A few comments below.  Mostly just asking for explanation.
>
> 1-3 are
>
> Reviewed-by: Jason Ekstrand <jason.ekstrand at intel.com>
>
> Obviously, don't merge 4/4 until it actually has users.
> --Jason

Thanks.

>
> On Thu, Jul 16, 2015 at 8:35 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> This lowering pass implements an algorithm to expand SIMDN
>> instructions into a sequence of SIMDM instructions in cases where the
>> hardware doesn't support the original execution size natively for some
>> particular instruction.  The most important use-cases are:
>>
>>  - Lowering send message instructions that don't support SIMD16
>>    natively into SIMD8 (several texturing, framebuffer write and typed
>>    surface operations).
>>
>>  - Lowering messages that don't support SIMD8 natively into SIMD16
>>    (*cough*gen4*cough*).
>>
>>  - 64-bit precision operations (e.g. FP64 and 64-bit integer
>>    multiplication).
>>
>>  - SIMD32.
>>
>> The algorithm works by splitting the sources of the original
>> instruction into chunks of width appropriate for the lowered
>> instructions, and then interleaving the results component-wise into
>> the destination of the original instruction.  The pass is controlled
>> by the get_lowered_simd_width() function that currently just returns
>> the original execution size making the whole pass a no-op for the
>> moment until some user is introduced.
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 142 +++++++++++++++++++++++++++++++++++
>>  src/mesa/drivers/dri/i965/brw_fs.h   |   1 +
>>  2 files changed, 143 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index d031352..eeb6938 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -3204,6 +3204,147 @@ fs_visitor::lower_logical_sends()
>>     return progress;
>>  }
>>
>> +/**
>> + * Get the closest native SIMD width supported by the hardware for instruction
>> + * \p inst.  The instruction will be left untouched by
>> + * fs_visitor::lower_simd_width() if the returned value is equal to the
>> + * original execution size.
>> + */
>> +static unsigned
>> +get_lowered_simd_width(const struct brw_device_info *devinfo,
>> +                       const fs_inst *inst)
>> +{
>> +   switch (inst->opcode) {
>> +   default:
>> +      return inst->exec_size;
>> +   }
>> +}
>> +
>> +/**
>> + * The \p rows array of registers represents a \p num_rows by \p num_columns
>> + * matrix in row-major order, write it in column-major order into the register
>> + * passed as destination.  \p stride gives the separation between matrix
>> + * elements in the input in fs_builder::dispatch_width() units.
>> + */
>> +static void
>> +emit_transpose(const fs_builder &bld,
>> +               const fs_reg &dst, const fs_reg *rows,
>> +               unsigned num_rows, unsigned num_columns, unsigned stride)
>
> I'm not sure what I think about calling this "emit_transpose".  I
> guess it is kind of a transpose, but maybe it's more of a "gather".
> I'm not going to quibble about it though.

*Shrug*, it doesn't only gather the vectors of the rows array (that
would have been emit_collect :P), it copies them out in vertical, just
like a matrix transpose -- assuming you're not horrified by the idea of
considering the argument a matrix.

>
>> +{
>> +   fs_reg *const components = new fs_reg[num_rows * num_columns];
>> +
>> +   for (unsigned i = 0; i < num_columns; ++i) {
>> +      for (unsigned j = 0; j < num_rows; ++j)
>> +         components[num_rows * i + j] = offset(rows[j], bld, stride * i);
>> +   }
>> +
>> +   bld.LOAD_PAYLOAD(dst, components, num_rows * num_columns, 0);
>> +
>> +   delete[] components;
>> +}
>> +
>> +bool
>> +fs_visitor::lower_simd_width()
>> +{
>> +   bool progress = false;
>> +
>> +   foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>> +      const unsigned lower_width = get_lowered_simd_width(devinfo, inst);
>> +
>> +      if (lower_width != inst->exec_size) {
>> +         /* Builder matching the original instruction. */
>> +         const fs_builder ibld = bld.at(block, inst)
>> +                                    .exec_all(inst->force_writemask_all)
>> +                                    .group(inst->exec_size, inst->force_sechalf);
>> +
>> +         /* Split the copies in chunks of the execution width of either the
>> +          * original or the lowered instruction, whichever is lower.
>> +          */
>> +         const unsigned copy_width = MIN2(lower_width, inst->exec_size);
>> +         const unsigned n = inst->exec_size / copy_width;
>> +         const unsigned dst_size = inst->regs_written * REG_SIZE /
>> +            inst->dst.component_size(inst->exec_size);
>> +         fs_reg dsts[4];
>> +
>> +         assert(n > 0 && n <= ARRAY_SIZE(dsts) &&
>> +                !inst->writes_accumulator && !inst->mlen);
>> +
>> +         for (unsigned i = 0; i < n; i++) {
>> +            /* Emit a copy of the original instruction with the lowered width.
>> +             * If the EOT flag was set throw it away except for the last
>> +             * instruction to avoid killing the thread prematurely.
>> +             */
>> +            fs_inst tmp_inst = *inst;
>> +            tmp_inst.exec_size = lower_width;
>> +            tmp_inst.eot = inst->eot && i == n - 1;
>> +
>> +            /* Set exec_all if the lowered width is higher than the original
>> +             * to avoid breaking the compiler invariant that no control
>> +             * flow-masked instruction is wider than the shader's
>> +             * dispatch_width.  Then emit the lowered instruction.
>> +             */
>> +            const fs_builder lbld = ibld.exec_all(lower_width > inst->exec_size)
>> +                                        .group(lower_width, i);
>> +            fs_inst *split_inst = lbld.emit(tmp_inst);
>
> Why are  you emitting the split instruction before the source
> transposes?  Don't they need to go first?  Maybe I'm missing
> something.
>
I guess it doesn't really matter in which order they are inserted as
long as they end up in the correct order inside the program (and as you
can see below there is an at() call making sure that the builder used to
emit the transposes is pointing before the split instruction).

No important reason really, other than to be able to assign the
temporaries allocated below to the split instruction sources directly.

>> +
>> +            for (unsigned j = 0; j < inst->sources; j++) {
>> +               if (inst->src[j].file != BAD_FILE &&
>> +                   !is_uniform(inst->src[j])) {
>> +                  /* Get the i-th copy_width-wide chunk of the source. */
>> +                  const fs_reg src = horiz_offset(inst->src[j], copy_width * i);
>> +                  const unsigned src_size = inst->regs_read(j) * REG_SIZE /
>> +                     inst->src[j].component_size(inst->exec_size);
>> +
>> +                  /* Use a trivial transposition to copy one every n
>> +                   * copy_width-wide components of the register into a
>> +                   * temporary passed as source to the lowered instruction.
>> +                   */
>> +                  split_inst->src[j] = lbld.vgrf(inst->src[j].type, src_size);
>> +                  emit_transpose(lbld.group(copy_width, 0).at(block, split_inst),
>> +                                 split_inst->src[j], &src, 1, src_size, n);
>
> Shouldn't this be group(copy_width, i)?
>

I'm already choosing the i-th channel group in the definition of lbld
while emitting the split instruction, this group() call is just to make
sure that the transposes don't copy data into the garbage channels of
the temporary in cases where copy_width is less than lower_width.

>> +               }
>> +            }
>> +
>> +            if (inst->regs_written) {
>> +               /* Allocate enough space to hold the result of the lowered
>> +                * instruction and fix up the number of registers written.
>> +                */
>> +               split_inst->dst = dsts[i] =
>> +                  lbld.vgrf(inst->dst.type, dst_size);
>> +               split_inst->regs_written =
>> +                  DIV_ROUND_UP(inst->regs_written * lower_width,
>> +                               inst->exec_size);
>> +            }
>> +         }
>> +
>> +         if (inst->regs_written) {
>> +            /* Distance between useful channels in the temporaries, skipping
>> +             * garbage if the lowered instruction is wider than the original.
>> +             */
>> +            const unsigned m = lower_width / copy_width;
>> +
>> +            /* Interleave the components of the result from the lowered
>> +             * instructions.  We need to set exec_all() when copying more than
>> +             * one half per component, because LOAD_PAYLOAD (in terms of which
>> +             * emit_transpose is implemented) can only use the same channel
>> +             * enable signals for all of its non-header sources.
>> +             */
>> +            emit_transpose(ibld.exec_all(inst->exec_size > copy_width)
>> +                               .group(copy_width, 0),
>> +                           inst->dst, dsts, n, dst_size, m);
>> +         }
>> +
>> +         inst->remove(block);
>> +         progress = true;
>> +      }
>> +   }
>> +
>> +   if (progress)
>> +      invalidate_live_intervals();
>> +
>> +   return progress;
>> +}
>> +
>>  void
>>  fs_visitor::dump_instructions()
>>  {
>> @@ -3655,6 +3796,7 @@ fs_visitor::optimize()
>>     int iteration = 0;
>>     int pass_num = 0;
>>
>> +   OPT(lower_simd_width);
>>     OPT(lower_logical_sends);
>>
>>     do {
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>> index f3850d1..9582648 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -184,6 +184,7 @@ public:
>>     bool lower_load_payload();
>>     bool lower_logical_sends();
>>     bool lower_integer_multiplication();
>> +   bool lower_simd_width();
>>     bool opt_combine_constants();
>>
>>     void emit_dummy_fs();
>> --
>> 2.4.3
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150722/4ef9744c/attachment-0001.sig>


More information about the mesa-dev mailing list