[Mesa-dev] [PATCH 4/4] i965/fs: Implement pass to lower instructions of unsupported SIMD width.

Jason Ekstrand jason at jlekstrand.net
Thu Jul 23 08:41:36 PDT 2015


On Thu, Jul 23, 2015 at 3:55 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
>
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> On Wed, Jul 22, 2015 at 10:05 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>
>>>>> On Wed, Jul 22, 2015 at 12:31 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>>>
>>>>>>> A few comments below.  Mostly just asking for explanation.
>>>>>>>
>>>>>>> 1-3 are
>>>>>>>
>>>>>>> Reviewed-by: Jason Ekstrand <jason.ekstrand at intel.com>
>>>>>>>
>>>>>>> Obviously, don't merge 4/4 until it actually has users.
>>>>>>> --Jason
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 16, 2015 at 8:35 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>>>> This lowering pass implements an algorithm to expand SIMDN
>>>>>>>> instructions into a sequence of SIMDM instructions in cases where the
>>>>>>>> hardware doesn't support the original execution size natively for some
>>>>>>>> particular instruction.  The most important use-cases are:
>>>>>>>>
>>>>>>>>  - Lowering send message instructions that don't support SIMD16
>>>>>>>>    natively into SIMD8 (several texturing, framebuffer write and typed
>>>>>>>>    surface operations).
>>>>>>>>
>>>>>>>>  - Lowering messages that don't support SIMD8 natively into SIMD16
>>>>>>>>    (*cough*gen4*cough*).
>>>>>>>>
>>>>>>>>  - 64-bit precision operations (e.g. FP64 and 64-bit integer
>>>>>>>>    multiplication).
>>>>>>>>
>>>>>>>>  - SIMD32.
>>>>>>>>
>>>>>>>> The algorithm works by splitting the sources of the original
>>>>>>>> instruction into chunks of width appropriate for the lowered
>>>>>>>> instructions, and then interleaving the results component-wise into
>>>>>>>> the destination of the original instruction.  The pass is controlled
>>>>>>>> by the get_lowered_simd_width() function that currently just returns
>>>>>>>> the original execution size, making the whole pass a no-op for the
>>>>>>>> moment until some user is introduced.
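A minimal data-level sketch of the splitting arithmetic described above (the
names are illustrative, not from the patch): each lowered copy of the
instruction covers one group of channels, and data moves in chunks of
whichever width is smaller.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the splitting arithmetic (illustrative names, not the actual
// pass): each lowered copy of an instruction covers one group of channels.
struct channel_group {
   unsigned first_channel;
   unsigned width;
};

std::vector<channel_group>
split_exec_size(unsigned exec_size, unsigned lower_width)
{
   /* Data moves in chunks of whichever width is smaller: a SIMD16
    * instruction lowered to SIMD8 splits into two SIMD8 copies, while a
    * SIMD8 instruction lowered to SIMD16 (the gen4 case) keeps a single
    * copy whose extra channels are garbage.
    */
   const unsigned copy_width = std::min(lower_width, exec_size);
   const unsigned n = exec_size / copy_width;

   std::vector<channel_group> groups;
   for (unsigned i = 0; i < n; i++)
      groups.push_back({i * copy_width, lower_width});
   return groups;
}
```

For example, split_exec_size(16, 8) yields two SIMD8 groups starting at
channels 0 and 8, while split_exec_size(8, 16) yields a single SIMD16 group.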
>>>>>>>> ---
>>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 142 +++++++++++++++++++++++++++++++++++
>>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.h   |   1 +
>>>>>>>>  2 files changed, 143 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> index d031352..eeb6938 100644
>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> @@ -3204,6 +3204,147 @@ fs_visitor::lower_logical_sends()
>>>>>>>>     return progress;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * Get the closest native SIMD width supported by the hardware for instruction
>>>>>>>> + * \p inst.  The instruction will be left untouched by
>>>>>>>> + * fs_visitor::lower_simd_width() if the returned value is equal to the
>>>>>>>> + * original execution size.
>>>>>>>> + */
>>>>>>>> +static unsigned
>>>>>>>> +get_lowered_simd_width(const struct brw_device_info *devinfo,
>>>>>>>> +                       const fs_inst *inst)
>>>>>>>> +{
>>>>>>>> +   switch (inst->opcode) {
>>>>>>>> +   default:
>>>>>>>> +      return inst->exec_size;
>>>>>>>> +   }
>>>>>>>> +}
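As a hypothetical illustration of what a future user of this hook might look
like (the opcode name and the SIMD8 limit below are made up for the example
and are not part of this patch):

```cpp
#include <algorithm>

// Hypothetical future user of the get_lowered_simd_width() hook.  The
// opcode name and the SIMD8 limit are illustrative only, not in the patch.
enum example_opcode {
   EXAMPLE_OPCODE_TYPED_SURFACE_WRITE,
   EXAMPLE_OPCODE_OTHER
};

struct example_inst {
   example_opcode opcode;
   unsigned exec_size;
};

unsigned
example_lowered_simd_width(const example_inst &inst)
{
   switch (inst.opcode) {
   case EXAMPLE_OPCODE_TYPED_SURFACE_WRITE:
      /* Pretend this message only supports SIMD8 natively, so a SIMD16
       * instance would get split in two by the pass.
       */
      return std::min(8u, inst.exec_size);
   default:
      /* Anything else is left untouched by the pass. */
      return inst.exec_size;
   }
}
```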
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * The \p rows array of registers represents a \p num_rows by \p num_columns
>>>>>>>> + * matrix in row-major order, write it in column-major order into the register
>>>>>>>> + * passed as destination.  \p stride gives the separation between matrix
>>>>>>>> + * elements in the input in fs_builder::dispatch_width() units.
>>>>>>>> + */
>>>>>>>> +static void
>>>>>>>> +emit_transpose(const fs_builder &bld,
>>>>>>>> +               const fs_reg &dst, const fs_reg *rows,
>>>>>>>> +               unsigned num_rows, unsigned num_columns, unsigned stride)
>>>>>>>
>>>>>>> I'm not sure what I think about calling this "emit_transpose".  I
>>>>>>> guess it is kind of a transpose, but maybe it's more of a "gather".
>>>>>>> I'm not going to quibble about it though.
>>>>>>
>>>>>> *Shrug*, it doesn't only gather the vectors of the rows array (that
>>>>>> would have been emit_collect :P), it copies them out vertically, just
>>>>>> like a matrix transpose -- assuming you're not horrified by the idea of
>>>>>> considering the argument a matrix.
>>>>>
>>>>> That's fine.  Feel free to ignore that suggestion.
>>>>>
>>>>>>>
>>>>>>>> +{
>>>>>>>> +   fs_reg *const components = new fs_reg[num_rows * num_columns];
>>>>>>>> +
>>>>>>>> +   for (unsigned i = 0; i < num_columns; ++i) {
>>>>>>>> +      for (unsigned j = 0; j < num_rows; ++j)
>>>>>>>> +         components[num_rows * i + j] = offset(rows[j], bld, stride * i);
>>>>>>>> +   }
>>>>>>>> +
>>>>>>>> +   bld.LOAD_PAYLOAD(dst, components, num_rows * num_columns, 0);
>>>>>>>> +
>>>>>>>> +   delete[] components;
>>>>>>>> +}
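A data-only model of the index mapping above may make the "transpose"
reading clearer.  This simulates which (row, component) each destination
slot of the LOAD_PAYLOAD reads from; it is not IR code:

```cpp
#include <vector>

// Data-only model of emit_transpose()'s index mapping: for each slot of
// the LOAD_PAYLOAD destination, record which row register and which
// component (in dispatch_width units) it is sourced from.
struct transpose_src {
   unsigned row;
   unsigned component;
};

std::vector<transpose_src>
transpose_mapping(unsigned num_rows, unsigned num_columns, unsigned stride)
{
   std::vector<transpose_src> components(num_rows * num_columns);

   /* Same loop structure as the pass: walk the matrix column by column so
    * the row-major input ends up written out column-major.
    */
   for (unsigned i = 0; i < num_columns; ++i) {
      for (unsigned j = 0; j < num_rows; ++j)
         components[num_rows * i + j] = {j, stride * i};
   }
   return components;
}
```

For a 2x2 matrix with stride 1 the destination reads row 0 component 0,
row 1 component 0, row 0 component 1, row 1 component 1 -- i.e. the columns
are gathered vertically, as in a matrix transpose.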
>>>>>>>> +
>>>>>>>> +bool
>>>>>>>> +fs_visitor::lower_simd_width()
>>>>>>>> +{
>>>>>>>> +   bool progress = false;
>>>>>>>> +
>>>>>>>> +   foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>>>>>>> +      const unsigned lower_width = get_lowered_simd_width(devinfo, inst);
>>>>>>>> +
>>>>>>>> +      if (lower_width != inst->exec_size) {
>>>>>>>> +         /* Builder matching the original instruction. */
>>>>>>>> +         const fs_builder ibld = bld.at(block, inst)
>>>>>>>> +                                    .exec_all(inst->force_writemask_all)
>>>>>>>> +                                    .group(inst->exec_size, inst->force_sechalf);
>>>>>>>> +
>>>>>>>> +         /* Split the copies in chunks of the execution width of either the
>>>>>>>> +          * original or the lowered instruction, whichever is lower.
>>>>>>>> +          */
>>>>>>>> +         const unsigned copy_width = MIN2(lower_width, inst->exec_size);
>>>>>>>> +         const unsigned n = inst->exec_size / copy_width;
>>>>>>>> +         const unsigned dst_size = inst->regs_written * REG_SIZE /
>>>>>>>> +            inst->dst.component_size(inst->exec_size);
>>>>>>>> +         fs_reg dsts[4];
>>>>>>>> +
>>>>>>>> +         assert(n > 0 && n <= ARRAY_SIZE(dsts) &&
>>>>>>>> +                !inst->writes_accumulator && !inst->mlen);
>>>>>>>> +
>>>>>>>> +         for (unsigned i = 0; i < n; i++) {
>>>>>>>> +            /* Emit a copy of the original instruction with the lowered width.
>>>>>>>> +             * If the EOT flag was set throw it away except for the last
>>>>>>>> +             * instruction to avoid killing the thread prematurely.
>>>>>>>> +             */
>>>>>>>> +            fs_inst tmp_inst = *inst;
>>>>>>>> +            tmp_inst.exec_size = lower_width;
>>>>>>>> +            tmp_inst.eot = inst->eot && i == n - 1;
>>>>>>>> +
>>>>>>>> +            /* Set exec_all if the lowered width is higher than the original
>>>>>>>> +             * to avoid breaking the compiler invariant that no control
>>>>>>>> +             * flow-masked instruction is wider than the shader's
>>>>>>>> +             * dispatch_width.  Then emit the lowered instruction.
>>>>>
>>>>> I don't think this is a standard invariant.  The standard invariant is
>>>>> "set exec_all if you're doing something that doesn't match the natural
>>>>> exec mask".  In fact, in the gen4 texturing code (which is what this
>>>>> is primarily for), we leave the writemask alone.
>>>>
>>>> The gen4 texturing code would have likely caused problems with
>>>> optimization passes that use the builder to emit new instructions based
>>>> on the channel enables of already existing instructions, precisely
>>>> because it used to break the invariant you mention (do something with an
>>>> exec mask higher than the natural without force_writemask_all).
>>>
>>> What do you mean by "higher than the natural"?
>>
>> I was trying to use your own words :P.  The Gen4 texturing code was
>> using an exec mask larger than the shader's dispatch_width-wide one, so
>> I assume it doesn't fall into the category of instructions matching the
>> "natural" exec mask.
>>
>>> What is an optimization pass doing with exec_size that would get
>>> messed up?
>>>
>>> The optimization passes and most of the compiler beyond the initial
>>> NIR -> FS pass shouldn't care about dispatch_width.  It should only
>>> ever care about the exec_sizes of particular instructions.  Hey, look,
>>> a SIMD16 instruction!  No big deal.  I really don't know what these
>>> mythical optimization problems would be.
>>>
>> The problem is that optimization passes start off with a
>> dispatch_width-wide fs_builder, which isn't usable to emit non-exec_all
>> SIMD16 instructions in a SIMD8 shader, so they're going to hit an
>> assertion if they accidentally try to apply some transformation or emit
>> new instructions based on the exec_size of the original texturing
>> instruction.
>>
>> A (somewhat disturbing) solution would be not to hand the original
>> dispatch_width-wide fs_builder to optimization passes, but a, say,
>> SIMD32 one, so they would be able to emit non-exec_all instructions of
>> arbitrary exec_size, with potentially undefined results.  I can do that
>> as PATCH 3.5 if you insist on getting rid of the exec_all() call in this
>> case.
>
> Heh...  While doing this I noticed a few cases during
> lowering/optimization in which we would emit instructions without
> initializing the execution size correctly.  Probably pre-existing bugs.
> I'll fix them in a separate series that this one will depend on, and drop the
> exec_all() call from this patch.

For the sake of getting some patches landed, I'm ok with landing this
one as-is and then fixing it up once exec_all isn't needed.
--Jason

>>
>>>>>  If you do a SIMD16 instruction in SIMD8 mode without exec_all, you
>>>>> get garbage in the other 8 channels but it's otherwise fine.  For
>>>>> instructions we need to "expand", this should be fine since we're
>>>>> throwing away the extra channels.
>>>>>
>>>>> On the other hand, we don't know what the other half of the writemask
>>>>> will be (experimentation indicates that it's *not* a duplicate of the
>>>>> first half) so not setting exec_all doesn't really gain us anything.
>>>>> Unless, of course, we could maybe save some memory bandwidth in
>>>>> texturing instructions if some of the channels are disabled.  I doubt
>>>>> that matters much.
>>>>>
>>>> Yeah, in any case it can only have an effect on Gen4, and it will change
>>>> from writing garbage nondeterministically to some channels of the second
>>>> half to writing garbage deterministically.
>>>>
>>>>> My inclination would be to leave the exec_all alone since it's not
>>>>> needed, but I guess I'm not going to fight it too hard.
>>>>>
>>>> I wouldn't feel comfortable doing that; it's surely going to lead to
>>>> assertion failures in the future which will only be reproducible on Gen4
>>>> and will therefore not be obvious for the casual contributor unless it's
>>>> being tested on that specifically.  I guess the alternative would be to
>>>> relax the invariant, but it's proven to catch legitimate mistakes and
>>>> what the Gen4 code was doing had rather dubious semantics anyway, so I
>>>> doubt it's justified to change it...
>>>>
>>>>>>>> +             */
>>>>>>>> +            const fs_builder lbld = ibld.exec_all(lower_width > inst->exec_size)
>>>>>>>> +                                        .group(lower_width, i);
>>>>>>>> +            fs_inst *split_inst = lbld.emit(tmp_inst);
>>>>>>>
>>>>>>> Why are  you emitting the split instruction before the source
>>>>>>> transposes?  Don't they need to go first?  Maybe I'm missing
>>>>>>> something.
>>>>>>>
>>>>>> I guess it doesn't really matter in which order they are inserted as
>>>>>> long as they end up in the correct order inside the program (and as you
>>>>>> can see below there is an at() call making sure that the builder used to
>>>>>> emit the transposes is pointing before the split instruction).
>>>>>>
>>>>>> No important reason really, other than to be able to assign the
>>>>>> temporaries allocated below to the split instruction sources directly.
>>>>>
>>>>> That's fine.  It would be nice if you left a comment to that effect.
>>>>> Otherwise, it looks like you're emitting the send and then the moves.
>>>>> It's really easy when reading the code to miss the at() call below.
>>>>>
>>>> Sure.  It would also be easy to swap the order of the transposes and the
>>>> split_inst emission if you find it easier to understand that way.  I'll
>>>> just go do that instead of the comment if you have no objection.
>>>>
>>>>>>>> +
>>>>>>>> +            for (unsigned j = 0; j < inst->sources; j++) {
>>>>>>>> +               if (inst->src[j].file != BAD_FILE &&
>>>>>>>> +                   !is_uniform(inst->src[j])) {
>>>>>>>> +                  /* Get the i-th copy_width-wide chunk of the source. */
>>>>>>>> +                  const fs_reg src = horiz_offset(inst->src[j], copy_width * i);
>>>>>>>> +                  const unsigned src_size = inst->regs_read(j) * REG_SIZE /
>>>>>>>> +                     inst->src[j].component_size(inst->exec_size);
>>>>>>>> +
>>>>>>>> +                  /* Use a trivial transposition to copy one out of every n
>>>>>>>> +                   * copy_width-wide components of the register into a
>>>>>>>> +                   * temporary passed as source to the lowered instruction.
>>>>>>>> +                   */
>>>>>>>> +                  split_inst->src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>>>>> +                  emit_transpose(lbld.group(copy_width, 0).at(block, split_inst),
>>>>>>>> +                                 split_inst->src[j], &src, 1, src_size, n);
>>>>>>>
>>>>>>> Shouldn't this be group(copy_width, i)?
>>>>>>>
>>>>>>
>>>>>> I'm already choosing the i-th channel group in the definition of lbld
>>>>>> while emitting the split instruction, this group() call is just to make
>>>>>> sure that the transposes don't copy data into the garbage channels of
>>>>>> the temporary in cases where copy_width is less than lower_width.
>>>>>
>>>>> Right.  That makes sense.
>>>>>
>>>>>>>> +               }
>>>>>>>> +            }
>>>>>>>> +
>>>>>>>> +            if (inst->regs_written) {
>>>>>>>> +               /* Allocate enough space to hold the result of the lowered
>>>>>>>> +                * instruction and fix up the number of registers written.
>>>>>>>> +                */
>>>>>>>> +               split_inst->dst = dsts[i] =
>>>>>>>> +                  lbld.vgrf(inst->dst.type, dst_size);
>>>>>>>> +               split_inst->regs_written =
>>>>>>>> +                  DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>>>> +                               inst->exec_size);
>>>>>>>> +            }
>>>>>>>> +         }
>>>>>>>> +
>>>>>>>> +         if (inst->regs_written) {
>>>>>>>> +            /* Distance between useful channels in the temporaries, skipping
>>>>>>>> +             * garbage if the lowered instruction is wider than the original.
>>>>>>>> +             */
>>>>>>>> +            const unsigned m = lower_width / copy_width;
>>>>>>>> +
>>>>>>>> +            /* Interleave the components of the result from the lowered
>>>>>>>> +             * instructions.  We need to set exec_all() when copying more than
>>>>>>>> +             * one half per component, because LOAD_PAYLOAD (in terms of which
>>>>>>>> +             * emit_transpose is implemented) can only use the same channel
>>>>>>>> +             * enable signals for all of its non-header sources.
>>>>>>>> +             */
>>>>>>>> +            emit_transpose(ibld.exec_all(inst->exec_size > copy_width)
>>>>>>>> +                               .group(copy_width, 0),
>>>>>>>> +                           inst->dst, dsts, n, dst_size, m);
>>>>>
>>>>> While we're on the exec_all topic, I don't think it's needed here
>>>>> either.  In the case where exec_size > copy_width, we're throwing away
>>>>> the second half.  Therefore, the channels with the wrong exec mask
>>>>> will either not get emitted at all or dead-code will delete them (I'm
>>>>> not sure which without thinking about it harder).  In either case, the
>>>>> MOV's that do matter will have the right exec mask and exec_all
>>>>> doesn't gain us anything.
>>>>
>>>> Nope, the "exec_size > copy_width" case is the usual (non-gen4) case.
>>>> Nothing can be thrown away in that case because both halves have to be
>>>> copied interleaved into the destination register, so whatever execution
>>>> mask LOAD_PAYLOAD uses it has to work for both halves, which is only
>>>> possible if force_writemask_all is set.
>>>
>>> Right... Not a huge fan of exec_all'd LOAD_PAYLOAD, but we don't have
>>> a ZIP instruction so I won't complain too much.
>>>
>>>>>>>> +         }
>>>>>>>> +
>>>>>>>> +         inst->remove(block);
>>>>>>>> +         progress = true;
>>>>>>>> +      }
>>>>>>>> +   }
>>>>>>>> +
>>>>>>>> +   if (progress)
>>>>>>>> +      invalidate_live_intervals();
>>>>>>>> +
>>>>>>>> +   return progress;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  void
>>>>>>>>  fs_visitor::dump_instructions()
>>>>>>>>  {
>>>>>>>> @@ -3655,6 +3796,7 @@ fs_visitor::optimize()
>>>>>>>>     int iteration = 0;
>>>>>>>>     int pass_num = 0;
>>>>>>>>
>>>>>>>> +   OPT(lower_simd_width);
>>>>>>>>     OPT(lower_logical_sends);
>>>>>>>>
>>>>>>>>     do {
>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>> index f3850d1..9582648 100644
>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>> @@ -184,6 +184,7 @@ public:
>>>>>>>>     bool lower_load_payload();
>>>>>>>>     bool lower_logical_sends();
>>>>>>>>     bool lower_integer_multiplication();
>>>>>>>> +   bool lower_simd_width();
>>>>>>>>     bool opt_combine_constants();
>>>>>>>>
>>>>>>>>     void emit_dummy_fs();
>>>>>>>> --
>>>>>>>> 2.4.3
>>>>>>>>

