[Mesa-dev] [PATCH 4/4] i965/fs: Implement pass to lower instructions of unsupported SIMD width.
Francisco Jerez
currojerez at riseup.net
Thu Jul 23 09:24:40 PDT 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Thu, Jul 23, 2015 at 3:55 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Francisco Jerez <currojerez at riseup.net> writes:
>>
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>>> On Wed, Jul 22, 2015 at 10:05 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>>
>>>>>> On Wed, Jul 22, 2015 at 12:31 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>>>>
>>>>>>>> A few comments below. Mostly just asking for explanation.
>>>>>>>>
>>>>>>>> 1-3 are
>>>>>>>>
>>>>>>>> Reviewed-by: Jason Ekstrand <jason.ekstrand at intel.com>
>>>>>>>>
>>>>>>>> Obviously, don't merge 4/4 until it actually has users.
>>>>>>>> --Jason
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2015 at 8:35 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>>>>> This lowering pass implements an algorithm to expand SIMDN
>>>>>>>>> instructions into a sequence of SIMDM instructions in cases where the
>>>>>>>>> hardware doesn't support the original execution size natively for some
>>>>>>>>> particular instruction. The most important use-cases are:
>>>>>>>>>
>>>>>>>>> - Lowering send message instructions that don't support SIMD16
>>>>>>>>> natively into SIMD8 (several texturing, framebuffer write and typed
>>>>>>>>> surface operations).
>>>>>>>>>
>>>>>>>>> - Lowering messages that don't support SIMD8 natively into SIMD16
>>>>>>>>> (*cough*gen4*cough*).
>>>>>>>>>
>>>>>>>>> - 64-bit precision operations (e.g. FP64 and 64-bit integer
>>>>>>>>> multiplication).
>>>>>>>>>
>>>>>>>>> - SIMD32.
>>>>>>>>>
>>>>>>>>> The algorithm works by splitting the sources of the original
>>>>>>>>> instruction into chunks of width appropriate for the lowered
>>>>>>>>> instructions, and then interleaving the results component-wise into
>>>>>>>>> the destination of the original instruction. The pass is controlled
>>>>>>>>> by the get_lowered_simd_width() function that currently just returns
>>>>>>>>> the original execution size making the whole pass a no-op for the
>>>>>>>>> moment until some user is introduced.
>>>>>>>>> ---
>>>>>>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 142 +++++++++++++++++++++++++++++++++++
>>>>>>>>> src/mesa/drivers/dri/i965/brw_fs.h | 1 +
>>>>>>>>> 2 files changed, 143 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>>> index d031352..eeb6938 100644
>>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>>> @@ -3204,6 +3204,147 @@ fs_visitor::lower_logical_sends()
>>>>>>>>> return progress;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * Get the closest native SIMD width supported by the hardware for instruction
>>>>>>>>> + * \p inst. The instruction will be left untouched by
>>>>>>>>> + * fs_visitor::lower_simd_width() if the returned value is equal to the
>>>>>>>>> + * original execution size.
>>>>>>>>> + */
>>>>>>>>> +static unsigned
>>>>>>>>> +get_lowered_simd_width(const struct brw_device_info *devinfo,
>>>>>>>>> + const fs_inst *inst)
>>>>>>>>> +{
>>>>>>>>> + switch (inst->opcode) {
>>>>>>>>> + default:
>>>>>>>>> + return inst->exec_size;
>>>>>>>>> + }
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * The \p rows array of registers represents a \p num_rows by \p num_columns
>>>>>>>>> + * matrix in row-major order, write it in column-major order into the register
>>>>>>>>> + * passed as destination. \p stride gives the separation between matrix
>>>>>>>>> + * elements in the input in fs_builder::dispatch_width() units.
>>>>>>>>> + */
>>>>>>>>> +static void
>>>>>>>>> +emit_transpose(const fs_builder &bld,
>>>>>>>>> + const fs_reg &dst, const fs_reg *rows,
>>>>>>>>> + unsigned num_rows, unsigned num_columns, unsigned stride)
>>>>>>>>
>>>>>>>> I'm not sure what I think about calling this "emit_transpose". I
>>>>>>>> guess it is kind of a transpose, but maybe it's more of a "gather".
>>>>>>>> I'm not going to quibble about it though.
>>>>>>>
>>>>>>> *Shrug*, it doesn't only gather the vectors of the rows array (that
>>>>>>> would have been emit_collect :P), it copies them out in vertical, just
>>>>>>> like a matrix transpose -- assuming you're not horrified by the idea of
>>>>>>> considering the argument a matrix.
>>>>>>
>>>>>> That's fine. Feel free to ignore that suggestion.
>>>>>>
>>>>>>>>
>>>>>>>>> +{
>>>>>>>>> + fs_reg *const components = new fs_reg[num_rows * num_columns];
>>>>>>>>> +
>>>>>>>>> + for (unsigned i = 0; i < num_columns; ++i) {
>>>>>>>>> + for (unsigned j = 0; j < num_rows; ++j)
>>>>>>>>> + components[num_rows * i + j] = offset(rows[j], bld, stride * i);
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + bld.LOAD_PAYLOAD(dst, components, num_rows * num_columns, 0);
>>>>>>>>> +
>>>>>>>>> + delete[] components;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +bool
>>>>>>>>> +fs_visitor::lower_simd_width()
>>>>>>>>> +{
>>>>>>>>> + bool progress = false;
>>>>>>>>> +
>>>>>>>>> + foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>>>>>>>> + const unsigned lower_width = get_lowered_simd_width(devinfo, inst);
>>>>>>>>> +
>>>>>>>>> + if (lower_width != inst->exec_size) {
>>>>>>>>> + /* Builder matching the original instruction. */
>>>>>>>>> + const fs_builder ibld = bld.at(block, inst)
>>>>>>>>> + .exec_all(inst->force_writemask_all)
>>>>>>>>> + .group(inst->exec_size, inst->force_sechalf);
>>>>>>>>> +
>>>>>>>>> + /* Split the copies in chunks of the execution width of either the
>>>>>>>>> + * original or the lowered instruction, whichever is lower.
>>>>>>>>> + */
>>>>>>>>> + const unsigned copy_width = MIN2(lower_width, inst->exec_size);
>>>>>>>>> + const unsigned n = inst->exec_size / copy_width;
>>>>>>>>> + const unsigned dst_size = inst->regs_written * REG_SIZE /
>>>>>>>>> + inst->dst.component_size(inst->exec_size);
>>>>>>>>> + fs_reg dsts[4];
>>>>>>>>> +
>>>>>>>>> + assert(n > 0 && n <= ARRAY_SIZE(dsts) &&
>>>>>>>>> + !inst->writes_accumulator && !inst->mlen);
>>>>>>>>> +
>>>>>>>>> + for (unsigned i = 0; i < n; i++) {
>>>>>>>>> + /* Emit a copy of the original instruction with the lowered width.
>>>>>>>>> + * If the EOT flag was set throw it away except for the last
>>>>>>>>> + * instruction to avoid killing the thread prematurely.
>>>>>>>>> + */
>>>>>>>>> + fs_inst tmp_inst = *inst;
>>>>>>>>> + tmp_inst.exec_size = lower_width;
>>>>>>>>> + tmp_inst.eot = inst->eot && i == n - 1;
>>>>>>>>> +
>>>>>>>>> + /* Set exec_all if the lowered width is higher than the original
>>>>>>>>> + * to avoid breaking the compiler invariant that no control
>>>>>>>>> + * flow-masked instruction is wider than the shader's
>>>>>>>>> + * dispatch_width. Then emit the lowered instruction.
>>>>>>
>>>>>> I don't think this is a standard invariant. The standard invariant is
>>>>>> "set exec_all if you're doing something that doesn't match the natural
>>>>>> exec mask". In fact, in the gen4 texturing code (which is what this
>>>>>> is primarily for), we leave the writemask alone.
>>>>>
>>>>> The gen4 texturing code would have likely caused problems with
>>>>> optimization passes that use the builder to emit new instructions based
>>>>> on the channel enables of already existing instructions, precisely
>>>>> because it used to break the invariant you mention (do something with an
>>>>> exec mask higher than the natural without force_writemask_all).
>>>>
>>>> What do you mean by "higher than the natural"?
>>>
>>> I was trying to use your own words :P. The Gen4 texturing code was
>>> using an exec mask larger than the shader's dispatch_width-wide one, so
>>> I assume it doesn't fall into the category of instructions matching the
>>> "natural" exec mask.
>>>
>>>> What is an optimization pass doing with exec_size that would get
>>>> messed up?
>>>>
>>>> The optimization passes and most of the compiler beyond the initial
>>>> NIR -> FS pass shouldn't care about dispatch_width. It should only
>>>> ever care about the exec_sizes of particular instructions. Hey, look,
>>>> a SIMD16 instruction! No big deal. I really don't know what these
>>>> mythical optimization problems would be.
>>>>
>>> The problem is that optimization passes start off with a
>>> dispatch_width-wide fs_builder, which isn't usable to emit non-exec_all
>>> SIMD16 instructions in a SIMD8 shader, so they're going to hit an
>>> assertion if they accidentally try to apply some transformation or emit
>>> new instructions based on the exec_size of the original texturing
>>> instruction.
>>>
>>> A (somewhat disturbing) solution would be not to hand the original
>>> dispatch_width-wide fs_builder to optimization passes, but a, say,
>>> SIMD32 one, so they would be able to emit non-exec_all instructions of
>>> arbitrary exec_size, with potentially undefined results. I can do that
>>> as PATCH 3.5 if you insist in getting rid of the exec_all() call in this
>>> case.
>>
>> Heh... While doing this I noticed a few cases during
>> lowering/optimization in which we would emit instructions without
>> initializing the execution size correctly. Probably pre-existing bugs.
>> I'll fix them in a separate series this one will depend on, and drop the
>> exec_all() call from this patch.
>
> For the sake of getting some patches landed, I'm ok with landing this
> one as-is and then fixing it up once exec_all isn't needed.
Sounds good to me, I'll do it as a follow-up.
> --Jason
>
>>>
>>>>>> If you do a SIMD16 instruction in SIMD8 mode without exec_all, you
>>>>>> get garbage in the other 8 channels but it's otherwise fine. For
>>>>>> instructions we need to "expand", this should be fine since we're
>>>>>> throwing away the extra channels.
>>>>>>
>>>>>> On the other hand, we don't know what the other half of the writemask
>>>>>> will be (experimentation indicates that it's *not* a duplicate of the
>>>>>> first half) so not setting exec_all doesn't really gain us anything.
>>>>>> Unless, of course, we could maybe save some memory bandwidth in
>>>>>> texturing instructions if some of the channels are disabled. I doubt
>>>>>> that matters much.
>>>>>>
>>>>> Yeah, in any case it can only have an effect on Gen4, and it will change
>>>>> From writing garbage nondeterministically to some channels of the second
>>>>> half to writing garbage deterministically.
>>>>>
>>>>>> My inclination would be to leave the exec_all alone since it's not
>>>>>> needed, but I guess I'm not going to fight it too hard.
>>>>>>
>>>>> I wouldn't feel comfortable with doing that, it's surely going lead to
>>>>> assertion failures in the future which will only be reproducible on Gen4
>>>>> and will therefore not be obvious for the casual contributor unless it's
>>>>> being tested on that specifically. I guess the alternative would be to
>>>>> relax the invariant, but it's proven to catch legitimate mistakes and
>>>>> what the Gen4 code was doing had rather dubious semantics anyway, so I
>>>>> doubt it's justified to change it...
>>>>>
>>>>>>>>> + */
>>>>>>>>> + const fs_builder lbld = ibld.exec_all(lower_width > inst->exec_size)
>>>>>>>>> + .group(lower_width, i);
>>>>>>>>> + fs_inst *split_inst = lbld.emit(tmp_inst);
>>>>>>>>
>>>>>>>> Why are you emitting the split instruction before the source
>>>>>>>> transposes? Don't they need to go first? Maybe I'm missing
>>>>>>>> something.
>>>>>>>>
>>>>>>> I guess it doesn't really matter in which order they are inserted as
>>>>>>> long as they end up in the correct order inside the program (and as you
>>>>>>> can see below there is an at() call making sure that the builder used to
>>>>>>> emit the transposes is pointing before the split instruction).
>>>>>>>
>>>>>>> No important reason really, other than to be able to assign the
>>>>>>> temporaries allocated below to the split instruction sources directly.
>>>>>>
>>>>>> That's fine. It would be nice if you left a comment to that effect.
>>>>>> Otherwise, it looks like you're emitting the send and then the moves.
>>>>>> It's really easy when reading the code to miss the at() call below.
>>>>>>
>>>>> Sure. It would also be easy to swap the order of the transposes and the
>>>>> split_inst emission if you find it easier to understand that way. I'll
>>>>> just go do that instead of the comment if you have no objection.
>>>>>
>>>>>>>>> +
>>>>>>>>> + for (unsigned j = 0; j < inst->sources; j++) {
>>>>>>>>> + if (inst->src[j].file != BAD_FILE &&
>>>>>>>>> + !is_uniform(inst->src[j])) {
>>>>>>>>> + /* Get the i-th copy_width-wide chunk of the source. */
>>>>>>>>> + const fs_reg src = horiz_offset(inst->src[j], copy_width * i);
>>>>>>>>> + const unsigned src_size = inst->regs_read(j) * REG_SIZE /
>>>>>>>>> + inst->src[j].component_size(inst->exec_size);
>>>>>>>>> +
>>>>>>>>> + /* Use a trivial transposition to copy one every n
>>>>>>>>> + * copy_width-wide components of the register into a
>>>>>>>>> + * temporary passed as source to the lowered instruction.
>>>>>>>>> + */
>>>>>>>>> + split_inst->src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>>>>>> + emit_transpose(lbld.group(copy_width, 0).at(block, split_inst),
>>>>>>>>> + split_inst->src[j], &src, 1, src_size, n);
>>>>>>>>
>>>>>>>> Shouldn't this be group(copy_width, i)?
>>>>>>>>
>>>>>>>
>>>>>>> I'm already choosing the i-th channel group in the definition of lbld
>>>>>>> while emitting the split instruction, this group() call is just to make
>>>>>>> sure that the transposes don't copy data into the garbage channels of
>>>>>>> the temporary in cases where copy_width is less than lower_width.
>>>>>>
>>>>>> Right. That makes sense.
>>>>>>
>>>>>>>>> + }
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + if (inst->regs_written) {
>>>>>>>>> + /* Allocate enough space to hold the result of the lowered
>>>>>>>>> + * instruction and fix up the number of registers written.
>>>>>>>>> + */
>>>>>>>>> + split_inst->dst = dsts[i] =
>>>>>>>>> + lbld.vgrf(inst->dst.type, dst_size);
>>>>>>>>> + split_inst->regs_written =
>>>>>>>>> + DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>>>>> + inst->exec_size);
>>>>>>>>> + }
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + if (inst->regs_written) {
>>>>>>>>> + /* Distance between useful channels in the temporaries, skipping
>>>>>>>>> + * garbage if the lowered instruction is wider than the original.
>>>>>>>>> + */
>>>>>>>>> + const unsigned m = lower_width / copy_width;
>>>>>>>>> +
>>>>>>>>> + /* Interleave the components of the result from the lowered
>>>>>>>>> + * instructions. We need to set exec_all() when copying more than
>>>>>>>>> + * one half per component, because LOAD_PAYLOAD (in terms of which
>>>>>>>>> + * emit_transpose is implemented) can only use the same channel
>>>>>>>>> + * enable signals for all of its non-header sources.
>>>>>>>>> + */
>>>>>>>>> + emit_transpose(ibld.exec_all(inst->exec_size > copy_width)
>>>>>>>>> + .group(copy_width, 0),
>>>>>>>>> + inst->dst, dsts, n, dst_size, m);
>>>>>>
>>>>>> While we're on the exec_all topic, I don't think it's needed here
>>>>>> either. In the case where exec_size > copy_width, we're throwing away
>>>>>> the second half. Therefore, the channels with the wrong exec mask
>>>>>> will either not get emitted at all or dead-code will delete them (I'm
>>>>>> not sure which without thinking about it harder). In either case, the
>>>>>> MOV's that do matter will have the right exec mask and exec_all
>>>>>> doesn't gain us anything.
>>>>>
>>>>> Nope, the "exec_size > copy_width" case is the usual (non-gen4) case.
>>>>> Nothing can be thrown away in that case because both halves have to be
>>>>> copied interleaved into the destination register, so whatever execution
>>>>> mask LOAD_PAYLOAD uses it has to work for both halves, which is only
>>>>> possible if force_writemask_all is set.
>>>>
>>>> Right... Not a huge fan of exec_all'd LOAD_PAYLOAD, but we don't have
>>>> a ZIP instruction so I won't complain too much.
>>>>
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + inst->remove(block);
>>>>>>>>> + progress = true;
>>>>>>>>> + }
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + if (progress)
>>>>>>>>> + invalidate_live_intervals();
>>>>>>>>> +
>>>>>>>>> + return progress;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> void
>>>>>>>>> fs_visitor::dump_instructions()
>>>>>>>>> {
>>>>>>>>> @@ -3655,6 +3796,7 @@ fs_visitor::optimize()
>>>>>>>>> int iteration = 0;
>>>>>>>>> int pass_num = 0;
>>>>>>>>>
>>>>>>>>> + OPT(lower_simd_width);
>>>>>>>>> OPT(lower_logical_sends);
>>>>>>>>>
>>>>>>>>> do {
>>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>>> index f3850d1..9582648 100644
>>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>>>>>>>>> @@ -184,6 +184,7 @@ public:
>>>>>>>>> bool lower_load_payload();
>>>>>>>>> bool lower_logical_sends();
>>>>>>>>> bool lower_integer_multiplication();
>>>>>>>>> + bool lower_simd_width();
>>>>>>>>> bool opt_combine_constants();
>>>>>>>>>
>>>>>>>>> void emit_dummy_fs();
>>>>>>>>> --
>>>>>>>>> 2.4.3
>>>>>>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150723/e36535f0/attachment-0001.sig>
More information about the mesa-dev
mailing list