[Mesa-dev] [PATCH 15/23] i965/fs: add SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA helper

Thu May 12 05:24:15 UTC 2016

Francisco Jerez <currojerez at riseup.net> writes:

> Iago Toral <itoral at igalia.com> writes:
>
>> On Tue, 2016-05-10 at 19:10 -0700, Francisco Jerez wrote:
>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>> 
>>> > From: Iago Toral Quiroga <itoral at igalia.com>
>>> >
>>> > There are a few places where we need to shuffle the result of a 32-bit load
>>> > into valid 64-bit data, so extract this logic into a separate helper that we
>>> > can reuse.
>>> >
>>> > Also, the shuffling needs to operate with WE_all set, which we were missing
>>> > before, because we are changing the layout of the data across the various
>>> > channels. Otherwise we will run into problems in non-uniform control-flow
>>> > scenarios.
>>> > ---
>>> >  src/mesa/drivers/dri/i965/brw_fs.cpp     | 95 +++++++++++++++++++++-----------
>>> >  src/mesa/drivers/dri/i965/brw_fs.h       |  5 ++
>>> >  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 46 ++--------------
>>> >  3 files changed, 73 insertions(+), 73 deletions(-)
>>> >
>>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> > index dff13ea..709e4b8 100644
>>> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> > @@ -216,39 +216,8 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const fs_builder &bld,
>>> >  
>>> >     vec4_result.type = dst.type;
>>> >  
>>> > -   /* Our VARYING_PULL_CONSTANT_LOAD reads a vector of 32-bit elements. If we
>>> > -    * are reading doubles this means that we get this:
>>> > -    *
>>> > -    *  r0: x0 x0 x0 x0 x0 x0 x0 x0
>>> > -    *  r1: x1 x1 x1 x1 x1 x1 x1 x1
>>> > -    *  r2: y0 y0 y0 y0 y0 y0 y0 y0
>>> > -    *  r3: y1 y1 y1 y1 y1 y1 y1 y1
>>> > -    *
>>> > -    * Fix this up so we return valid double elements:
>>> > -    *
>>> > -    *  r0: x0 x1 x0 x1 x0 x1 x0 x1
>>> > -    *  r1: x0 x1 x0 x1 x0 x1 x0 x1
>>> > -    *  r2: y0 y1 y0 y1 y0 y1 y0 y1
>>> > -    *  r3: y0 y1 y0 y1 y0 y1 y0 y1
>>> > -    */
>>> > -   if (type_sz(dst.type) == 8) {
>>> > -      int multiplier = bld.dispatch_width() / 8;
>>> > -      fs_reg fixed_res =
>>> > -         fs_reg(VGRF, alloc.allocate(2 * multiplier), BRW_REGISTER_TYPE_F);
>>> > -      /* We only have 2 doubles in a 32-bit vec4 */
>>> > -      for (int i = 0; i < 2; i++) {
>>> > -         fs_reg vec4_float =
>>> > -            horiz_offset(retype(vec4_result, BRW_REGISTER_TYPE_F),
>>> > -                         multiplier * 16 * i);
>>> > -
>>> > -         bld.MOV(stride(fixed_res, 2), vec4_float);
>>> > -         bld.MOV(stride(horiz_offset(fixed_res, 1), 2),
>>> > -                 horiz_offset(vec4_float, 8 * multiplier));
>>> > -
>>> > -         bld.MOV(horiz_offset(vec4_result, multiplier * 8 * i),
>>> > -                 retype(fixed_res, BRW_REGISTER_TYPE_DF));
>>> > -      }
>>> > -   }
>>> > +   if (type_sz(dst.type) == 8)
>>> > +      SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA(bld, vec4_result, vec4_result, 2);
>>> >  
>>> >     int type_slots = MAX2(type_sz(dst.type) / 4, 1);
>>> >     bld.MOV(dst, offset(vec4_result, bld,
>>> > @@ -256,6 +225,66 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const fs_builder &bld,
>>> >  }
>>> >  
>>> >  /**
>>> > + * This helper takes the result of a load operation that reads 32-bit elements
>>> > + * in this format:
>>> > + *
>>> > + * x x x x x x x x
>>> > + * y y y y y y y y
>>> > + * z z z z z z z z
>>> > + * w w w w w w w w
>>> > + *
>>> > + * and shuffles the data to get this:
>>> > + *
>>> > + * x y x y x y x y
>>> > + * x y x y x y x y
>>> > + * z w z w z w z w
>>> > + * z w z w z w z w
>>> > + *
>>> > + * Which is exactly what we want if the load is reading 64-bit components
>>> > + * like doubles, where x represents the low 32-bit of the x double component
>>> > + * and y represents the high 32-bit of the x double component (likewise with
>>> > + * z and w for double component y). The parameter @components represents
>>> > + * the number of 64-bit components present in @src. This would typically be
>>> > + * 2 at most, since we can only fit 2 double elements in the result of a
>>> > + * vec4 load.
>>> > + *
>>> > + * Notice that @dst and @src can be the same register.
>>> > + */
>>> > +void
>>> > +fs_visitor::SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA(const fs_builder &bld,
>>> 
>>> I don't see any reason to make this an fs_visitor method.  Declare this
>>> as a static function local to brw_fs_nir.cpp what should improve
>>> encapsulation and reduce the amount of boilerplate.  Also please don't
>>> write it in capitals unless you want people to shout the name of your
>>> function while discussing out loud about it. ;)
>>> 
>>> > +                                                    const fs_reg dst,
>>> > +                                                    const fs_reg src,
>>> > +                                                    uint32_t components)
>>> > +{
>>> > +   int multiplier = bld.dispatch_width() / 8;
>>> 
>>> This definition is redundant with the changes below taken into account.
>>> 
>>> > +
>>> > +   /* A temporary that we will use to shuffle the 32-bit data of each
>>> > +    * component in the vector into valid 64-bit data
>>> > +    */
>>> > +   fs_reg tmp =
>>> > +      fs_reg(VGRF, alloc.allocate(2 * multiplier), BRW_REGISTER_TYPE_F);
>>> 
>>> I don't think there is any reason to do this inside a temporary instead
>>> of writing into the destination register directly.
>>> 
>>> > +
>>> > +   /* We are going to manipulate the data in elements of 32-bit */
>>> > +   fs_reg src_data = retype(src, BRW_REGISTER_TYPE_F);
>>> > +
>>> > +   /* We are going to manipulate the dst in elements of 64-bit */
>>> > +   fs_reg dst_data = retype(dst, BRW_REGISTER_TYPE_DF);
>>> > +
>>> 
>>> I would turn these into assertions on the register types rather than
>>> overriding the type passed by the caller -- If the caller didn't intend
>>> 'dst' to be a vector of 64bit element type or if it didn't intend 'src'
>>> to be a vector of 32bit type this will simply ignore what the caller's
>>> intention and emit incorrect code.  We should probably kill the program
>>> with an assertion failure instead so the problem doesn't go unnoticed.
>>
>> I've been thinking a bit more about this and I am not convinced about
>> changing these retypes to asserts. My thinking is that these retypes are
>> only necessary to support the shuffling logic implemented here, so it is
>
> The point is that the logic you implement imposes certain restrictions
> on the layout of the source and destination: The source of the
> instruction *must* already be an array of 2*n 32-bit things, while the
> destination *must* be an array of n 64-bit things, if the user does it
> backwards (e.g. because it's using the wrong SHUFFLE function -- I admit
> the names confused me somewhat until I had read through and understood
> the definition of both functions) you'll end up reading from the wrong
> vector components and miscompiling the program silently.  That's why I
> was thinking that it would be more useful to do something like:
>
> | assert(type_sz(src.type) == 4 && type_sz(dst.type) == 8);
>
> and the converse in the other SHUFFLE function.  Once you've done that I
> don't think it's necessary to impose any restrictions on the actual type
> -- It looks like the exact same function will be useful to handle 64
> integer types as well.
>
>> kind of an internal helper more than a requirement on the input types.
>> In fact, we call this function with src and dst being the same in
>> various places because in the end we just want the exact same data only
>> that shuffled around in a specific way. If we force specific types on
>> the dst and src I think we are going to end up having to retype in the
>> calls we do to the shuffling functions only so we fit the requirement of
>> the assert and not really because that is what we want to do (from the
>> perspective of the caller).
>>
> Yes please, typing the arguments explicitly at the caller sounds like an
> improvement since it would make it clear how your data is laid out in
> each of the registers passed in as arguments.
>
BTW, it's not like it will cost you anything to retype at the caller,
because the destination of the SHUFFLE instruction *must* be being
treated by the caller as a vector of 64bit element type (Which means the
fs_reg will have a 64 bit type already *or* it will be retyped to a 64
bit type immediately afterwards anyway -- If you do neither of these you
most likely have a bug).  The same goes for the source (with a 32bit
type instead).  My suggestion above should just give you the guarantee
that caller and callee agree on the component layout of the arguments in
all cases.

>> FWIW, if someone called this with actual 32-bit data instead of 64-bit
>> data, that is going to break because the code will process more
>> components than what is available in the source or destination regions
>> and that would raise an assertion eventually because we produce reads or
>> writes that go beyond the VGRF space allocated for these registers. So
>> in the cases you mention I think we would be getting an assertion
>> anyway.
>>
> In the scenario I had in mind (calling the wrong SHUFFLE function) that
> wouldn't help because both source and destination have the same overall
> amount of data which is laid out as a different number of components of
> different types.
>
>>> > +   /* Shuffle the data */
>>> > +   for (unsigned i = 0; i < components; i++) {
>>> > +      fs_reg component_i = horiz_offset(src_data, multiplier * 16 * i);
>>> 
>>> Mark as 'const'.  And horiz_offset() isn't really meant to step out of
>>> the given SIMD-wide vector, it will work or not depending on what the
>>> stride value is.  You should be using offset(src_data, bld, 2 * i)
>>> instead which takes the SIMD width into account correctly for you.
>>> 
>>> > +
>>> > +      bld.MOV(stride(tmp, 2), component_i)->force_writemask_all = true;
>>> 
>>> Looks like you still need to update this to use subscript().  The
>>> force_writemask_all flag shouldn't be necessary because you won't be
>>> crossing channel boundaries at all.
>>> 
>>> > +      bld.MOV(stride(horiz_offset(tmp, 1), 2),
>>> > +              horiz_offset(component_i, 8 * multiplier))
>>> > +                 ->force_writemask_all = true;
>>> > +
>>> 
>>> Use 'subscript(...)' instead of 'stride(...)', 'offset(component_i, bld,
>>> 1)' instead of 'horiz_offset(...)', and drop the force_writemask_all.
>>> 
>>> 
>>> > +      bld.MOV(horiz_offset(dst_data, multiplier * 8 * i),
>>> > +              retype(tmp, BRW_REGISTER_TYPE_DF))->force_writemask_all = true;
>>> 
>>> Use 'offset(dst, bld, i)' instead of 'horiz_offset()' to index the
>>> destination, and fold it into the instructions above (or declare a
>>> temporary as you did for component_i).
>>> 
>>> I notice that the comments above amount to an almost full rewrite of the
>>> patch -- Would you mind sending a v2 before I give my R-b?
>>> 
>>> Thanks.
>>> 
>>> > +   }
>>> > +}
>>> > +
>>> > +/**
>>> >   * A helper for MOV generation for fixing up broken hardware SEND dependency
>>> >   * handling.
>>> >   */
>>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>>> > index 1f18112..c95b30a 100644
>>> > --- a/src/mesa/drivers/dri/i965/brw_fs.h
>>> > +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>>> > @@ -106,6 +106,11 @@ public:
>>> >                                     uint32_t const_offset);
>>> >     void DEP_RESOLVE_MOV(const brw::fs_builder &bld, int grf);
>>> >  
>>> > +   void SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA(const brw::fs_builder &bld,
>>> > +                                                const fs_reg dst,
>>> > +                                                const fs_reg src,
>>> > +                                                uint32_t components);
>>> > +
>>> >     bool run_fs(bool do_rep_send);
>>> >     bool run_vs(gl_clip_plane *clip_planes);
>>> >     bool run_tes();
>>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> > index 5a12d63..9cd751b 100644
>>> > --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> > +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> > @@ -3082,44 +3082,19 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>> >           for (int i = 0; i < instr->num_components; i++)
>>> >              bld.MOV(offset(dest, bld, i), offset(read_result, bld, i));
>>> >        } else {
>>> > -         /* Reading a dvec, then the untyped read below gives us this:
>>> > +         /* Reading a dvec, so we need to:
>>> >            *
>>> > -          * x0 x0 x0 x0 x0 x0 x0 x0
>>> > -          * x1 x1 x1 x1 x1 x1 x1 x1
>>> > -          * y0 y0 y0 y0 y0 y0 y0 y0
>>> > -          * y1 y1 y1 y1 y1 y1 y1 y1
>>> > -          *
>>> > -          * But that is not valid 64-bit data, instead we want:
>>> > -          *
>>> > -          * x0 x1 x0 x1 x0 x1 x0 x1
>>> > -          * x0 x1 x0 x1 x0 x1 x0 x1
>>> > -          * y0 y1 y0 y1 y0 y1 y0 y1
>>> > -          * y0 y1 y0 y1 y0 y1 y0 y1
>>> > -          *
>>> > -          * Also, that would only load half of a dvec4. So, we have to:
>>> > -          *
>>> > -          * 1. Multiply num_components by 2, to account for the fact that we need
>>> > -          *    to read 64-bit components.
>>> > +          * 1. Multiply num_components by 2, to account for the fact that we
>>> > +          *    need to read 64-bit components.
>>> >            * 2. Shuffle the result of the load to form valid 64-bit elements
>>> >            * 3. Emit a second load (for components z/w) if needed.
>>> > -          *
>>> > -          * FIXME: extract the shuffling logic and share it with
>>> > -          *        varying_pull_constant_load
>>> >            */
>>> > -         int multiplier = bld.dispatch_width() / 8;
>>> > -
>>> >           fs_reg read_offset = vgrf(glsl_type::uint_type);
>>> >           bld.MOV(read_offset, offset_reg);
>>> >  
>>> >           int num_components = instr->num_components;
>>> >           int iters = num_components <= 2 ? 1 : 2;
>>> >  
>>> > -         /* A temporary that we will use to shuffle the 32-bit data of each
>>> > -          * component in the vector into valid 64-bit data
>>> > -          */
>>> > -         fs_reg tmp =
>>> > -            fs_reg(VGRF, alloc.allocate(2 * multiplier), BRW_REGISTER_TYPE_F);
>>> > -
>>> >           /* Load the dvec, the first iteration loads components x/y, the second
>>> >            * iteration, if needed, loads components z/w
>>> >            */
>>> > @@ -3138,18 +3113,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>> >              read_result.type = BRW_REGISTER_TYPE_F;
>>> >  
>>> >              /* Shuffle the 32-bit load result into valid 64-bit data */
>>> > -            int multiplier = bld.dispatch_width() / 8;
>>> > -            for (int i = 0; i < iter_components; i++) {
>>> > -               fs_reg component_i =
>>> > -                  horiz_offset(read_result, multiplier * 16 * i);
>>> > -
>>> > -               bld.MOV(stride(tmp, 2), component_i);
>>> > -               bld.MOV(stride(horiz_offset(tmp, 1), 2),
>>> > -                       horiz_offset(component_i, 8 * multiplier));
>>> > -
>>> > -               bld.MOV(horiz_offset(dest, multiplier * 8 * (i + it * 2)),
>>> > -                       retype(tmp, BRW_REGISTER_TYPE_DF));
>>> > -            }
>>> > +            SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA(bld,
>>> > +                                                    offset(dest, bld, it * 2),
>>> > +                                                    read_result, iter_components);
>>> >  
>>> >              bld.ADD(read_offset, read_offset, brw_imm_ud(16));
>>> >           }
>>> > -- 
>>> > 2.5.0
>>> >
>>> > _______________________________________________
>>> > mesa-dev mailing list
>>> > mesa-dev at lists.freedesktop.org
>>> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160511/4b0d7568/attachment-0001.sig>