[Mesa-dev] [PATCH 05/22] i965/fs: consider execsize can be duplicated in lower_simd_with

Wed Jan 11 08:50:31 UTC 2017

On Tue, 2017-01-10 at 14:31 -0800, Francisco Jerez wrote:
> "Juan A. Suarez Romero" <jasuarez at igalia.com> writes:
> 
> > On Mon, 2017-01-09 at 15:41 -0800, Francisco Jerez wrote:
> > > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> > > 
> > > > From: "Juan A. Suarez Romero" <jasuarez at igalia.com>
> > > > 
> > > > In IVB/VLV, for instructions dealing with DF, execsize will be
> > > > duplicated in the final code.
> > > > 
> > > > So take this in account when checking if instructions should be
> > > > split.
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_fs.cpp | 13 ++++++++++++-
> > > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > index 78f2124..cfce364 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > @@ -4551,8 +4551,15 @@ static unsigned
> > > >  get_fpu_lowered_simd_width(const struct gen_device_info *devinfo,
> > > >                             const fs_inst *inst)
> > > >  {
> > > > +   /* Note that in IVB/VLV for instructions that handles DF, we
> > > > will duplicate
> > > > +    * the exec_size. So take this value for calculus purposes.
> > > > +    */
> > > > +   unsigned exec_size = inst->exec_size;
> > > > +   if (devinfo->gen == 7 && !devinfo->is_haswell && inst-
> > > > > exec_data_size() == 8)
> > > > 
> > > > +      exec_size *= 2;
> > > > +
> > > 
> > > NAK.  This won't have any useful effect because the max_width value
> > > doesn't actually have any influence in the regioning restriction
> > > calculations below, it's only used to accumulate the intermediate
> > > execution size limits derived from each hardware restriction.  I
> > > don't
> > > think you need to change any of the restrictions below for the result
> > > to
> > > be correct in presence of exec size doubling.
> > 
> > 
> > 
> > Yes, you're right.
> > 
> > Actually the motivation for the patch was to avoid ending up with a
> > final exec_size > 8  (after doubling it in the generator) when we are
> > generating a SIMD8 shader (or exec_size > 16 in SIMD16 shader).
> > 
> > So if we have a exec_size == 8, and we are generating SIMD8 code, then
> > we want to split in two instructions of exec_size==4, because at the
> > end we will double the exec_size to 8. Same reasoning with SIMD16 code.
> > 
> 
> You're fine with such execution sizes as long asyou don't violate the 2
> GRFs per source or destination region hardware limit (and a few other
> crazy restrictions) which are only affected by the *effective* execution
> size of the instruction and don't care at all about the units the
> execution size is expressed in at the machine code level (e.g. the total
> amount of data accessed by a source region of the instruction is the
> same regardless of whether you consider it to be a sequence of N
> elements of size S or a sequence of 2*N elements of size S/2).
> 

OK, I get it. Then, let's drop this patch.

Thanks for the feedback.

> > 
> > This could be done as:
> > 
> > @@ -5119,7 +5119,11 @@ fs_visitor::lower_simd_width()
> >     bool progress = false;
> >  
> >     foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
> > -      const unsigned lower_width = get_lowered_simd_width(devinfo,
> > inst);
> > +      unsigned lower_width = get_lowered_simd_width(devinfo, inst);
> > +
> > +      if (devinfo->gen == 7 && !devinfo->is_haswell &&
> > +          (get_exec_type_size(inst) == 8 || type_sz(inst->dst.type) ==
> > 8))
> > +         lower_width = MIN2(lower_width, dispatch_width / 2);
> > 
> 
> That wouldn't do any good...  Let's just drop this patch.
> 

OK, let's drop the entire patch.

> > 
> > Now, this is all in theory. So far, no test has exposed this
> > situation with this change of top of our branch, and the lower_width
> > returned by get_lowered_simd_width() is always less than 8. But maybe
> > this could happen in future, for instance, if we emit an instruction
> > that writes more than 4 DF and has force_writemask_all == TRUE.
> > 
> > > 
> > > >     /* Maximum execution size representable in the instruction controls. */
> > > > -   unsigned max_width = MIN2(32, inst->exec_size);
> > > > +   unsigned max_width = MIN2(32, exec_size);
> > > >  
> > > >     /* According to the PRMs:
> > > >      *  "A. In Direct Addressing mode, a source cannot span more than 2
> > > > @@ -4656,6 +4663,10 @@ get_fpu_lowered_simd_width(const struct gen_device_info *devinfo,
> > > >           max_width = MIN2(max_width, channels_per_grf);
> > > >     }
> > > >  
> > > > +   /* If we have duplicated exec_size, then readjust max_width if required. */
> > > > +   if (exec_size != inst->exec_size && max_width == exec_size)
> > > > +      max_width = inst->exec_size;
> > > > +
> > > >     /* Only power-of-two execution sizes are representable in the instruction
> > > >      * control fields.
> > > >      */
> > > > -- 
> > > > 2.9.3
> > > >