[Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

Iago Toral itoral at igalia.com
Fri Dec 14 07:47:23 UTC 2018


On Thu, 2018-12-13 at 12:49 +0200, Pohjolainen, Topi wrote:
> On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
> > On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
> > > On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> > > > On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > > > > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> > > > > wrote:
> > > > > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral
> > > > > > Quiroga
> > > > > > wrote:
> > > > > > > This function is used in two different scenarios that for
> > > > > > > 32-
> > > > > > > bit
> > > > > > > instructions are the same, but for 16-bit instructions
> > > > > > > are
> > > > > > > not.
> > > > > > > 
> > > > > > > One scenario is that in which we are working at a SIMD8
> > > > > > > register
> > > > > > > level and we need to know if a register is fully defined
> > > > > > > or
> > > > > > > written.
> > > > > > > This is useful, for example, in the context of liveness
> > > > > > > analysis
> > > > > > > or
> > > > > > > register allocation, where we work with units of
> > > > > > > registers.
> > > > > > > 
> > > > > > > The other scenario is that in which we want to know if an
> > > > > > > instruction
> > > > > > > is writing a full scalar component or just some subset of
> > > > > > > it.
> > > > > > > This is
> > > > > > > useful, for example, in the context of some optimization
> > > > > > > passes
> > > > > > > like copy propagation.
> > > > > > > 
> > > > > > > For 32-bit instructions (or larger), a SIMD8 dispatch
> > > > > > > will
> > > > > > > always
> > > > > > > write
> > > > > > > at least a full SIMD8 register (32B) if the write is not
> > > > > > > partial.
> > > > > > > The
> > > > > > > function is_partial_write() checks this to determine if
> > > > > > > we
> > > > > > > have a
> > > > > > > partial
> > > > > > > write. However, when we deal with 16-bit instructions,
> > > > > > > that
> > > > > > > logic
> > > > > > > disables
> > > > > > > some optimizations that should be safe. For example, a
> > > > > > > SIMD8
> > > > > > > 16-
> > > > > > > bit MOV will
> > > > > > > only update half of a SIMD register, but it is still a
> > > > > > > complete
> > > > > > > write of the
> > > > > > > variable for a SIMD8 dispatch, so we should not prevent
> > > > > > > copy
> > > > > > > propagation in
> > > > > > > this scenario because we don't write all 32 bytes in the
> > > > > > > SIMD
> > > > > > > register
> > > > > > > > or because the write starts at offset 16B (where we pack
> > > > > > > > components Y or W of 16-bit vectors).
> > > > > > > 
> > > > > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS)
> > > > > > > of
> > > > > > > 16-
> > > > > > > bit
> > > > > > > > instructions, which lose a number of optimizations because of
> > > > > > > > this, the most important of which is copy propagation.
> > > > > > > 
> > > > > > > This patch splits is_partial_write() into
> > > > > > > is_partial_reg_write(),
> > > > > > > which
> > > > > > > represents the current is_partial_write(), useful for
> > > > > > > things
> > > > > > > like
> > > > > > > liveness analysis, and is_partial_var_write(), which
> > > > > > > considers
> > > > > > > the dispatch size to check if we are writing a full
> > > > > > > variable
> > > > > > > (rather
> > > > > > > than a full register) to decide if the write is partial
> > > > > > > or
> > > > > > > not,
> > > > > > > which
> > > > > > > is what we really want in many optimization passes.
> > > > > 
> > > > > I actually started wondering why liveness analysis and register
> > > > > coalescing would need to treat the 16-bit SIMD8 case differently
> > > > > from the optimization passes.
> > > > > In virtual register space nothing would read or write the
> > > > > unused
> > > > > second half
> > > > > of the register in case of 16-bit type and SIMD8.
> > > > 
> > > > True, we might be able to use the "variable" version in more
> > > > cases.
> > > > I
> > > > was trying to be conservative when I implemented this because I
> > > > don't
> > > > think the half-float CTS tests provide a good testing ground
> > > > for
> > > > all
> > > > aspects of the compiler. I can try that and see if it breaks
> > > > anything
> > > > though.
> > > >  
> > > > > Real register allocation in turn should be orthogonal to how things
> > > > > are allocated in virtual space. And I guess even there we wouldn't
> > > > > be interested in packing two 16-bit typed SIMD8 variables into one
> > > > > and the same hardware register. It is SIMD16 where we get more
> > > > > pressure on register space anyway, and there the 16-bit typed
> > > > > variables occupy full registers. In other words, if things fit in
> > > > > SIMD16, would we bother packing things more tightly in SIMD8? Or
> > > > > even if SIMD8 was the only option, would we be interested in
> > > > > packing channels for two variables into one hw reg even then?
> > > > > 
> > > > > Jason, we discussed this a little in the spring time.
> > > > > 
> > > > > As a short recap of my approach: instead of ignoring the second
> > > > > half of the registers case by case, I addressed it more generally:
> > > > > 
> > > > > - changed all the open coded checks to use helpers,
> > > > > - added a padding bit into fs_reg telling about the unused space,
> > > > > - changed the nir -> fs step to set that bit for 16-bit typed regs,
> > > > > - and finally changed the helpers to consider the padding bit.
> > > > 
> > > > So if I understand how this works, you mostly make the vgrf
> > > > infrastructure think that half-float registers actually use
> > > > twice
> > > > the
> > > > space they require by including the padding into the
> > > > component_size()
> > > > helper, right? (We also need to cover 8-bit by the way, but I
> > > > suppose
> > > > it just means increasing the padding we use for 8-bit
> > > > variables).
> > > > And then you modify is_partial_write() to use that helper so that it
> > > > accounts for that padding and just checks whether the (padded) size
> > > > is less than a full SIMD register.
> > > > look
> > > > like they use 32-bit per channel in the vgrf space.
> > > 
> > > Actually not 32-bit per channel but that there is unused data
> > > towards
> > > the
> > > end of the register, cut-paste from the diff:
> > > 
> > > diff --git a/src/intel/compiler/brw_ir_fs.h
> > > b/src/intel/compiler/brw_ir_fs.h
> > > index 02c91b246a3..a1add69a319 100644
> > > --- a/src/intel/compiler/brw_ir_fs.h
> > > +++ b/src/intel/compiler/brw_ir_fs.h
> > > @@ -52,6 +52,9 @@ public:
> > >  
> > >     /** Register region horizontal stride */
> > >     uint8_t stride;
> > > +
> > > +   /* Needed, for example, for SIMD8 half float payloads. */
> > > +   uint8_t pad_per_component;
> > >  };
> > >  
> > >  static inline fs_reg
> > > 
> > > 
> > > I should start with the history. Payloads for sampler messages,
> > > sampler
> > > return
> > > data and data port writes operate with full registers even in
> > > case of
> > > SIMD8.
> > > For example, in case of data port write, each component has all
> > > the
> > > channels
> > > packed in the beginning of the reg. The second half is simply
> > > ignored
> > > by the
> > > hardware. The hardware just expects each component to be aligned
> > > on
> > > reg
> > > boundary. The same applies for 16-bit sampler message
> > > coordinates,
> > > each
> > > component takes full reg and only the first half contains
> > > meaningful
> > > data.
> > > And when the hardware returns 16-bit typed data from the sampler
> > > in
> > > SIMD8 the
> > > data is once again padded. 
> > > This is where the notion of padding originally comes from.
> > > 
> > > But then I started thinking why stop at payloads only - we could
> > > extend the
> > > same for all 16-bit typed variables in SIMD8. I thought that if
> > > the
> > > hardware
> > > wants to use padding for things such as sampler data coming in
> > > and
> > > pixel data
> > > going out then why should we use tighter packing for anything
> > > else -
> > > at least
> > > not in the virtual register space.
> > > 
> > > And as an answer to the question below: the padding is only used in
> > > SIMD8 mode; SIMD16 doesn't need any of that for 16-bit types, I mean.
> > 
> > Ok, so you just add the padding to the total size and only for
> > SIMD8
> > dispatches of 16-bit.
> > 
> > I guess we also need to account for horizontal strides right? For
> > example, a SIMD8 half-float register with a stride of 2 uses a full
> > register and should not have any padding in that case. We will have
> > this for half-float when there are alignment restrictions (for
> > conversions, etc). We also have this for pretty much all 8-bit
> > instructions, since there is a 16-bit alignment requirement for
> > pretty much everything in that case.
> > 
> > Also, what would happen if someone changes the stride of the register
> > (or retypes it)? Would the size calculations adapt the padding
> > correspondingly and take it into account automatically?
> 
> Well, this is just my way of seeing things. I thought that marking
> padding
> explicitly in the fs_reg makes all sorts of asserts and debugging
> clearer
> as the padding tells explicitly that at least the allocator of the
> reg meant
> that only the first half is ever written (and hence only the first
> half should
> be read). Now, the logic that adds workarounds for special cases,
> such as the
> stride you mentioned, needs to modify the register (at least the
> stride) for
> producer and every consumer. Now that logic would additionally need
> to reset
> the padding as it doesn't apply anymore. Adding an assert allowing
> only
> padding or stride != 1 but not both sounds like a thing to add on
> top.

I don't know how I feel about this. I mean, we manipulate strides often
in the backend and having to account for this every time we do it, just
for 8-bit and 16-bit registers and only for SIMD8 dispatches sounds
like a big price to pay to me. I guess that most of the time we
manipulate strides through helpers such as horiz_stride(), subscript()
and maybe others so we can try to add that logic in all these places,
but there are other places where we manipulate the stride fields
directly, and I imagine that this is something that people working in
the backend expect to be able to do freely. To me it feels like an
important restriction to always have to think about these things when
you manipulate strides... I agree that having asserts will help detect
cases where we do things like this without attending to the padding
requirements, but people would still be annoyed when they hit them I
guess, and I figure that at least for FS-specific paths it is still
relatively easy to miss the asserts, since they would probably only
trigger for SIMD8 executions, which are less common than SIMD16.
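
Just to make what I mean more concrete, here is a quick sketch on top of
your pad_per_component field (the helper below is made up, it is not
something we have today):

   /* Hypothetical helper next to the other fs_reg helpers: anything
    * that re-strides a padded SIMD8 half-float/byte region would also
    * have to drop the padding, because once the data is spread with a
    * stride the "unused second half" layout no longer applies.
    */
   static inline fs_reg
   set_stride(fs_reg reg, unsigned stride)
   {
      reg.stride = stride;
      if (stride != 1)
         reg.pad_per_component = 0;
      return reg;
   }

and we would need the same fix-up (or at least something like
assert(reg.stride == 1 || reg.pad_per_component == 0)) in every place
that pokes at reg.stride directly, which is the part that worries me.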

And this is not only about strides: we would also need to consider the
implications for retypes to types of different bit-sizes, etc.

Anyway, I am sorry if I am sounding too negative about this, I just
can't help feeling uneasy about the potential implications at the
moment. Maybe I just need to see how this all works on top of all the
8-bit and 16-bit stuff we are doing...
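
Just as a reminder of the numbers for the case the patch is about (a
SIMD8 half-float MOV with dst.offset 0 and stride 1), using the two
helpers from the patch quoted below:

   exec_size * type_sz(dst.type) = 8 * 2 = 16 bytes
   is_partial_reg_write():  16 < REG_SIZE (32)                -> partial
   is_partial_var_write(8): var_size = MIN2(32, 8 * 2) = 16
                            16 < 16 is false, 0 % 16 == 0     -> complete

so only the "var" variant lets passes like copy propagation treat that
MOV as a full write of the variable.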

> > 
> > I guess that in general I have two types of concerns; one is
> > whether
> > this is robust enough: on the one hand people can still try to open
> > code size calculations in future work and they are very likely not to
> > get it right for SIMD8 with non-32-bit types if they do this.
> > I
> > guess this can be fixed with very careful reviews about these
> > things in
> > the future but I think this is going to be easy to miss. The other
> > one
> > is that as I mention above, we would be returning fake (padded)
> > sizes
> > and then using these sizes for things where we are expecting the
> > real
> > size, such as when checking for hardware restrictions, etc. I am
> > less
> > worried about this issue though, since as long as we keep this for
> > SIMD8 only I don't think that using padded sizes instead of real
> > sizes
> > would usually lead to different behaviors, as we usually only run
> > into
> > limitations when things exceed the size of a SIMD register, but I
> > am
> > not sure if this is always true.
> > 
> > > Now, I haven't thought about 8-bit types. At least there aren't
> > > hardware
> > > supported payloads for those, right?
> > 
> > We need to do UBO load and SSBO load/store with 8-bit types. Chema
> > implemented that so maybe he can share more details about what is
> > needed here and if there are any particular restrictions or not.
> 
> I had a quick look and initially it seems to me that those payloads
> are quite
> different from sampler/fb write. They are rather fixed to 32-bit
> slots and
> instead of any padding one interleaves things. I'll try to study this
> a little
> more...
> 
> > 
> > > > 
> > > > One thing that worries me a bit is that we are faking the real
> > > > size
> > > > of
> > > > things and that can lead to parts of the compiler that rely on
> > > > these
> > > > helpers to provide the real size to do things we don't want.
> > > > For
> > > > example: are you limiting this to SIMD8 only? I ask because if
> > > > we
> > > > do
> > > > > > this for SIMD16 16-bit dispatches then we'd be computing a size of
> > > > two
> > > > registers for things that only really use one, and I don't
> > > > think we
> > > > want that since that would have actual implications like SIMD-
> > > > width
> > > > lowering kicking in for things it should not need to. There are
> > > > a
> > > > lot
> > > > of uses of the component_size helper in other parts of the
> > > > compiler,
> > > > including the generator, where we want to know the real size of
> > > > things
> > > > because they use these sizes to check hardware restrictions so
> > > > we
> > > > want
> > > > to know the real size without padding...
> > > > 
> > > > Also, I presume that all this is orthogonal to actual register
> > > > allocation and that you are still storing the Y and W
> > > > components of
> > > > SIMD8 16-bit vectors in the second halves of registers, right?
> > > > 
> > > > > Now, I'm fine doing this case by case the way it is done
> > > > > here.
> > > > > I'm
> > > > > just
> > > > > wondering if the split is needed, i.e., considering in some
> > > > > cases
> > > > > 16-
> > > > > bit SIMD8
> > > > > virtual registers as half written and in some cases just
> > > > > ignoring
> > > > > the
> > > > > fact.
> > > > > > > Then the patch goes on and rewrites all uses of
> > > > > > > is_partial_write() to use
> > > > > > > one or the other version. Specifically, we use
> > > > > > > is_partial_var_write()
> > > > > > > in the following places: copy propagation, cmod
> > > > > > > propagation,
> > > > > > > common
> > > > > > > subexpression elimination, saturate propagation and sel
> > > > > > > peephole.
> > > > > > > 
> > > > > > > Notice that the semantics of is_partial_var_write()
> > > > > > > exactly
> > > > > > > match
> > > > > > > the
> > > > > > > current implementation of is_partial_write() for anything
> > > > > > > that is
> > > > > > > 32-bit or larger, so no changes are expected for 32-bit
> > > > > > > instructions.
> > > > > > > 
> > > > > > > > Testing against ~5000 tests involving 16-bit instructions in the
> > > > > > > > CTS produced the following changes in instruction counts:
> > > > > > > 
> > > > > > >             Patched  |     Master    |    %    |
> > > > > > > ================================================
> > > > > > > SIMD8  |    621,900  |    706,721    | -12.00% |
> > > > > > > ================================================
> > > > > > > SIMD16 |     93,252  |     93,252    |   0.00% |
> > > > > > > ================================================
> > > > > > > 
> > > > > > > As expected, the change only affects SIMD8 dispatches.
> > > > > > 
> > > > > > I like this. But I think I want to try and rebase my fp16 work on
> > > > > > top to see if there are any differences in the final assembly
> > > > > > between this and my "register padding" scheme.
> > > > > > 
> > > > > > > ---
> > > > > > >  src/intel/compiler/brw_fs.cpp                 | 31
> > > > > > > +++++++++++++++----
> > > > > > >  .../compiler/brw_fs_cmod_propagation.cpp      | 20
> > > > > > > ++++++---
> > > > > > > ---
> > > > > > >  .../compiler/brw_fs_copy_propagation.cpp      |  8 ++---
> > > > > > >  src/intel/compiler/brw_fs_cse.cpp             |  3 +-
> > > > > > >  .../compiler/brw_fs_dead_code_eliminate.cpp   |  2 +-
> > > > > > >  src/intel/compiler/brw_fs_live_variables.cpp  |  2 +-
> > > > > > >  src/intel/compiler/brw_fs_reg_allocate.cpp    |  2 +-
> > > > > > >  .../compiler/brw_fs_register_coalesce.cpp     |  2 +-
> > > > > > >  .../compiler/brw_fs_saturate_propagation.cpp  |  7 +++--
> > > > > > >  src/intel/compiler/brw_fs_sel_peephole.cpp    |  4 +--
> > > > > > >  src/intel/compiler/brw_ir_fs.h                |  3 +-
> > > > > > >  11 files changed, 54 insertions(+), 30 deletions(-)
> > > > > > > 
> > > > > > > > diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> > > > > > > > index 1d5d1dd0d22..9ea67975e1e 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs.cpp
> > > > > > > > @@ -698,14 +698,33 @@ fs_visitor::limit_dispatch_width(unsigned n, const char *msg)
> > > > > > > >   * it.
> > > > > > > >   */
> > > > > > > >  bool
> > > > > > > > -fs_inst::is_partial_write() const
> > > > > > > > +fs_inst::is_partial_reg_write() const
> > > > > > > >  {
> > > > > > > >     return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
> > > > > > > > -           (this->exec_size * type_sz(this->dst.type)) < 32 ||
> > > > > > > >             !this->dst.is_contiguous() ||
> > > > > > > > +           (this->exec_size * type_sz(this->dst.type)) < REG_SIZE ||
> > > > > > > >             this->dst.offset % REG_SIZE != 0);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +/**
> > > > > > > > + * Returns true if the instruction has a flag that means it won't
> > > > > > > > + * update an entire variable for the given dispatch width.
> > > > > > > > + *
> > > > > > > > + * This is only different from is_partial_reg_write() for SIMD8
> > > > > > > > + * dispatches of 16-bit (or smaller) instructions.
> > > > > > > > + */
> > > > > > > > +bool
> > > > > > > > +fs_inst::is_partial_var_write(uint32_t dispatch_width) const
> > > > > > > > +{
> > > > > > > > +   const uint32_t type_size = type_sz(this->dst.type);
> > > > > > > > +   uint32_t var_size = MIN2(REG_SIZE, dispatch_width * type_size);
> > > > > > > > +
> > > > > > > > +   return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
> > > > > > > > +           !this->dst.is_contiguous() ||
> > > > > > > > +           (this->exec_size * type_sz(this->dst.type)) < var_size ||
> > > > > > > > +           this->dst.offset % var_size != 0);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  unsigned
> > > > > > > >  fs_inst::components_read(unsigned i) const
> > > > > > > >  {
> > > > > > > > @@ -2847,7 +2866,7 @@ fs_visitor::opt_register_renaming()
> > > > > > > >        if (depth == 0 &&
> > > > > > > >            inst->dst.file == VGRF &&
> > > > > > > >            alloc.sizes[inst->dst.nr] * REG_SIZE == inst->size_written &&
> > > > > > > > -          !inst->is_partial_write()) {
> > > > > > > > +          !inst->is_partial_reg_write()) {
> > > > > > > >           if (remap[dst] == -1) {
> > > > > > > >              remap[dst] = dst;
> > > > > > > >           } else {
> > > > > > > > @@ -3050,7 +3069,7 @@ fs_visitor::compute_to_mrf()
> > > > > > > >        next_ip++;
> > > > > > > >  
> > > > > > > >        if (inst->opcode != BRW_OPCODE_MOV ||
> > > > > > > > -          inst->is_partial_write() ||
> > > > > > > > +          inst->is_partial_reg_write() ||
> > > > > > > >           inst->dst.file != MRF || inst->src[0].file != VGRF ||
> > > > > > > >           inst->dst.type != inst->src[0].type ||
> > > > > > > >           inst->src[0].abs || inst->src[0].negate ||
> > > > > > > > @@ -3083,7 +3102,7 @@ fs_visitor::compute_to_mrf()
> > > > > > > >              * that writes that reg, but it would require smarter
> > > > > > > >              * tracking.
> > > > > > > >              */
> > > > > > > > -           if (scan_inst->is_partial_write())
> > > > > > > > +           if (scan_inst->is_partial_reg_write())
> > > > > > > >                break;
> > > > > > > >  
> > > > > > > >              /* Handling things not fully contained in the source of the copy
> > > > > > > > @@ -3395,7 +3414,7 @@ fs_visitor::remove_duplicate_mrf_writes()
> > > > > > > >        if (inst->opcode == BRW_OPCODE_MOV &&
> > > > > > > >           inst->dst.file == MRF &&
> > > > > > > >           inst->src[0].file != ARF &&
> > > > > > > > -         !inst->is_partial_write()) {
> > > > > > > > +         !inst->is_partial_reg_write()) {
> > > > > > > >           last_mrf_move[inst->dst.nr] = inst;
> > > > > > > >        }
> > > > > > > >     }
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_cmod_propagation.cpp b/src/intel/compiler/brw_fs_cmod_propagation.cpp
> > > > > > > > index 5fb522f810f..7bb5c9afbc9 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_cmod_propagation.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_cmod_propagation.cpp
> > > > > > > > @@ -50,13 +50,13 @@
> > > > > > > >  
> > > > > > > >  static bool
> > > > > > > >  cmod_propagate_cmp_to_add(const gen_device_info *devinfo, bblock_t *block,
> > > > > > > > -                          fs_inst *inst)
> > > > > > > > +                          fs_inst *inst, unsigned dispatch_width)
> > > > > > > >  {
> > > > > > > >     bool read_flag = false;
> > > > > > > >  
> > > > > > > >     foreach_inst_in_block_reverse_starting_from(fs_inst, scan_inst, inst) {
> > > > > > > >        if (scan_inst->opcode == BRW_OPCODE_ADD &&
> > > > > > > > -          !scan_inst->is_partial_write() &&
> > > > > > > > +          !scan_inst->is_partial_var_write(dispatch_width) &&
> > > > > > > >            scan_inst->exec_size == inst->exec_size) {
> > > > > > > >           bool negate;
> > > > > > > >  
> > > > > > > > @@ -126,7 +126,7 @@ cmod_propagate_cmp_to_add(const gen_device_info *devinfo, bblock_t *block,
> > > > > > > >   */
> > > > > > > >  static bool
> > > > > > > >  cmod_propagate_not(const gen_device_info *devinfo, bblock_t *block,
> > > > > > > > -                   fs_inst *inst)
> > > > > > > > +                   fs_inst *inst, unsigned dispatch_width)
> > > > > > > >  {
> > > > > > > >     const enum brw_conditional_mod cond = brw_negate_cmod(inst->conditional_mod);
> > > > > > > >     bool read_flag = false;
> > > > > > > > @@ -141,7 +141,7 @@ cmod_propagate_not(const gen_device_info *devinfo, bblock_t *block,
> > > > > > > >               scan_inst->opcode != BRW_OPCODE_AND)
> > > > > > > >              break;
> > > > > > > >  
> > > > > > > > -         if (scan_inst->is_partial_write() ||
> > > > > > > > +         if (scan_inst->is_partial_var_write(dispatch_width) ||
> > > > > > > >               scan_inst->dst.offset != inst->src[0].offset ||
> > > > > > > >               scan_inst->exec_size != inst->exec_size)
> > > > > > > >              break;
> > > > > > > > @@ -166,7 +166,9 @@ cmod_propagate_not(const gen_device_info *devinfo, bblock_t *block,
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  static bool
> > > > > > > > -opt_cmod_propagation_local(const gen_device_info *devinfo, bblock_t *block)
> > > > > > > > +opt_cmod_propagation_local(const gen_device_info *devinfo,
> > > > > > > > +                           bblock_t *block,
> > > > > > > > +                           unsigned dispatch_width)
> > > > > > > >  {
> > > > > > > >     bool progress = false;
> > > > > > > >     int ip = block->end_ip + 1;
> > > > > > > > @@ -219,14 +221,14 @@ opt_cmod_propagation_local(const gen_device_info *devinfo, bblock_t *block)
> > > > > > > >         */
> > > > > > > >        if (inst->opcode == BRW_OPCODE_CMP && !inst->src[1].is_zero()) {
> > > > > > > >           if (brw_reg_type_is_floating_point(inst->src[0].type) &&
> > > > > > > > -             cmod_propagate_cmp_to_add(devinfo, block, inst))
> > > > > > > > +             cmod_propagate_cmp_to_add(devinfo, block, inst, dispatch_width))
> > > > > > > >              progress = true;
> > > > > > > >  
> > > > > > > >           continue;
> > > > > > > >        }
> > > > > > > >  
> > > > > > > >        if (inst->opcode == BRW_OPCODE_NOT) {
> > > > > > > > -         progress = cmod_propagate_not(devinfo, block, inst) || progress;
> > > > > > > > +         progress = cmod_propagate_not(devinfo, block, inst, dispatch_width) || progress;
> > > > > > > >           continue;
> > > > > > > >        }
> > > > > > > >  
> > > > > > > > @@ -234,7 +236,7 @@ opt_cmod_propagation_local(const gen_device_info *devinfo, bblock_t *block)
> > > > > > > >        foreach_inst_in_block_reverse_starting_from(fs_inst, scan_inst, inst) {
> > > > > > > >           if (regions_overlap(scan_inst->dst, scan_inst->size_written,
> > > > > > > >                               inst->src[0], inst->size_read(0))) {
> > > > > > > > -            if (scan_inst->is_partial_write() ||
> > > > > > > > +            if (scan_inst->is_partial_var_write(dispatch_width) ||
> > > > > > > >                  scan_inst->dst.offset != inst->src[0].offset ||
> > > > > > > >                  scan_inst->exec_size != inst->exec_size)
> > > > > > > >                 break;
> > > > > > > > @@ -342,7 +344,7 @@ fs_visitor::opt_cmod_propagation()
> > > > > > > >     bool progress = false;
> > > > > > > >  
> > > > > > > >     foreach_block_reverse(block, cfg) {
> > > > > > > > -      progress = opt_cmod_propagation_local(devinfo, block) || progress;
> > > > > > > > +      progress = opt_cmod_propagation_local(devinfo, block, dispatch_width) || progress;
> > > > > > > >     }
> > > > > > > >  
> > > > > > > >     if (progress)
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp b/src/intel/compiler/brw_fs_copy_propagation.cpp
> > > > > > > > index c01d4ec4a4f..0f0284115fb 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_copy_propagation.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
> > > > > > > > @@ -505,7 +505,7 @@ fs_visitor::try_copy_propagate(fs_inst *inst, int arg, acp_entry *entry)
> > > > > > > >     /* Compute the first component of the copy that the instruction is
> > > > > > > >      * reading, and the base byte offset within that component.
> > > > > > > >      */
> > > > > > > > -   assert(entry->dst.offset % REG_SIZE == 0 && entry->dst.stride == 1);
> > > > > > > > +   assert(entry->dst.stride == 1);
> > > > > > > >     const unsigned component = rel_offset / type_sz(entry->dst.type);
> > > > > > > >     const unsigned suboffset = rel_offset % type_sz(entry->dst.type);
> > > > > > > >  
> > > > > > > > @@ -783,7 +783,7 @@ fs_visitor::try_constant_propagate(fs_inst *inst, acp_entry *entry)
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  static bool
> > > > > > > > -can_propagate_from(fs_inst *inst)
> > > > > > > > +can_propagate_from(fs_inst *inst, unsigned dispatch_width)
> > > > > > > >  {
> > > > > > > >     return (inst->opcode == BRW_OPCODE_MOV &&
> > > > > > > >             inst->dst.file == VGRF &&
> > > > > > > > @@ -794,7 +794,7 @@ can_propagate_from(fs_inst *inst)
> > > > > > > >              inst->src[0].file == UNIFORM ||
> > > > > > > >              inst->src[0].file == IMM) &&
> > > > > > > >             inst->src[0].type == inst->dst.type &&
> > > > > > > > -           !inst->is_partial_write());
> > > > > > > > +           !inst->is_partial_var_write(dispatch_width));
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  /* Walks a basic block and does copy propagation on it using the acp
> > > > > > > > @@ -846,7 +846,7 @@ fs_visitor::opt_copy_propagation_local(void *copy_prop_ctx, bblock_t *block,
> > > > > > > >        /* If this instruction's source could potentially be folded into the
> > > > > > > >         * operand of another instruction, add it to the ACP.
> > > > > > > >         */
> > > > > > > > -      if (can_propagate_from(inst)) {
> > > > > > > > +      if (can_propagate_from(inst, dispatch_width)) {
> > > > > > > >           acp_entry *entry = ralloc(copy_prop_ctx, acp_entry);
> > > > > > > >           entry->dst = inst->dst;
> > > > > > > >           entry->src = inst->src[0];
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_cse.cpp b/src/intel/compiler/brw_fs_cse.cpp
> > > > > > > > index 6859733d58c..56813df2d2a 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_cse.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_cse.cpp
> > > > > > > > @@ -247,7 +247,8 @@ fs_visitor::opt_cse_local(bblock_t *block)
> > > > > > > >     int ip = block->start_ip;
> > > > > > > >     foreach_inst_in_block(fs_inst, inst, block) {
> > > > > > > >        /* Skip some cases. */
> > > > > > > > -      if (is_expression(this, inst) && !inst->is_partial_write() &&
> > > > > > > > +      if (is_expression(this, inst) &&
> > > > > > > > +          !inst->is_partial_var_write(dispatch_width) &&
> > > > > > > >            ((inst->dst.file != ARF && inst->dst.file != FIXED_GRF) ||
> > > > > > > >             inst->dst.is_null()))
> > > > > > > >        {
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
> > > > > > > > index eeb71dd2b92..f24fd0643b8 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
> > > > > > > > @@ -110,7 +110,7 @@ fs_visitor::dead_code_eliminate()
> > > > > > > >           }
> > > > > > > >  
> > > > > > > >           if (inst->dst.file == VGRF) {
> > > > > > > > -            if (!inst->is_partial_write()) {
> > > > > > > > +            if (!inst->is_partial_reg_write()) {
> > > > > > > >                 int var = live_intervals->var_from_reg(inst->dst);
> > > > > > > >                 for (unsigned i = 0; i < regs_written(inst); i++) {
> > > > > > > >                    BITSET_CLEAR(live, var + i);
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_live_variables.cpp b/src/intel/compiler/brw_fs_live_variables.cpp
> > > > > > > > index 059f076fa51..30625aa586a 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_live_variables.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_live_variables.cpp
> > > > > > > > @@ -84,7 +84,7 @@ fs_live_variables::setup_one_write(struct block_data *bd, fs_inst *inst,
> > > > > > > >      * screens off previous updates of that variable (VGRF channel).
> > > > > > > >      */
> > > > > > > >     if (inst->dst.file == VGRF) {
> > > > > > > > -      if (!inst->is_partial_write() && !BITSET_TEST(bd->use, var))
> > > > > > > > +      if (!inst->is_partial_reg_write() && !BITSET_TEST(bd->use, var))
> > > > > > > >           BITSET_SET(bd->def, var);
> > > > > > > >  
> > > > > > > >        BITSET_SET(bd->defout, var);
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp b/src/intel/compiler/brw_fs_reg_allocate.cpp
> > > > > > > > index 73b8b7841f5..5147e8b4426 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
> > > > > > > > @@ -1026,7 +1026,7 @@ fs_visitor::spill_reg(int spill_reg)
> > > > > > > >            * write, there should be no need for the unspill since the
> > > > > > > >            * instruction will be overwriting the whole destination in any case.
> > > > > > > >            */
> > > > > > > > -         if (inst->is_partial_write() ||
> > > > > > > > +         if (inst->is_partial_reg_write() ||
> > > > > > > >               (!inst->force_writemask_all && !per_channel))
> > > > > > > >              emit_unspill(ubld, spill_src, subset_spill_offset,
> > > > > > > >                           regs_written(inst));
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_register_coalesce.cpp b/src/intel/compiler/brw_fs_register_coalesce.cpp
> > > > > > > > index 952276faed8..b27956023c6 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_register_coalesce.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_register_coalesce.cpp
> > > > > > > > @@ -70,7 +70,7 @@ is_coalesce_candidate(const fs_visitor *v, const fs_inst *inst)
> > > > > > > >  {
> > > > > > > >     if ((inst->opcode != BRW_OPCODE_MOV &&
> > > > > > > >          inst->opcode != SHADER_OPCODE_LOAD_PAYLOAD) ||
> > > > > > > > -       inst->is_partial_write() ||
> > > > > > > > +       inst->is_partial_reg_write() ||
> > > > > > > >         inst->saturate ||
> > > > > > > >         inst->src[0].file != VGRF ||
> > > > > > > >         inst->src[0].negate ||
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_saturate_propagation.cpp b/src/intel/compiler/brw_fs_saturate_propagation.cpp
> > > > > > > > index d6cfa79a618..1e1461063ae 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_saturate_propagation.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_saturate_propagation.cpp
> > > > > > > > @@ -43,7 +43,8 @@
> > > > > > > >   */
> > > > > > > >  
> > > > > > > >  static bool
> > > > > > > > -opt_saturate_propagation_local(fs_visitor *v, bblock_t *block)
> > > > > > > > +opt_saturate_propagation_local(fs_visitor *v, bblock_t *block,
> > > > > > > > +                               unsigned dispatch_width)
> > > > > > > >  {
> > > > > > > >     bool progress = false;
> > > > > > > >     int ip = block->end_ip + 1;
> > > > > > > > @@ -66,7 +67,7 @@ opt_saturate_propagation_local(fs_visitor *v, bblock_t *block)
> > > > > > > >        foreach_inst_in_block_reverse_starting_from(fs_inst, scan_inst, inst) {
> > > > > > > >           if (regions_overlap(scan_inst->dst, scan_inst->size_written,
> > > > > > > >                               inst->src[0], inst->size_read(0))) {
> > > > > > > > -            if (scan_inst->is_partial_write() ||
> > > > > > > > +            if (scan_inst->is_partial_var_write(dispatch_width) ||
> > > > > > > >                  (scan_inst->dst.type != inst->dst.type &&
> > > > > > > >                   !scan_inst->can_change_types()))
> > > > > > > >                 break;
> > > > > > > > @@ -153,7 +154,7 @@ fs_visitor::opt_saturate_propagation()
> > > > > > > >     calculate_live_intervals();
> > > > > > > >  
> > > > > > > >     foreach_block (block, cfg) {
> > > > > > > > -      progress = opt_saturate_propagation_local(this, block) || progress;
> > > > > > > > +      progress = opt_saturate_propagation_local(this, block, dispatch_width) || progress;
> > > > > > > >     }
> > > > > > > >  
> > > > > > > >     /* Live intervals are still valid. */
> > > > > > > > diff --git a/src/intel/compiler/brw_fs_sel_peephole.cpp b/src/intel/compiler/brw_fs_sel_peephole.cpp
> > > > > > > > index 6395b409b7c..98d640a3bfe 100644
> > > > > > > > --- a/src/intel/compiler/brw_fs_sel_peephole.cpp
> > > > > > > > +++ b/src/intel/compiler/brw_fs_sel_peephole.cpp
> > > > > > > > @@ -167,8 +167,8 @@ fs_visitor::opt_peephole_sel()
> > > > > > > >               then_mov[i]->exec_size != else_mov[i]->exec_size ||
> > > > > > > >               then_mov[i]->group != else_mov[i]->group ||
> > > > > > > >               then_mov[i]->force_writemask_all != else_mov[i]->force_writemask_all ||
> > > > > > > > -             then_mov[i]->is_partial_write() ||
> > > > > > > > -             else_mov[i]->is_partial_write() ||
> > > > > > > > +             then_mov[i]->is_partial_var_write(dispatch_width) ||
> > > > > > > > +             else_mov[i]->is_partial_var_write(dispatch_width) ||
> > > > > > > >               then_mov[i]->conditional_mod != BRW_CONDITIONAL_NONE ||
> > > > > > > >               else_mov[i]->conditional_mod != BRW_CONDITIONAL_NONE) {
> > > > > > > >              movs = i;
> > > > > > > > diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
> > > > > > > > index 07e7224e0f8..02c91b246a3 100644
> > > > > > > > --- a/src/intel/compiler/brw_ir_fs.h
> > > > > > > > +++ b/src/intel/compiler/brw_ir_fs.h
> > > > > > > > @@ -349,7 +349,8 @@ public:
> > > > > > > >  
> > > > > > > >     bool equals(fs_inst *inst) const;
> > > > > > > >     bool is_send_from_grf() const;
> > > > > > > > -   bool is_partial_write() const;
> > > > > > > > +   bool is_partial_reg_write() const;
> > > > > > > > +   bool is_partial_var_write(unsigned dispatch_width) const;
> > > > > > > >     bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
> > > > > > > >     unsigned components_read(unsigned i) const;
> > > > > > > >     unsigned size_read(int arg) const;
> > > > > > > -- 
> > > > > > > 2.17.1
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > mesa-dev mailing list
> > > > > > > mesa-dev at lists.freedesktop.org
> > > > > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > > > > 
> > > > > 
> > > 
> > > 
> 
> 


