[Mesa-stable] [PATCH v2 32/53] intel/fs: Mark LINTERP opcode as writing accumulator on platforms without PLN

Tue May 29 15:30:39 UTC 2018

On Mon, May 28, 2018 at 10:50 AM, Francisco Jerez <currojerez at riseup.net>
wrote:

> Jason Ekstrand <jason at jlekstrand.net> writes:
>
> > On Sat, May 26, 2018 at 6:33 PM, Francisco Jerez <currojerez at riseup.net>
> > wrote:
> >
> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >>
> >> > On Sat, May 26, 2018 at 3:36 PM, Francisco Jerez <
> currojerez at riseup.net>
> >> > wrote:
> >> >
> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >> >>
> >> >> > On Sat, May 26, 2018 at 2:03 PM, Francisco Jerez <
> >> currojerez at riseup.net>
> >> >> > wrote:
> >> >> >
> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >> >> >>
> >> >> >> > On Sat, May 26, 2018 at 12:21 PM, Francisco Jerez <
> >> >> currojerez at riseup.net
> >> >> >> >
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >> >> >> >>
> >> >> >> >> > On Sat, May 26, 2018 at 11:10 AM, Francisco Jerez <
> >> >> >> currojerez at riseup.net
> >> >> >> >> >
> >> >> >> >> > wrote:
> >> >> >> >> >
> >> >> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >> >> >> >> >>
> >> >> >> >> >> > From: Francisco Jerez <currojerez at riseup.net>
> >> >> >> >> >> >
> >> >> >> >> >> > When we don't have PLN (gen4 and gen11+), we implement
> >> LINTERP
> >> >> as
> >> >> >> >> either
> >> >> >> >> >> > LINE+MAC or a pair of MADs.  In both cases, the
> accumulator
> >> is
> >> >> >> written
> >> >> >> >> >> > by the first of the two instructions and read by the
> second.
> >> >> Even
> >> >> >> >> >> > though the accumulator value isn't actually ever used
> from a
> >> >> >> logical
> >> >> >> >> >> > instruction perspective, it is trashed so we need to make
> the
> >> >> >> >> scheduler
> >> >> >> >> >> > aware.  Otherwise, the scheduler could end up re-ordering
> >> >> >> instructions
> >> >> >> >> >> > and putting a LINTERP between another an instruction which
> >> >> writes
> >> >> >> the
> >> >> >> >> >> > accumulator and another which tries to use that result.
> >> >> >> >> >> >
> >> >> >> >> >> > Cc: mesa-stable at lists.freedesktop.org
> >> >> >> >> >> > ---
> >> >> >> >> >> >  src/intel/compiler/brw_shader.cpp | 3 ++-
> >> >> >> >> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >> >> >> >> >
> >> >> >> >> >> > diff --git a/src/intel/compiler/brw_shader.cpp
> >> >> >> >> b/src/intel/compiler/brw_
> >> >> >> >> >> shader.cpp
> >> >> >> >> >> > index 141b64e..dfd2c5c 100644
> >> >> >> >> >> > --- a/src/intel/compiler/brw_shader.cpp
> >> >> >> >> >> > +++ b/src/intel/compiler/brw_shader.cpp
> >> >> >> >> >> > @@ -984,7 +984,8 @@ backend_instruction::writes_
> >> >> >> >> accumulator_implicitly(const
> >> >> >> >> >> struct gen_device_info
> >> >> >> >> >> >     return writes_accumulator ||
> >> >> >> >> >> >            (devinfo->gen < 6 &&
> >> >> >> >> >> >             ((opcode >= BRW_OPCODE_ADD && opcode <
> >> >> BRW_OPCODE_NOP)
> >> >> >> ||
> >> >> >> >> >> > -            (opcode >= FS_OPCODE_DDX_COARSE && opcode <=
> >> >> >> >> >> FS_OPCODE_LINTERP)));
> >> >> >> >> >> > +            (opcode >= FS_OPCODE_DDX_COARSE && opcode <=
> >> >> >> >> >> FS_OPCODE_LINTERP))) ||
> >> >> >> >> >> > +          (opcode == FS_OPCODE_LINTERP &&
> >> !devinfo->has_pln);
> >> >> >> >> >>
> >> >> >> >> >> The devinfo->has_pln condition is technically inaccurate,
> >> because
> >> >> >> even
> >> >> >> >> >> SNB will fall back to the non-PLN path which overwrites the
> >> >> >> accumulator
> >> >> >> >> >> for certain valid IR, which yeah I'm aware is not
> *typically*
> >> >> >> >> >> encountered before this series,
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > I'm pretty sure it's impossible before this series because,
> >> without
> >> >> >> >> SIMD32,
> >> >> >> >> > the barycentric coordinates always start at g2 and get
> >> incremented
> >> >> by
> >> >> >> 2
> >> >> >> >> > every time.  The only other way to get something into the
> >> >> coordinates
> >> >> >> >> > source of the PLN is with a pixel interpolator message.  For
> >> that,
> >> >> the
> >> >> >> >> > register allocator has a workaround to ensure that it's
> >> assigned an
> >> >> >> even
> >> >> >> >> > register on SNB.  One of the early patches in my series
> replaces
> >> >> the
> >> >> >> >> broken
> >> >> >> >> > gen4.5-6 PLN fall-back (it didn't work in SIMD16 because it
> >> assumed
> >> >> >> the
> >> >> >> >> > wrong register layout for coordinates) with an assert and
> >> Jenkins
> >> >> is
> >> >> >> just
> >> >> >> >> > fine with the assert.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> I know the conditions for the non-PLN fall-back are not
> typically
> >> >> >> >> encountered on Gen5-6, but it's still valid IR, so this
> >> >> implementation
> >> >> >> >> of writes_accumulator_implicitly() relies on the behavior of
> the
> >> >> >> >> register allocator, the NIR-to-i965 translation pass and the
> rest
> >> of
> >> >> the
> >> >> >> >> visitor being exactly what you expect in every possible
> codepath
> >> for
> >> >> >> >> correctness.
> >> >> >> >
> >> >> >> >
> >> >> >> > It's already relied on for correctness because the LINE+MAC
> >> fallback
> >> >> that
> >> >> >> > we had before is wrong in SIMD16 (see patch 33).
> >> >> >> >
> >> >> >>
> >> >> >> Yes, I had to do basically the same fix on my SIMD32 branch in
> order
> >> for
> >> >> >> it to pass piglit on Gen4-5.  Still hardly an excuse to make more
> >> code
> >> >> >> rely on the same broken assumption which wasn't doing it before.
> >>
> >
> > I think this is the fundamental point of disagreement.  What I would
> argue
> > (and what this series does) is that it's not possible to hit odd register
> > case and, if we do, so badly broken on g4x-SNB (uses the wrong register
> > layout in SIMD16 and doesn't actually bother to write the accumulator on
> > gen6) that we're better off if we just stop pretending it works.  Patch
> > 1/53 of this series just removes the fall-back and replaces it with an
> > assert at which point the odd register fall-back path does not exist.
> > Jenkins is happy with us doing so.  This patch then fixes
> > writes_accumulator_implicitly for the !has_pln case and the next patch
> > makes re-implements the the odd register case correctly because it really
> > is needed for SIMD32.
> >
>
> You're reiterating the same point.  The content of the paragraph above
> was clear four e-mails ago.  It still doesn't contain any convincing
> argument to me to split an obviously correct change into a buggy one
> followed by a bugfix.
>

Would it make you happier if things were re-ordered such that we drop patch
1, do the correct implementation of LINE+MAC first and then follow it with
a version of this patch that includes all the cases?

> > If you're really concerned about the odd register case on the stable
> > branch, I'm happy to back port all three patches.  It won't do any good
> but
> > it also won't cause any harm either.
> >
>
> I wasn't hugely concerned about the odd register case being hit on the
> stable branch (even though it passing Jenkins doesn't guarantee it will
> never be hit for some codepaths in the compiler which happen not to be
> hit by any test-case on the CI), I was only complaining about the
> seemingly idiotic structure of patches you had sent with my name.
>
> >>> > I'm very confused by what you mean here.  The old code assumed the
> >> >> LINE+MAC
> >> >> > layout register layout which is different from the PLN layout.
> This
> >> >> means
> >> >> > that the fallback is broken for g45-gen6 because the hardware
> supports
> >> >> PLN
> >> >> > so we use the PLN layout which arranges thing x1y1x2y2 but the
> >> LINE+MAC
> >> >> > fallback assumes x1x2y1y2.  The reason why we need so much more
> code
> >> is
> >> >> > because we have to split it all the way down to 8-wide in order to
> >> handle
> >> >> > the PLN layout with LINE+MAC.  Looking at the code in your branch,
> I
> >> have
> >> >> > no idea how it could possibly work correctly.
> >> >> >
> >> >>
> >> >> It worked correctly with the LINE/MAC path (and in fact it passed
> piglit
> >> >> with SIMD32 enabled on Gen4-5) in exactly the same way PATCH 33 of
> this
> >> >> series does, but it used a loop instead of the 3x hand-unrolled code
> of
> >> >> PATCH 33, and it took advantage of the two accumulator registers for
> >> >> better pipelining.
> >> >
> >> >
> >> > I dug around a bit more in your your branch and I think I know what's
> >> going
> >> > on now.  In your patch "i965/fs: Fix LINTERP instruction codegen for
> >> > misaligned regs on pre-Gen7 hw", you changed the LINE+MAC emit code to
> >> use
> >> > a loop to generate 1, 2, or 4 8-wide instructions.  This implicitly
> >> > switches the assumed layout of barycentric coordinates in the
> fall-back
> >> > code to the PLN convention (x1y1x2y2) and breaks SIMD16 on gen4 which
> >> lays
> >> > the barycentric coordinates out as x1x2y1y2.  In "i965/fs: Fix Gen4-5
> >> > interpolation setup for SIMD32", you silently change the barycentric
> >> > coordinate layout convention on gen4 to be the PLN convention and it
> gets
> >> > fixed again.
> >>
> >> Yes, that's right, it relies on the other patch that makes the payload
> >> barycentric layout consistent across generations.  And yes, I didn't
> >> spend any effort documenting the changes nor ordering things in a way
> >> that would allow bisection, since that branch wasn't meant to be
> >> reviewed nor merged upstream yet.
> >>
> >
> > Ok, at least we're now on the same page about how various branches work.
> >
> >
> >> > I'm also very confused about accumulators since the SNB documentation
> >> > says that we have 16 floats worth (two accumulators) and your branch
> >> > definitely does 4 8-wide LINE instructions followed by 4 8-wide MAC
> >> > instructions.  I agree that taking advantage of the two accumulators
> >> > would be good, but the patch I see in your branch seems to be
> >> > incorrect.
> >> >
> >> No, my branch wasn't doing 4 8-wide instructions in a row, that would
> >> require four accumulators -- It was relying on SIMD lowering to split
> >> things up into 16-wide LINTERP instructions, just like your patch is.
> >>
> >
> > Ok, I see that now.  Also, I just realized how creepy it is to have the
> > SIMD lowering pass lower LINTERP.  It's not going to do it correctly
> > because it doesn't understand the channel layout of the payload when it's
> > set up for PLN.  We get by because none of those MOVs ever happen in the
> > end.
> >
>
> Actually no, that was working fine.  My branch addressed the problem by
> laying out SIMD32 payloads according to the inverse of the permutation
> that emit_unzip() will apply on it subsequently, like "x0y0 x2y2 x1y1
> x3y3", which after unzip gives you "x0y0 x1y1" for the first half and
> "x2y2 x3y3" for the second.  The MOVs should always be emitted by the
> SIMD lowering pass for those because they're multiple-component source
> regions, but they should be optimized out.
>

Right.  My comment there was that the MOVs will be emitted with the wrong
channel enables.  We'll likely move x0y0 with one SIMD16 MOV and x1y1 with
another.  Since the optimizer gets rid of the MOVs, it's not a problem.

> >
> >> > All in all, changing the gen4 barycentric coordinate convention
> doesn't
> >> > seem like all that bad of a thing.  It's a bit odd that we lay them
> out
> >> > differently on exactly one platform.  I suppose, though, that we
> might be
> >> > able to generate SIMD16 MADs on gen11 if we started using the LINE
> >> > convention again.  Not sure if it's worth it.
> >> >
> >> >
> >> >> I can send you the patch as a replacement for PATCH
> >> >> 33 if you care at all about performance of this corner case on
> Gen4-6...
> >> >>
> >> >> > My point is that the odd register fallback on g4x-gen6 was never
> used
> >> and
> >> >> > didn't work even if it was.
> >> >> >
> >> >>
> >> >> My point was that it doesn't matter for correctness of this patch
> >> >> whether another codepath in the compiler is buggy under approximately
> >> >> the same conditions.  If you think there is some sort of benefit from
> >> >> committing knowingly broken code just so you can fix it in the next
> >> >> commit, feel free, but please drop my name from the author tag since
> I
> >> >> didn't intend it to.
> >>
> >
> > Done.
> >
> >
> >> >> > --Jason
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> >> That wasn't the case in my original patch, nor in your
> >> >> >> >> series after PATCHv2 33 because this inaccuracy actually
> becomes a
> >> >> >> >> problem.  Instead of introducing code which we know is
> dubiously
> >> >> >> >> correct, CC'ing mesa-stable, and then fixing it immediately,
> why
> >> >> don't
> >> >> >> >> we just do the obviously correct thing from the start?
> >> >> >> >>
> >> >> >> >
> >> >> >> > Sure, we can squash them together if you really want.  But if
> >> you're
> >> >> >> > concerned about this restriction being valid in live code, then
> >> 33/53
> >> >> >> also
> >> >> >> > needs to go to stable.  I'm fine with that if you're unconvinced
> >> by my
> >> >> >> > argument that LINTERP with an odd coordinate never occurs.
> >> >> >> >
> >> >> >> > --Jason
> >> >> >> >
> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >> but why make things more inaccurate than
> >> >> >> >> >> the original only to revert back to a devinfo->gen based
> check
> >> in
> >> >> >> >> >> PATCHv2 33?
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > See above.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >> I think I'd squash the last hunk of PATCHv2 33 into this one
> >> which
> >> >> >> would
> >> >> >> >> >> give you something functionally equivalent to v1 but
> updated to
> >> >> >> handle
> >> >> >> >> >> Gen11 correctly.
> >> >> >> >> >>
> >> >> >> >> >> >  }
> >> >> >> >> >> >
> >> >> >> >> >> >  bool
> >> >> >> >> >> > --
> >> >> >> >> >> > 2.5.0.400.gff86faf
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-stable/attachments/20180529/9a95664d/attachment-0001.html>