[Mesa-dev] [PATCH v2 32/53] intel/fs: Mark LINTERP opcode as writing accumulator on platforms without PLN

Tue May 29 16:59:55 UTC 2018

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Mon, May 28, 2018 at 10:50 AM, Francisco Jerez <currojerez at riseup.net>
> wrote:
>
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>> > On Sat, May 26, 2018 at 6:33 PM, Francisco Jerez <currojerez at riseup.net>
>> > wrote:
>> >
>> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >>
>> >> > On Sat, May 26, 2018 at 3:36 PM, Francisco Jerez <
>> currojerez at riseup.net>
>> >> > wrote:
>> >> >
>> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >> >>
>> >> >> > On Sat, May 26, 2018 at 2:03 PM, Francisco Jerez <
>> >> currojerez at riseup.net>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >> >> >>
>> >> >> >> > On Sat, May 26, 2018 at 12:21 PM, Francisco Jerez <
>> >> >> currojerez at riseup.net
>> >> >> >> >
>> >> >> >> > wrote:
>> >> >> >> >
>> >> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >> >> >> >>
>> >> >> >> >> > On Sat, May 26, 2018 at 11:10 AM, Francisco Jerez <
>> >> >> >> currojerez at riseup.net
>> >> >> >> >> >
>> >> >> >> >> > wrote:
>> >> >> >> >> >
>> >> >> >> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > From: Francisco Jerez <currojerez at riseup.net>
>> >> >> >> >> >> >
>> >> >> >> >> >> > When we don't have PLN (gen4 and gen11+), we implement
>> >> LINTERP
>> >> >> as
>> >> >> >> >> either
>> >> >> >> >> >> > LINE+MAC or a pair of MADs.  In both cases, the
>> accumulator
>> >> is
>> >> >> >> written
>> >> >> >> >> >> > by the first of the two instructions and read by the
>> second.
>> >> >> Even
>> >> >> >> >> >> > though the accumulator value isn't actually ever used
>> from a
>> >> >> >> logical
>> >> >> >> >> >> > instruction perspective, it is trashed so we need to make
>> the
>> >> >> >> >> scheduler
>> >> >> >> >> >> > aware.  Otherwise, the scheduler could end up re-ordering
>> >> >> >> instructions
>> >> >> >> >> >> > and putting a LINTERP between another an instruction which
>> >> >> writes
>> >> >> >> the
>> >> >> >> >> >> > accumulator and another which tries to use that result.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Cc: mesa-stable at lists.freedesktop.org
>> >> >> >> >> >> > ---
>> >> >> >> >> >> >  src/intel/compiler/brw_shader.cpp | 3 ++-
>> >> >> >> >> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >> >> >> >> >
>> >> >> >> >> >> > diff --git a/src/intel/compiler/brw_shader.cpp
>> >> >> >> >> b/src/intel/compiler/brw_
>> >> >> >> >> >> shader.cpp
>> >> >> >> >> >> > index 141b64e..dfd2c5c 100644
>> >> >> >> >> >> > --- a/src/intel/compiler/brw_shader.cpp
>> >> >> >> >> >> > +++ b/src/intel/compiler/brw_shader.cpp
>> >> >> >> >> >> > @@ -984,7 +984,8 @@ backend_instruction::writes_
>> >> >> >> >> accumulator_implicitly(const
>> >> >> >> >> >> struct gen_device_info
>> >> >> >> >> >> >     return writes_accumulator ||
>> >> >> >> >> >> >            (devinfo->gen < 6 &&
>> >> >> >> >> >> >             ((opcode >= BRW_OPCODE_ADD && opcode <
>> >> >> BRW_OPCODE_NOP)
>> >> >> >> ||
>> >> >> >> >> >> > -            (opcode >= FS_OPCODE_DDX_COARSE && opcode <=
>> >> >> >> >> >> FS_OPCODE_LINTERP)));
>> >> >> >> >> >> > +            (opcode >= FS_OPCODE_DDX_COARSE && opcode <=
>> >> >> >> >> >> FS_OPCODE_LINTERP))) ||
>> >> >> >> >> >> > +          (opcode == FS_OPCODE_LINTERP &&
>> >> !devinfo->has_pln);
>> >> >> >> >> >>
>> >> >> >> >> >> The devinfo->has_pln condition is technically inaccurate,
>> >> because
>> >> >> >> even
>> >> >> >> >> >> SNB will fall back to the non-PLN path which overwrites the
>> >> >> >> accumulator
>> >> >> >> >> >> for certain valid IR, which yeah I'm aware is not
>> *typically*
>> >> >> >> >> >> encountered before this series,
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > I'm pretty sure it's impossible before this series because,
>> >> without
>> >> >> >> >> SIMD32,
>> >> >> >> >> > the barycentric coordinates always start at g2 and get
>> >> incremented
>> >> >> by
>> >> >> >> 2
>> >> >> >> >> > every time.  The only other way to get something into the
>> >> >> coordinates
>> >> >> >> >> > source of the PLN is with a pixel interpolator message.  For
>> >> that,
>> >> >> the
>> >> >> >> >> > register allocator has a workaround to ensure that it's
>> >> assigned an
>> >> >> >> even
>> >> >> >> >> > register on SNB.  One of the early patches in my series
>> replaces
>> >> >> the
>> >> >> >> >> broken
>> >> >> >> >> > gen4.5-6 PLN fall-back (it didn't work in SIMD16 because it
>> >> assumed
>> >> >> >> the
>> >> >> >> >> > wrong register layout for coordinates) with an assert and
>> >> Jenkins
>> >> >> is
>> >> >> >> just
>> >> >> >> >> > fine with the assert.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> I know the conditions for the non-PLN fall-back are not
>> typically
>> >> >> >> >> encountered on Gen5-6, but it's still valid IR, so this
>> >> >> implementation
>> >> >> >> >> of writes_accumulator_implicitly() relies on the behavior of
>> the
>> >> >> >> >> register allocator, the NIR-to-i965 translation pass and the
>> rest
>> >> of
>> >> >> the
>> >> >> >> >> visitor being exactly what you expect in every possible
>> codepath
>> >> for
>> >> >> >> >> correctness.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > It's already relied on for correctness because the LINE+MAC
>> >> fallback
>> >> >> that
>> >> >> >> > we had before is wrong in SIMD16 (see patch 33).
>> >> >> >> >
>> >> >> >>
>> >> >> >> Yes, I had to do basically the same fix on my SIMD32 branch in
>> order
>> >> for
>> >> >> >> it to pass piglit on Gen4-5.  Still hardly an excuse to make more
>> >> code
>> >> >> >> rely on the same broken assumption which wasn't doing it before.
>> >>
>> >
>> > I think this is the fundamental point of disagreement.  What I would
>> argue
>> > (and what this series does) is that it's not possible to hit odd register
>> > case and, if we do, so badly broken on g4x-SNB (uses the wrong register
>> > layout in SIMD16 and doesn't actually bother to write the accumulator on
>> > gen6) that we're better off if we just stop pretending it works.  Patch
>> > 1/53 of this series just removes the fall-back and replaces it with an
>> > assert at which point the odd register fall-back path does not exist.
>> > Jenkins is happy with us doing so.  This patch then fixes
>> > writes_accumulator_implicitly for the !has_pln case and the next patch
>> > makes re-implements the the odd register case correctly because it really
>> > is needed for SIMD32.
>> >
>>
>> You're reiterating the same point.  The content of the paragraph above
>> was clear four e-mails ago.  It still doesn't contain any convincing
>> argument to me to split an obviously correct change into a buggy one
>> followed by a bugfix.
>>
>
> Would it make you happier if things were re-ordered such that we drop patch
> 1, do the correct implementation of LINE+MAC first and then follow it with
> a version of this patch that includes all the cases?
>

I don't care about how you order the patches.

>
>> > If you're really concerned about the odd register case on the stable
>> > branch, I'm happy to back port all three patches.  It won't do any good
>> but
>> > it also won't cause any harm either.
>> >
>>
>> I wasn't hugely concerned about the odd register case being hit on the
>> stable branch (even though it passing Jenkins doesn't guarantee it will
>> never be hit for some codepaths in the compiler which happen not to be
>> hit by any test-case on the CI), I was only complaining about the
>> seemingly idiotic structure of patches you had sent with my name.
>>
>> >>> > I'm very confused by what you mean here.  The old code assumed the
>> >> >> LINE+MAC
>> >> >> > layout register layout which is different from the PLN layout.
>> This
>> >> >> means
>> >> >> > that the fallback is broken for g45-gen6 because the hardware
>> supports
>> >> >> PLN
>> >> >> > so we use the PLN layout which arranges thing x1y1x2y2 but the
>> >> LINE+MAC
>> >> >> > fallback assumes x1x2y1y2.  The reason why we need so much more
>> code
>> >> is
>> >> >> > because we have to split it all the way down to 8-wide in order to
>> >> handle
>> >> >> > the PLN layout with LINE+MAC.  Looking at the code in your branch,
>> I
>> >> have
>> >> >> > no idea how it could possibly work correctly.
>> >> >> >
>> >> >>
>> >> >> It worked correctly with the LINE/MAC path (and in fact it passed
>> piglit
>> >> >> with SIMD32 enabled on Gen4-5) in exactly the same way PATCH 33 of
>> this
>> >> >> series does, but it used a loop instead of the 3x hand-unrolled code
>> of
>> >> >> PATCH 33, and it took advantage of the two accumulator registers for
>> >> >> better pipelining.
>> >> >
>> >> >
>> >> > I dug around a bit more in your your branch and I think I know what's
>> >> going
>> >> > on now.  In your patch "i965/fs: Fix LINTERP instruction codegen for
>> >> > misaligned regs on pre-Gen7 hw", you changed the LINE+MAC emit code to
>> >> use
>> >> > a loop to generate 1, 2, or 4 8-wide instructions.  This implicitly
>> >> > switches the assumed layout of barycentric coordinates in the
>> fall-back
>> >> > code to the PLN convention (x1y1x2y2) and breaks SIMD16 on gen4 which
>> >> lays
>> >> > the barycentric coordinates out as x1x2y1y2.  In "i965/fs: Fix Gen4-5
>> >> > interpolation setup for SIMD32", you silently change the barycentric
>> >> > coordinate layout convention on gen4 to be the PLN convention and it
>> gets
>> >> > fixed again.
>> >>
>> >> Yes, that's right, it relies on the other patch that makes the payload
>> >> barycentric layout consistent across generations.  And yes, I didn't
>> >> spend any effort documenting the changes nor ordering things in a way
>> >> that would allow bisection, since that branch wasn't meant to be
>> >> reviewed nor merged upstream yet.
>> >>
>> >
>> > Ok, at least we're now on the same page about how various branches work.
>> >
>> >
>> >> > I'm also very confused about accumulators since the SNB documentation
>> >> > says that we have 16 floats worth (two accumulators) and your branch
>> >> > definitely does 4 8-wide LINE instructions followed by 4 8-wide MAC
>> >> > instructions.  I agree that taking advantage of the two accumulators
>> >> > would be good, but the patch I see in your branch seems to be
>> >> > incorrect.
>> >> >
>> >> No, my branch wasn't doing 4 8-wide instructions in a row, that would
>> >> require four accumulators -- It was relying on SIMD lowering to split
>> >> things up into 16-wide LINTERP instructions, just like your patch is.
>> >>
>> >
>> > Ok, I see that now.  Also, I just realized how creepy it is to have the
>> > SIMD lowering pass lower LINTERP.  It's not going to do it correctly
>> > because it doesn't understand the channel layout of the payload when it's
>> > set up for PLN.  We get by because none of those MOVs ever happen in the
>> > end.
>> >
>>
>> Actually no, that was working fine.  My branch addressed the problem by
>> laying out SIMD32 payloads according to the inverse of the permutation
>> that emit_unzip() will apply on it subsequently, like "x0y0 x2y2 x1y1
>> x3y3", which after unzip gives you "x0y0 x1y1" for the first half and
>> "x2y2 x3y3" for the second.  The MOVs should always be emitted by the
>> SIMD lowering pass for those because they're multiple-component source
>> regions, but they should be optimized out.
>>
>
> Right.  My comment there was that the MOVs will be emitted with the wrong
> channel enables.  We'll likely move x0y0 with one SIMD16 MOV and x1y1 with
> another.  Since the optimizer gets rid of the MOVs, it's not a problem.
>

You should probably fix that as part of this series.  exec_all() is your
friend.

>
>> >
>> >> > All in all, changing the gen4 barycentric coordinate convention
>> doesn't
>> >> > seem like all that bad of a thing.  It's a bit odd that we lay them
>> out
>> >> > differently on exactly one platform.  I suppose, though, that we
>> might be
>> >> > able to generate SIMD16 MADs on gen11 if we started using the LINE
>> >> > convention again.  Not sure if it's worth it.
>> >> >
>> >> >
>> >> >> I can send you the patch as a replacement for PATCH
>> >> >> 33 if you care at all about performance of this corner case on
>> Gen4-6...
>> >> >>
>> >> >> > My point is that the odd register fallback on g4x-gen6 was never
>> used
>> >> and
>> >> >> > didn't work even if it was.
>> >> >> >
>> >> >>
>> >> >> My point was that it doesn't matter for correctness of this patch
>> >> >> whether another codepath in the compiler is buggy under approximately
>> >> >> the same conditions.  If you think there is some sort of benefit from
>> >> >> committing knowingly broken code just so you can fix it in the next
>> >> >> commit, feel free, but please drop my name from the author tag since
>> I
>> >> >> didn't intend it to.
>> >>
>> >
>> > Done.
>> >
>> >
>> >> >> > --Jason
>> >> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> >> That wasn't the case in my original patch, nor in your
>> >> >> >> >> series after PATCHv2 33 because this inaccuracy actually
>> becomes a
>> >> >> >> >> problem.  Instead of introducing code which we know is
>> dubiously
>> >> >> >> >> correct, CC'ing mesa-stable, and then fixing it immediately,
>> why
>> >> >> don't
>> >> >> >> >> we just do the obviously correct thing from the start?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Sure, we can squash them together if you really want.  But if
>> >> you're
>> >> >> >> > concerned about this restriction being valid in live code, then
>> >> 33/53
>> >> >> >> also
>> >> >> >> > needs to go to stable.  I'm fine with that if you're unconvinced
>> >> by my
>> >> >> >> > argument that LINTERP with an odd coordinate never occurs.
>> >> >> >> >
>> >> >> >> > --Jason
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >> but why make things more inaccurate than
>> >> >> >> >> >> the original only to revert back to a devinfo->gen based
>> check
>> >> in
>> >> >> >> >> >> PATCHv2 33?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > See above.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >> I think I'd squash the last hunk of PATCHv2 33 into this one
>> >> which
>> >> >> >> would
>> >> >> >> >> >> give you something functionally equivalent to v1 but
>> updated to
>> >> >> >> handle
>> >> >> >> >> >> Gen11 correctly.
>> >> >> >> >> >>
>> >> >> >> >> >> >  }
>> >> >> >> >> >> >
>> >> >> >> >> >> >  bool
>> >> >> >> >> >> > --
>> >> >> >> >> >> > 2.5.0.400.gff86faf
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180529/07a404aa/attachment-0001.sig>