On Aug 12, 2015 4:41 AM, "Francisco Jerez" <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> wrote: > > Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> writes: > > > Neil Roberts <<a href="mailto:neil@linux.intel.com">neil@linux.intel.com</a>> writes: > > > >> Ok, that makes sense, thanks for the detailed explanation. I think > >> you're right, given the extra restrictions avoiding the integer dword > >> multiplication seems like the simplest solution. > >> > >> I can't say that I fully understand what the documentation is trying to > >> say here. Does it mean that the horizontal strides must be set such that > >> each element in the source is offset to the same qword within its > >> register as the corresponding element in the destination? > >> > > > > The way I understand it one's required to: > > - Align the stride in bytes (i.e. horizontal stride times the size of > > the register type) between components of an individual source or > > destination to a multiple of 64b (if the register region is more than > > a single scalar that is). The multiple must be the same for all > > non-scalar sources and the destination. > > - Use the same sub-register offsets for all non-scalar sources and the > > destination. > > - Set vertical stride to horizontal stride times width. > > - Don't use indirect addressing or ARF registers. > > > >> I tried changing the test case¹ so that one of the sources comes from a > >> uniform. That would make the register for the uniform source be a scalar > >> and thus all of its elements would come from the first qword, whereas > >> for the destination the elements would be written to different qwords. > >> That ends up with an instruction like this: > >> > >> mul(16) g120<1>D g9<8,8,1>D g6<0,1,0>D { align1 > >> 1H compacted }; > >> > > > > Yeah, this should definitely violate the alignment rule on both the > > destination and first source. > > > >> However, the test still passes on my BXT so I've probably misunderstood > >> something. On the other hand this does look similar to the stride you > >> had in your example instruction that the simulator complained about. > >> > >> Given that the documentation and simulator implies this shouldn't work > >> I'm happy to leave this patch and I'm not suggesting we do anything now. > >> > > > > Honestly I'm also not 100% sure if this restriction really applies to > > all BXT hardware or if it's specific to some preproduction steppings -- > > The documentation seems to imply the former, but I guess it could be > > wrong. > > > I forgot to mention that when I tried this yesterday the simulator > seemed to complain about the alignment rule regardless of the stepping > for all the ones it recognised (a-c IIRC). Have you actually seen this fail on hardware? I'm guessing no since BXT is still hard to come by. If it works on silicone, I'd be tempted to leave it alone. I guess it does make it a bit difficult to debug other things if the simulator is complaining about the multiply instruction. --Jason > >> Regards, > >> - Neil > >> > >> 1. <a href="https://github.com/bpeel/piglit/blob/test-integer-multiply-uniform/tests/general/mult32.c">https://github.com/bpeel/piglit/blob/test-integer-multiply-uniform/tests/general/mult32.c</a> > >> > >> Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> writes: > >> > >>> Neil Roberts <<a href="mailto:neil@linux.intel.com">neil@linux.intel.com</a>> writes: > >>> > >>>> Are you sure this patch is necessary? The documentation for the multiply > >>>> instruction on BDW+ says: > >>>> > >>>> SourceType : *D > >>>> DestinationType : *D > >>>> Project : EXCLUDE(CHV) > >>>> > >>>> This to me implies that it should work on BXT because it doesn't say > >>>> EXCLUDE(BXT). > >>>> > >>> > >>> In fact both CHV and BXT *can* support 32x32 multiplication, that's not > >>> really the reason that motivated this patch -- The problem is that 32x32 > >>> multiplication has QWORD execution type which has several restrictions > >>> on CHV and BXT, the annoying one is: > >>> > >>> "CHV, BXT > >>> > >>> When source or destination datatype is 64b or operation is integer DWord > >>> multiply, regioning in Align1 must follow these rules: > >>> > >>> Source and Destination horizontal stride must be aligned to the same > >>> qword. > >>> > >>> Example: > >>> [...] > >>> // mul (4) r10.0<2>:d r11.0<8;4,2>:d r12.0<8;4,2>:d // Source and Destination stride must be 2 since the execution type is Qword." > >>> > >>> So it sounds like we could use the native 32x32 multiply with some > >>> additional effort to pass the arguments with the correct stride, but > >>> it's not clear to me whether the solution would be any more better than > >>> splitting up the multiply into 32x16 multiplies, so doing the same as in > >>> CHV and pre-Gen8 seemed like the most KISS solution for now. > >>> > >>>> I made a little test case and ran it on my BXT and it seems to work even > >>>> without this patch. I looked at the assembler source and it is > >>>> definitely doing a MUL instruction with D types for the dst and two > >>>> sources. > >>>> > >>>> <a href="https://github.com/bpeel/piglit/blob/test-integer-multiply/tests/general/mult32.c">https://github.com/bpeel/piglit/blob/test-integer-multiply/tests/general/mult32.c</a> > >>>> > >>> Hmm, I guess the docs could be wrong, but I'm not sure what the > >>> consequences would be of violating the alignment rule, I guess the > >>> failure may be non-deterministic. What stepping of BXT did you test > >>> this on? > >>> > >>>> Regards, > >>>> - Neil > >>>> > >>>> Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> writes: > >>>> > >>>>> AFAIK BXT has the same annoying alignment limitation as CHV on the > >>>>> source register regions of 32x32 bit MULs, give it the same treatment. > >>>>> --- > >>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 4 ++-- > >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp > >>>>> index 244f299..fc9f007 100644 > >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > >>>>> @@ -3126,9 +3126,9 @@ fs_visitor::lower_integer_multiplication() > >>>>> bool progress = false; > >>>>> > >>>>> /* Gen8's MUL instruction can do a 32-bit x 32-bit -> 32-bit operation > >>>>> - * directly, but Cherryview cannot. > >>>>> + * directly, but CHV/BXT cannot. > >>>>> */ > >>>>> - if (devinfo->gen >= 8 && !devinfo->is_cherryview) > >>>>> + if (devinfo->gen >= 8 && !devinfo->is_cherryview && !devinfo->is_broxton) > >>>>> return false; > >>>>> > >>>>> foreach_block_and_inst_safe(block, fs_inst, inst, cfg) { > >>>>> -- > >>>>> 2.4.6 > >>>>> > >>>>> _______________________________________________ > >>>>> mesa-dev mailing list > >>>>> <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a> > >>>>> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> > > _______________________________________________ > mesa-dev mailing list > <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a> > <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> >