[Mesa-dev] [PATCH 2/5] i965/fs: Lower 32x32 bit multiplication on BXT.

Wed Aug 12 04:25:21 PDT 2015

Neil Roberts <neil at linux.intel.com> writes:

> Ok, that makes sense, thanks for the detailed explanation. I think
> you're right, given the extra restrictions avoiding the integer dword
> multiplication seems like the simplest solution.
>
> I can't say that I fully understand what the documentation is trying to
> say here. Does it mean that the horizontal strides must be set such that
> each element in the source is offset to the same qword within its
> register as the corresponding element in the destination?
>

The way I understand it one's required to:
 - Align the stride in bytes (i.e. horizontal stride times the size of
   the register type) between components of an individual source or
   destination to a multiple of 64b (if the register region is more than
   a single scalar that is).  The multiple must be the same for all
   non-scalar sources and the destination.
 - Use the same sub-register offsets for all non-scalar sources and the
   destination.
 - Set vertical stride to horizontal stride times width.
 - Don't use indirect addressing or ARF registers.

> I tried changing the test case¹ so that one of the sources comes from a
> uniform. That would make the register for the uniform source be a scalar
> and thus all of its elements would come from the first qword, whereas
> for the destination the elements would be written to different qwords.
> That ends up with an instruction like this:
>
> mul(16)         g120<1>D        g9<8,8,1>D      g6<0,1,0>D { align1
>                                                             1H compacted };
>

Yeah, this should definitely violate the alignment rule on both the
destination and first source.

> However, the test still passes on my BXT so I've probably misunderstood
> something. On the other hand this does look similar to the stride you
> had in your example instruction that the simulator complained about.
>
> Given that the documentation and simulator implies this shouldn't work
> I'm happy to leave this patch and I'm not suggesting we do anything now.
>

Honestly I'm also not 100% sure if this restriction really applies to
all BXT hardware or if it's specific to some preproduction steppings --
The documentation seems to imply the former, but I guess it could be
wrong.

> Regards,
> - Neil
>
> 1. https://github.com/bpeel/piglit/blob/test-integer-multiply-uniform/tests/general/mult32.c
>
> Francisco Jerez <currojerez at riseup.net> writes:
>
>> Neil Roberts <neil at linux.intel.com> writes:
>>
>>> Are you sure this patch is necessary? The documentation for the multiply
>>> instruction on BDW+ says:
>>>
>>> SourceType : 	*D 	
>>> DestinationType : 	*D 	
>>> Project : 	EXCLUDE(CHV) 
>>>
>>> This to me implies that it should work on BXT because it doesn't say
>>> EXCLUDE(BXT).
>>>
>>
>> In fact both CHV and BXT *can* support 32x32 multiplication, that's not
>> really the reason that motivated this patch -- The problem is that 32x32
>> multiplication has QWORD execution type which has several restrictions
>> on CHV and BXT, the annoying one is:
>>
>> "CHV, BXT
>>
>> When source or destination datatype is 64b or operation is integer DWord
>> multiply, regioning in Align1 must follow these rules:
>>
>>     Source and Destination horizontal stride must be aligned to the same
>>     qword.
>>
>>     Example:
>> [...]
>>     // mul (4) r10.0<2>:d r11.0<8;4,2>:d r12.0<8;4,2>:d // Source and Destination stride must be 2 since the execution type is Qword."
>>
>> So it sounds like we could use the native 32x32 multiply with some
>> additional effort to pass the arguments with the correct stride, but
>> it's not clear to me whether the solution would be any more better than
>> splitting up the multiply into 32x16 multiplies, so doing the same as in
>> CHV and pre-Gen8 seemed like the most KISS solution for now.
>>
>>> I made a little test case and ran it on my BXT and it seems to work even
>>> without this patch. I looked at the assembler source and it is
>>> definitely doing a MUL instruction with D types for the dst and two
>>> sources.
>>>
>>> https://github.com/bpeel/piglit/blob/test-integer-multiply/tests/general/mult32.c
>>>
>> Hmm, I guess the docs could be wrong, but I'm not sure what the
>> consequences would be of violating the alignment rule, I guess the
>> failure may be non-deterministic.  What stepping of BXT did you test
>> this on?
>>
>>> Regards,
>>> - Neil
>>>
>>> Francisco Jerez <currojerez at riseup.net> writes:
>>>
>>>> AFAIK BXT has the same annoying alignment limitation as CHV on the
>>>> source register regions of 32x32 bit MULs, give it the same treatment.
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> index 244f299..fc9f007 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> @@ -3126,9 +3126,9 @@ fs_visitor::lower_integer_multiplication()
>>>>     bool progress = false;
>>>>  
>>>>     /* Gen8's MUL instruction can do a 32-bit x 32-bit -> 32-bit operation
>>>> -    * directly, but Cherryview cannot.
>>>> +    * directly, but CHV/BXT cannot.
>>>>      */
>>>> -   if (devinfo->gen >= 8 && !devinfo->is_cherryview)
>>>> +   if (devinfo->gen >= 8 && !devinfo->is_cherryview && !devinfo->is_broxton)
>>>>        return false;
>>>>  
>>>>     foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>>> -- 
>>>> 2.4.6
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150812/0525265b/attachment.sig>