[Mesa-dev] [PATCH v2 000/103] i965 Haswell ARB_gpu_shader_fp64 / OpenGL 4.0

Samuel Iglesias Gonsálvez siglesias at igalia.com
Tue Dec 13 08:01:07 UTC 2016


Hello Matt!

On Mon, 2016-12-05 at 15:21 -0800, Matt Turner wrote:
> On 10/11, Iago Toral Quiroga wrote:
> > It's been some time since
> 
> ... anyone has reviewed your patches. Sorry. :(
> 
> I'm going to review from your rebased i965-fp64-gen7-scalar-vec4-rc2
> branch. There have probably been some reorderings or other changes
> due
> to rebasing since the patches were sent, so I'm going to paste the
> list
> of patches below and then attempt to list any review comments after
> the
> patch name.
> 

Thanks a lot for the review!

> 
> A couple of patches have an extra newline in the commit message
> between
> *-by: tags. Would be nice to make a pass through and fix that.
> 
> 
> i965/nir: double/dvec2 uniforms only need to be padded to a single
> vec4 slot
> i965/vec4/nir: simplify glsl_type_for_nir_alu_type()
> i965/vec4/nir: allocate two registers for dvec3/dvec4
> i965/vec4/nir: Add bit-size information to types
> i965/vec4/nir: support doubles in ALU operations
> i965/vec4/nir: set the right type for 64-bit registers
> i965/vec4/nir: fix emitting 64-bit immediates
> i965/vec4: add support for printing DF immediates
> i965/vec4: add double/float conversion pseudo-opcodes
> 
> 	I wonder if we should allow MOV F/DF and DF/F operations in the
> 	IR and then have a lowering pass that "legalizes" them. I'm
> 	happy to leave that experiment for after this series lands.
> 

This is another way of doing it but we did not do it because we have
not seen a clear advantage on this approach.

> i965/vec4: translate d2f/f2d
> i965: add brw_vecn_grf()
> i965/vec4: set correct register regions for 32-bit and 64-bit
> i965/disasm: align16 DF source regions have a width of 2
> 
> 	It's actually kind of weird to print width and horizontal
> stride
> 	for align16 sources, since they don't exist in the instruction
> 	word. We should probably print only the vertical stride. I
> don't
> 	care if that's fixed a part of this series.
> 

Right. We updated it because it was printed but you are right.
We will do this change later in a follow-up patch.

> i965/vec4: We only support 32-bit integer ALU operations for now
> i965/vec4: add dst_null_df()
> i965/vec4: add VEC4_OPCODE_PICK_{LOW,HIGH}_32BIT opcodes
> i965/vec4: add VEC4_OPCODE_SET_{LOW,HIGH}_32BIT opcodes
> 
> 	If I understand correctly, these opcodes map to instructions
> 	like
> 
> 		mov(XXX) dst<1>UD  src<8,4,2>:UD
> 
> 	Is the exec_size 4? I ask, because if it's 8 (and the source
> 	region spans two registers and the dest region spans one)
> that's
> 	not a legal instruction. If it's 4, then it's legal.
> 

The execsize is 8 for the specific case you mention, which is emitted
in VEC4_OPCODE_PICK_{HIGH,LOW}_32BIT. When the source regions spans two
register, it is allowed that the destination region spans one
register but only in specific cases:

See HSW's PRM doc, Volume 7: 3D Media GPGPU Engine (Haswell), page
948: 

--------------------------------
A. Region Alignment Rules for Direct Register Addressing

[...]

When an instruction has a source region spanning two registers and a
destination region
contained in one register the number of elements must be the same
between two sources
and one of the following must be true:
a. The destination region is entirely contained in the lower OWord of a
register.
b. The destination region is entirely contained in the upper OWord of a
register.
c. The destination elements are evenly split between the two OWords of
a register.
--------------------------------

That mov is legal because of c) rule.

However, for VEC4_OPCODE_SET_{LOW,HIGH}_32BIT opcodes the emitted code
is slighly different:

mov(4)          dst<2>UD       src<4,4,1>UD       { align1 1N };

Where both dst and src span one register because of exec_size = 4. The
exec_size is set to 4 by the simd lowering pass because it detects a
dst that spans two registers  and a source that only spans one, which
is not allowed.

> i965/vec4: Fix DCE for VEC4_OPCODE_SET_{LOW,HIGH}_32BIT
> i965/vec4: don't copy propagate vector opcodes that operate in align1
> mode
> i965/vec4: implement double unpacking
> 
> 	This emits 
> 	
> 		MOV        dvec4_tmp, op[0]
> 		PICK_LO/HI uvec4_tmp, dvec4_tmp
> 		MOV        dst, uvec4_tmp
> 	
> 	I'm confused about the purpose of the MOVs. It seems like op[0]
> 	should already be a dvec and dst should already be a uvec.
> 

The opcodes used for this operates in align1 mode, so it can't handle
swizzles in the src nor writemasks in the dst, that's why we need the
movs, the first one handles the swizzle in the src and the second once
handles the writemask in the dst.

> i965/vec4: implement double packing
> 
> 	More or less the same thing here. Looks like we don't need all
> 	of the MOVs.
> 

Same than before.

> i965/vec4/nir: implement double comparisons
> 
> 	Trivial: A newline before the if() would be nice.
> 
> 	I have a memory of Curro telling me that the hardware maps each
> 	32-bit chunk in the dst to a single bit in the flag register.
> 	Maybe that's only on IVB, and maybe I'm misremembering. I'm
> 	concerned that while the PICK_LOW+MOV will properly handle the
> 	result that is written to the destination, the result written
> to
> 	the flag register might be incorrect.
> 
> 	My commit d9b09f8a30 fixed some problems that seems similar in
> 	my mind.
> 

As far as we know that is not what happens, and the flag register has
one bit for each logical channel (so each 64-bit chunk for DF
instructions). If that were not the case, I'd expect a lot of the tests
for doubles to fail or at least non-uniform control-flow scenarios to
fail, for which we have specific tests that are passing just fine in
both haswell and ivybridge. We will try to double-check with Curro just
in case though.

> i965/vec4: fix indentation in get_nir_src()
> i965/vec4: fix get_nir_dest() to use DF type for 64-bit destinations
> i965/vec4: make opt_vector_float ignore doubles
> i965/vec4: fix register allocation for 64-bit undef sources
> i965/vec4: Rename DF to/from F generator opcodes
> 
> 	I'm not sure replacing "float" with "single" implies that the
> 	opcodes can handle other 32-bit (integer) types, since "single"
> 	is actually the name of the "float" type in some other
> 	programming languages.
> 
> 	Maybe call them VEC4_OPCODE_TO_DOUBLE and
> 	VEC4_OPCODE_FROM_DOUBLE?
> 

OK, we are going to do it.

> i965/vec4: add helpers for conversions to/from doubles
> 
> 	Same thing here.
> 
> 	Also, same confusion about the purpose of the MOVs.
> 

OK, we will rename the helpers too.

Regarding the MOVs, it is the same than before: the opcode used for
this operates in align1 mode, so it can't handle swizzles in the src
nor writemasks in the dst, that's why we need the movs, the first one
handles the swizzle in the src and the second once handles the
writemask in the dst.

> i965/vec4: implement hardware workaround for align16 double to float
> conversion
> 
> 	This always seemed like a really strange hardware bug, and one
> 	that no one should ever hit.
> 
> 	I'd prefer that, instead of loading an immediate double and
> then
> 	performing a conversion to float, that we just convert the
> 	double to float in the compiler and emit an instruction to load
> 	that.
> 

You are right, this is a good optimization indeed. We are going to do
it.

> i965/vec4: implement d2i, d2u, i2d and u2d
> i965/vec4: implement d2b
> 
> 	Trivial: s/Curo/Curro/ in commit message.
> 
> 	Trivial: The comment says "predicated MOV", but it's actually a
> 	MOV with conditional_mod.
> 
> i965/vec4: implement fsign() for doubles
> 
> 	Trivial: v2 comment and comment in code say "predicated MOV"
> 	like previous patch
> 
> i965/vec4: fix optimize predicate for doubles
> 
> 	I guess I'm pleasantly surprised that the any/all predicates do
> 	the right thing in the presence of doubles...
> 
> 	(Are you sure? :)
> 

Well, we have not found any cases in piglit or CTS that fail here.
Following our reasoning that the flag register has 1 bit for each 64-
bit element we suppose it is not surprising that this works with
doubles...

> i965/vec4: add a helper function to create double immediates
> 
> 	Can leave for later: Shouldn't we use the DIM instruction (on
> 	HSW)?
> 
> 	I'm not sure if this should be fixed now or later, but
> shouldn't
> 	we use NibCtrl on these two instructions instead of
> 	force_writemask_all? I think this is a case where NibCtrl is
> 	useful.
> 

Yes, we can use DIM instruction here. We are going to write a follow-up 
patch for it.

Regarding the use of NibCtrl on these two instructions, it won't work:
We are copying two 32-bit values (one of the 32 LSB and another
for the 32 MSB) and NibCtrl can only be used with 64-bit data.

However, Nibctrl would be used by the instruction that uses the DF
immediate as a source.

> i965/vec4: use the new helper function to create double immediates
> i965: move exec_size from fs_instruction to backend_instruction
> i965/vec4: fix size_written for doubles
> i965/vec4: fix regs_read() for doubles
> i965/vec4: use the IR's execution size
> i965/vec4: dump the instruction execution size
> 
> 
> 
> ==== I made it to here today. ====
> 
> Anything with no or trivial feedback is
> 
> Reviewed-by: Matt Turner <mattst88 at gmail.com>
> 

Thanks a lot again!

Sam


> I will continue with these tomorrow:
> 
> i965/vec4: handle 32 and 64 bit channels in liveness analysis
> i965/vec4: add a horiz_offset() helper
> i965: move the group field from fs_inst to backend_instruction.
> i965/vec4: add a SIMD lowering pass
> i965/vec4: make the generator set correct NibCtrl for SIMD4 DF
> instructions
> i965/vec4: dump NibCtrl for instructions with execsize != 8
> i965/disasm: print NibCtrl for instructions with execsize < 8
> i965/vec4: teach CSE about exec_size, group and doubles
> i965/vec4: teach cmod propagation about different execution sizes
> i965/vec4: split double-precision bcsel
> i965/vec4: add a scalarization pass for double-precision instructions
> i965/vec4: translate 64-bit swizzles to 32-bit
> i965/vec4: implement access to DF source components Z/W
> i965/disasm: fix subreg for dst in Align16 mode
> i965/vec4: teach register coalescing about 64-bit
> i965/vec4: fix pack_uniform_registers for doubles
> i965/vec4: fix indentation in pack_uniform_registers
> i965/vec4: Skip swizzle to subnr in 3src instructions with DF
> operands
> i965/vec4/nir: do not emit 64-bit MAD
> i965/vec4: do not emit 64-bit MAD
> i965/vec4: support multiple dispatch widths and groups in the IR
> builder.
> i965/vec4: Add a shuffle_64bit_data helper
> i965/vec4: Fix UBO loads for 64-bit data
> i965/vec4: Fix SSBO loads for 64-bit data
> i965/vec4: Fix SSBO stores for 64-bit data
> i965/vec4: don't constant propagate 64-bit immediates
> i965/vec4: prevent copy-propagation from values with a different type
> size
> i965/vec4: Prevent copy propagation from violating pre-gen8
> restrictions
> i965/vec4: don't propagate single-precision uniforms into 4-wide
> instructions
> i965/vec4: don't copy propagate misaligned registers
> i965/vec4: extend the DWORD multiply DepCtrl restriction to all gen8
> platforms
> i965/vec4: Do not use DepCtrl with 64-bit instructions
> i965/vec4: do not split scratch read/write opcodes
> i965/vec4: fix scratch offset for 64bit data
> i965/vec4: fix scratch reads for 64bit data
> i965/vec4: fix scratch writes for 64bit data
> i965/vec4: fix move_uniform_array_access_to_pull_constant() for 64-
> bit data
> i965/vec4: fix indentation in move_push_constants_to_pull_constants()
> i965/vec4: fix move_push_constants_to_pull_constants() for 64-bit
> data
> i965/vec4: make emit_pull_constant_load support 64-bit loads
> i965/vec4: fix indentation in lower_attributes_to_hw_regs()
> i965/vec4: fix attribute setup for doubles
> i965/vec4: fix store output for 64-bit types
> i965/vec4/gs: fix input loading for 64bit data
> i965/vec4/tcs: fix input loading for 64-bit data
> i965/vec4/tcs: fix outputs for 64-bit data
> i965/vec4/tes: fix input loading for 64bit data types
> i965/vec4/tes: fix setup_payload() for 64bit data types
> i965/vec4/tes: consider register offsets during attribute setup
> i965/vec4: dump subnr for FIXED_GRF
> i965/vec4: split instructions that read 64-bit interleaved attributes
> i965/vec4/scalarize_df: do not scalarize swizzles that we can support
> natively
> i965/vec4/scalarize_df: support more swizzles via vstride=0
> i965/vec4: prevent src/dst hazards during 64-bit register allocation
> i965/vec4: run scalarize_df() after spilling
> i965/vec4: support basic spilling of 64-bit registers
> i965/vec4: avoid spilling of registers that mix 32-bit and 64-bit
> access
> i965/vec4: prevent spilling of DOUBLE_TO_SINGLE destination
> i965/vec4: adjust spilling costs for 64-bit registers.
> i965/vec4: enable ARB_gpu_shader_fp64 for Haswell
> i965/gen7: expose OpenGL 4.0 on Haswell
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20161213/ec764b80/attachment.sig>


More information about the mesa-dev mailing list