[Mesa-dev] [PATCH 7/7] i965/fs: Add empirically-determined instruction latencies for gen7.
Eric Anholt
eric at anholt.net
Fri Dec 14 11:22:04 PST 2012
Kenneth Graunke <kenneth at whitecape.org> writes:
> On 12/07/2012 02:58 PM, Eric Anholt wrote:
>> + case SHADER_OPCODE_TEX:
>> + case SHADER_OPCODE_TXD:
>> + case SHADER_OPCODE_TXF:
>> + case SHADER_OPCODE_TXL:
>> + case SHADER_OPCODE_TXS:
>> + /* 18 cycles:
>> + * mov(8) g115<1>F 0F { align1 WE_normal 1Q };
>> + * mov(8) g114<1>F 0F { align1 WE_normal 1Q };
>> + * send(8) g4<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + *
>> + * 697 +/- 49 cycles (min 610, n=26):
>> + * mov(8) g115<1>F 0F { align1 WE_normal 1Q };
>> + * mov(8) g114<1>F 0F { align1 WE_normal 1Q };
>> + * send(8) g4<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + *
>> + * So our first texture load of the batchbuffer takes ~700 cycles of
>> + * latency, since the caches are cold at that point.
>> + *
>> + * 840 +/- 92 cycles (min 720, n=25):
>> + * mov(8) g115<1>F 0F { align1 WE_normal 1Q };
>> + * mov(8) g114<1>F 0F { align1 WE_normal 1Q };
>> + * send(8) g4<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + * send(8) g4<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + *
>> + * On the second load, it takes just an extra ~140 cycles, and after
>> + * accounting for the 14 cycles of the MOV's latency, that makes ~130.
>> + *
>> + * 683 +/- 49 cycles (min = 602, n=47):
>> + * mov(8) g115<1>F 0F { align1 WE_normal 1Q };
>> + * mov(8) g114<1>F 0F { align1 WE_normal 1Q };
>> + * send(8) g4<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + * send(8) g50<1>UW g114<8,8,1>F
>> + * sampler (10, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + *
>> + * The unit appears to be pipelined, since this matches up with the
>> + * cache-cold case, despite there being two loads here. If you replace
>> + * the g4 in the MOV to null with g50, it's still 693 +/- 52 (n=39).
>> + *
>> + * So, take some number between the cache-hot 140 cycles and the
>> + * cache-cold 700 cycles. No particular tuning was done on this.
>> + *
>> + * I haven't done significant testing of the non-TEX opcodes. TXL at
>> + * least looked about the same as TEX.
>> + */
>> + latency = 200;
>> + break;
>> +
>> + case FS_OPCODE_VARYING_PULL_CONSTANT_LOAD:
>> + case FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD:
>> + /* testing using varying-index pull constants:
>> + *
>> + * 16 cycles:
>> + * mov(8) g4<1>D g2.1<0,1,0>F { align1 WE_normal 1Q };
>> + * send(8) g4<1>F g4<8,8,1>D
>> + * data (9, 2, 3) mlen 1 rlen 1 { align1 WE_normal 1Q };
>> + *
>> + * ~480 cycles:
>> + * mov(8) g4<1>D g2.1<0,1,0>F { align1 WE_normal 1Q };
>> + * send(8) g4<1>F g4<8,8,1>D
>> + * data (9, 2, 3) mlen 1 rlen 1 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + *
>> + * ~620 cycles:
>> + * mov(8) g4<1>D g2.1<0,1,0>F { align1 WE_normal 1Q };
>> + * send(8) g4<1>F g4<8,8,1>D
>> + * data (9, 2, 3) mlen 1 rlen 1 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + * send(8) g4<1>F g4<8,8,1>D
>> + * data (9, 2, 3) mlen 1 rlen 1 { align1 WE_normal 1Q };
>> + * mov(8) null g4<8,8,1>F { align1 WE_normal 1Q };
>> + *
>> + * So, if it's cache-hot, it's about 140. If it's cache-cold, it's
>> + * about 460. We expect to mostly be cache hot, so pick something more
>> + * in that direction.
>> + */
>> + latency = 200;
>> + break;
>
> Painful. Your "we expect to mostly be cache hot" comment makes sense,
> except that Ivybridge's caches are awful when the same cacheline is
> accessed within 16 cycles or so.
>
> I'd really love to see some timing data on using LD messages (to get the
> L1 and L2 caches). See my old patch that we couldn't justify:
I think we'll probably only justify this one through whole-app testing.
The uniform load we're using now *is* faster (480/620 for 1 or 2 loads
vs 697/840 using texturing), as long as you don't hit the bug