[Mesa-dev] [PATCH 7/7] i965/fs: Add empirically-determined instruction latencies for gen7.

Eric Anholt eric at anholt.net
Fri Dec 14 11:22:04 PST 2012


Kenneth Graunke <kenneth at whitecape.org> writes:
> On 12/07/2012 02:58 PM, Eric Anholt wrote:

>> +   case SHADER_OPCODE_TEX:
>> +   case SHADER_OPCODE_TXD:
>> +   case SHADER_OPCODE_TXF:
>> +   case SHADER_OPCODE_TXL:
>> +   case SHADER_OPCODE_TXS:
>> +      /* 18 cycles:
>> +       * mov(8)  g115<1>F   0F                              { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                              { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       *
>> +       * 697 +/- 49 cycles (min 610, n=26):
>> +       * mov(8)  g115<1>F   0F                              { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                              { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       *
>> +       * So our first texture load of the batchbuffer takes ~700
>> +       * cycles, since the caches are cold at that point.
>> +       *
>> +       * 840 +/- 92 cycles (min 720, n=25):
>> +       * mov(8)  g115<1>F   0F                              { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                              { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       *
>> +       * The second load takes only an extra ~140 cycles; after
>> +       * subtracting the ~14 cycles of the MOV's latency, that
>> +       * leaves ~130 for a cache-hot load.
>> +       *
>> +       * 683 +/- 49 cycles (min 602, n=47):
>> +       * mov(8)  g115<1>F   0F                              { align1 WE_normal 1Q };
>> +       * mov(8)  g114<1>F   0F                              { align1 WE_normal 1Q };
>> +       * send(8) g4<1>UW    g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       * send(8) g50<1>UW   g114<8,8,1>F
>> +       *   sampler (10, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
>> +       * mov(8)  null       g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       *
>> +       * The unit appears to be pipelined: this matches the
>> +       * cache-cold case despite there being two loads here.  If
>> +       * you replace the g4 source of the MOV to null with g50,
>> +       * it's still 693 +/- 52 cycles (n=39).
>> +       *
>> +       * So, take some number between the cache-hot 140 cycles and the
>> +       * cache-cold 700 cycles.  No particular tuning was done on this.
>> +       *
>> +       * I haven't done significant testing of the non-TEX opcodes.  TXL at
>> +       * least looked about the same as TEX.
>> +       */
>> +      latency = 200;
>> +      break;
>> +
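(To spell out the arithmetic the comment above is doing -- a worked
sketch with made-up names, not code from the patch:

   int one_load_cycles  = 697;  /* send + dependent MOV, cache-cold */
   int two_loads_cycles = 840;  /* two sends + dependent MOVs       */
   int mov_latency      = 14;   /* cost of the MOV itself           */

   /* incremental cost of the second, cache-hot load: ~129 cycles */
   int hot_load_cycles = two_loads_cycles - one_load_cycles - mov_latency;

which is where the ~130 cache-hot figure comes from, and why 200 is a
compromise between that and the ~700 cache-cold first load.)
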
>> +   case FS_OPCODE_VARYING_PULL_CONSTANT_LOAD:
>> +   case FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD:
>> +      /* Testing was done using varying-index pull constants:
>> +       *
>> +       * 16 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F                    { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1                  { align1 WE_normal 1Q };
>> +       *
>> +       * ~480 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F                    { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1                  { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       *
>> +       * ~620 cycles:
>> +       * mov(8)  g4<1>D  g2.1<0,1,0>F                    { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1                  { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       * send(8) g4<1>F  g4<8,8,1>D
>> +       *   data (9, 2, 3) mlen 1 rlen 1                  { align1 WE_normal 1Q };
>> +       * mov(8)  null    g4<8,8,1>F                      { align1 WE_normal 1Q };
>> +       *
>> +       * So, if it's cache-hot, it's about 140 cycles.  If it's
>> +       * cache-cold, it's about 460.  We expect to mostly be
>> +       * cache hot, so pick something closer to that end.
>> +       */
>> +      latency = 200;
>> +      break;
>
> Painful.  Your "we expect to mostly be cache hot" comment makes sense, 
> except that Ivybridge's caches are awful when the same cacheline is 
> accessed within 16 cycles or so.
>
> I'd really love to see some timing data on using LD messages (to get the 
> L1 and L2 caches).  See my old patch that we couldn't justify:
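
(For context on how these values get used: the latency picked in each
case feeds the critical-path numbers the instruction scheduler sorts
by.  A minimal sketch -- a generic list-scheduling shape with invented
names, not the actual Mesa scheduler code:

   #include <algorithm>
   #include <vector>

   struct sched_node {
      int latency;                        /* e.g. the 200 chosen above */
      std::vector<sched_node *> children; /* consumers of this result  */
      int delay;                          /* 0 until computed          */
   };

   /* Critical path from this instruction to the end of the block:
    * its own latency plus the worst delay among its consumers.
    */
   static int compute_delay(sched_node *n)
   {
      if (n->delay == 0) {
         int worst = 0;
         for (sched_node *child : n->children)
            worst = std::max(worst, compute_delay(child));
         n->delay = n->latency + worst;
      }
      return n->delay;
   }

Underestimating a send's latency leaves the EU stalled waiting on its
destination; overestimating it inflates the critical path and crowds
out other candidates, which is why a value between the cache-hot and
cache-cold measurements is a reasonable compromise.)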

I think we'll probably only justify this one through whole-app testing.
The uniform load we're using now *is* faster (480/620 cycles for 1 or 2
loads vs 697/840 using texturing), as long as you don't hit the
cacheline-reuse bug Ken mentioned above.
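
(Collecting the measurements above in one place, in cycles, for the
sequences that end in a dependent MOV:

                           1 load   2 loads   2nd-load increment
   uniform (data cache)    ~480     ~620      ~140
   texture (sampler)       ~697     ~840      ~143

The data-cache path wins mainly on the cache-cold first load; once
warm, the per-load cost of the two paths is nearly identical.)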