[Mesa-dev] [PATCH 6/6] i965/gen7: Add instruction latency estimates for untyped atomics and reads.

Fri Nov 1 13:04:57 PDT 2013

Paul Berry <stereotype441 at gmail.com> writes:

> On 29 October 2013 16:37, Francisco Jerez <currojerez at riseup.net> wrote:
>
>> The latency information has been obtained empirically from
>> measurements taken on Haswell and Ivy Bridge.
>> ---
>>  .../drivers/dri/i965/brw_schedule_instructions.cpp | 41 ++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> index 944b5c8..cbfaabe 100644
>> --- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> @@ -329,6 +329,47 @@ schedule_node::set_latency_gen7(bool is_haswell)
>>        latency = 200;
>>        break;
>>
>> +   case SHADER_OPCODE_UNTYPED_ATOMIC:
>> +      /* Test code:
>> +       *   mov(8)    g112<1>ud       0x00000000ud       { align1 WE_all 1Q };
>> +       *   mov(1)    g112.7<1>ud     g1.7<0,1,0>ud      { align1 WE_all };
>> +       *   mov(8)    g113<1>ud       0x00000000ud       { align1 WE_normal 1Q };
>> +       *   send(8)   g4<1>ud         g112<8,8,1>ud
>> +       *             data (38, 5, 6) mlen 2 rlen 1      { align1 WE_normal 1Q };
>> +       *
>> +       * Running it 100 times as fragment shader on a 128x128 quad
>> +       * gives an average latency of 13867 cycles per atomic op,
>> +       * standard deviation 3%.  Note that this is a rather
>> +       * pessimistic estimate, the actual latency in cases with few
>> +       * collisions between threads and favorable pipelining has been
>> +       * seen to be reduced by a factor of 100.
>> +       */
>> +      latency = 14000;
>>
>
> Wow, that's a really huge latency.  Given your argument in the comment, I
> suspect that in practice, shaders that use atomic counters are going to be
> a lot closer to the "few collisions between threads and favorable
> pipelining" case than they are going to be to this pessimistic estimate.
> Personally, I'd be inclined to make the latency the same as
> SHADER_OPCODE_UNTYPED_SURFACE_READ.
>
I suspect that the most common application for atomic counters is to
generate unique indices for each fragment or vertex of each primitive,
and that typically involves hundreds of simultaneous atomic increment
ops to the same memory location that have to be serialized by the
hardware...  I'm afraid that the actual latency that we're going to see
won't be too far off from my pessimistic estimate... :/
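
To put rough numbers on that, here is a back-of-the-envelope sketch.  It
is only a toy model, not a measurement: it assumes that n atomic ops
reach the same location at once and are retired strictly one at a time,
and it takes the uncontended cost to be the measured 13867 cycles divided
by the "factor of 100" from the patch comment.  The names
mean_serialized_latency and uncontended_cycles are made up for this
example and do not exist in the driver.

```cpp
#include <cstdint>

// Toy model (an assumption for illustration, not measured hardware
// behaviour): n_ops atomic ops hit the same location simultaneously
// and are serialized, each taking service_cycles.  The k-th op
// (1-based) completes after k * service_cycles, so the mean latency
// over all n_ops is service_cycles * (n_ops + 1) / 2.
constexpr std::uint64_t
mean_serialized_latency(std::uint64_t n_ops, std::uint64_t service_cycles)
{
   return service_cycles * (n_ops + 1) / 2;
}

// Hypothetical uncontended cost: the measured average (13867 cycles)
// divided by the "factor of 100" from the patch comment, rounded down.
constexpr std::uint64_t uncontended_cycles = 13867 / 100;

// A single op with no contention just pays the uncontended cost.
static_assert(mean_serialized_latency(1, uncontended_cycles) ==
              uncontended_cycles, "no contention, no queueing");
```

Under these assumptions, mean_serialized_latency(199, uncontended_cycles)
comes out close to the 14000 cycles used in the patch, so a couple of
hundred colliding increments would be enough to land near the
pessimistic figure.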

> But I'm not an expert on scheduling latencies so I'll defer to Eric and
> Matt.  Consider this patch:
>
> Acked-by: Paul Berry <stereotype441 at gmail.com>
>
> I made comments on all the other patches in the series except patch 3.
> Patch 3 is:
>
> Reviewed-by: Paul Berry <stereotype441 at gmail.com>
>

Thank you for all your comments, Paul.  I've updated my atomic counters
branch [1] to add your Reviewed-by tags and to take your suggestions
into account.

[1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=atomic-counters

>
>> +      break;
>> +
>> +   case SHADER_OPCODE_UNTYPED_SURFACE_READ:
>> +      /* Test code:
>> +       *   mov(8)    g112<1>UD       0x00000000UD       { align1 WE_all 1Q };
>> +       *   mov(1)    g112.7<1>UD     g1.7<0,1,0>UD      { align1 WE_all };
>> +       *   mov(8)    g113<1>UD       0x00000000UD       { align1 WE_normal 1Q };
>> +       *   send(8)   g4<1>UD         g112<8,8,1>UD
>> +       *             data (38, 6, 5) mlen 2 rlen 1      { align1 WE_normal 1Q };
>> +       *   .
>> +       *   . [repeats 8 times]
>> +       *   .
>> +       *   mov(8)    g112<1>UD       0x00000000UD       { align1 WE_all 1Q };
>> +       *   mov(1)    g112.7<1>UD     g1.7<0,1,0>UD      { align1 WE_all };
>> +       *   mov(8)    g113<1>UD       0x00000000UD       { align1 WE_normal 1Q };
>> +       *   send(8)   g4<1>UD         g112<8,8,1>UD
>> +       *             data (38, 6, 5) mlen 2 rlen 1      { align1 WE_normal 1Q };
>> +       *
>> +       * Running it 100 times as fragment shader on a 128x128 quad
>> +       * gives an average latency of 583 cycles per surface read,
>> +       * standard deviation 0.9%.
>> +       */
>> +      latency = is_haswell ? 300 : 600;
>> +      break;
>> +
>>     default:
>>        /* 2 cycles:
>>         * mul(8) g4<1>F g2<0,1,0>F      0.5F            { align1 WE_normal 1Q };
>> --
>> 1.8.3.4
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>