[Mesa-dev] [PATCH 6/6] i965/gen7: Add instruction latency estimates for untyped atomics and reads.

Fri Nov 1 11:02:38 PDT 2013

On Fri, Nov 1, 2013 at 10:31 AM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 29 October 2013 16:37, Francisco Jerez <currojerez at riseup.net> wrote:
>>
>> The latency information has been obtained empirically from
>> measurements taken on Haswell and Ivy Bridge.
>> ---
>>  .../drivers/dri/i965/brw_schedule_instructions.cpp | 41 ++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> index 944b5c8..cbfaabe 100644
>> --- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
>> @@ -329,6 +329,47 @@ schedule_node::set_latency_gen7(bool is_haswell)
>>        latency = 200;
>>        break;
>>
>> +   case SHADER_OPCODE_UNTYPED_ATOMIC:
>> +      /* Test code:
>> +       *   mov(8)    g112<1>ud       0x00000000ud       { align1 WE_all 1Q };
>> +       *   mov(1)    g112.7<1>ud     g1.7<0,1,0>ud      { align1 WE_all };
>> +       *   mov(8)    g113<1>ud       0x00000000ud       { align1 WE_normal 1Q };
>> +       *   send(8)   g4<1>ud         g112<8,8,1>ud
>> +       *             data (38, 5, 6) mlen 2 rlen 1      { align1 WE_normal 1Q };
>> +       *
>> +       * Running it 100 times as fragment shader on a 128x128 quad
>> +       * gives an average latency of 13867 cycles per atomic op,
>> +       * standard deviation 3%.  Note that this is a rather
>> +       * pessimistic estimate, the actual latency in cases with few
>> +       * collisions between threads and favorable pipelining has been
>> +       * seen to be reduced by a factor of 100.
>> +       */
>> +      latency = 14000;
>
>
> Wow, that's a really huge latency.  Given your argument in the comment, I
> suspect that in practice, shaders that use atomic counters are going to be a
> lot closer to the "few collisions between threads and favorable pipelining"
> case than they are going to be to this pessimistic estimate.  Personally,
> I'd be inclined to make the latency the same as
> SHADER_OPCODE_UNTYPED_SURFACE_READ.
>
> But I'm not an expert on scheduling latencies so I'll defer to Eric and
> Matt.  Consider this patch:

That seems reasonable to me. Once the latency is an order of magnitude
larger than that of any other instruction, its exact value more or less
stops mattering for scheduling purposes.
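
To illustrate what I mean, here's a toy sketch. It is not the code in
brw_schedule_instructions.cpp: ToyInst, toy_schedule and toy_order are
made-up names, the 14-cycle ALU latency is an assumed value, and the
priority rule (longest latency-weighted path first) is just a textbook
list-scheduling heuristic picked for the example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical toy model for illustration; not Mesa's scheduler.
struct ToyInst {
    int latency;            // cycles until the result can be consumed
    std::vector<int> deps;  // indices of earlier instructions it reads
};

// Greedy list scheduling: among the instructions whose inputs have all
// been issued, issue the one with the longest latency-weighted path to
// the end of the block (ties go to the lower index).
std::vector<int> toy_schedule(const std::vector<ToyInst> &insts)
{
    const int n = static_cast<int>(insts.size());

    // delay[i] = latency of i plus the largest delay of any consumer.
    // Walking backwards works because deps always point at lower indices.
    std::vector<int> delay(n, 0);
    for (int i = n - 1; i >= 0; i--) {
        delay[i] = std::max(delay[i], insts[i].latency);
        for (int d : insts[i].deps)
            delay[d] = std::max(delay[d], insts[d].latency + delay[i]);
    }

    std::vector<int> order;
    std::vector<char> done(n, 0);
    while (static_cast<int>(order.size()) < n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i])
                continue;
            const bool ready =
                std::all_of(insts[i].deps.begin(), insts[i].deps.end(),
                            [&](int d) -> bool { return done[d] != 0; });
            if (ready && (best < 0 || delay[i] > delay[best]))
                best = i;
        }
        done[best] = 1;
        order.push_back(best);
    }
    return order;
}

// Small made-up block: one atomic whose result is only needed at the
// end, plus a short chain of ALU ops with an assumed 14-cycle latency.
std::vector<int> toy_order(int atomic_latency)
{
    const std::vector<ToyInst> block = {
        {atomic_latency, {}},  // 0: untyped atomic
        {14, {}},              // 1: ALU op
        {14, {1}},             // 2: ALU op reading 1
        {14, {0, 2}},          // 3: consumes the atomic result and 2
    };
    return toy_schedule(block);
}
```

In this toy model toy_order(14000) and toy_order(1400) produce the same
issue order (the atomic goes first either way), while toy_order(14)
issues the head of the ALU chain before the atomic. The real scheduler
uses different heuristics, so treat this only as a picture of why the
exact number matters less once it dwarfs everything else.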

Either way:
Reviewed-by: Matt Turner <mattst88 at gmail.com>