[Mesa-dev] [RFC] ARB_shader_clock, hardware counters and i965

Connor Abbott cwabbott0 at gmail.com
Mon Oct 5 15:02:03 PDT 2015


On Mon, Oct 5, 2015 at 1:06 PM, Emil Velikov <emil.l.velikov at gmail.com> wrote:
> On 5 October 2015 at 17:11, Connor Abbott <cwabbott0 at gmail.com> wrote:
>> On Mon, Oct 5, 2015 at 11:36 AM, Emil Velikov <emil.l.velikov at gmail.com> wrote:
>>> Hi all,
>>>
>>> I am looking at ARB_shader_clock with i965 in mind.
>>>
>>> So far I've got most of the infra/plumbing, and a fancy new intrinsic :)
>>>
>>> On the hardware side, I was thinking about using the Observability
>>> Architecture (OA) counters. The fun part is that those tend to vary
>>> quite a bit based on the hardware generation. So far I'm leaning
>>> towards:
>>>  - "Count of XXX threads dispatched to EUs" for BDW and later.
>>>  - "XXX Shader Active Time" for earlier (SNB-HSW/VLV) hardware.
>>>
>>> Do these sound appropriate, or should we opt for the various knobs in
>>> 'Flexible EU event counters' ? Is there some alternative piece of
>>> hardware in i965, which I can use ?
>>>
>>>
>>> Going for OA has a small catch. Reading through the PRM, it is not
>>> obvious if one can track the same source twice (the
>>> GL_AMD_performance_monitor implementation comes to mind). I'm about to
>>> take a closer look into brw_performance_monitor.[ch] shortly, but if
>>> any gotchas/fancy interactions come to mind let me know.
>>>
>>> Thanks
>>> Emil
>>>
>>> P.S. Does anyone recall the consensus wrt adding the 2015 extensions
>>> to GL3.txt ?
>>
>> Hi Emil,
>>
>> I don't think you want to use the OA counters to implement
>> ARB_shader_clock. They're not exposed to the shader directly, AFAIK,
>> and they only measure things on a per-invocation granularity, whereas
>> the intent of ARB_shader_clock is to be able to measure the number of
>> cycles that individual operations take with very low latency. Instead,
>> you should read from the ARF performance register -- see page 822 of
>> Vol 7 ("3D Media GPGPU") of the Broadwell PRM (page 858 of the PDF)
>> for more details.
>>
> I knew that there should be a nicer piece of hardware for this, but
> could not find it looking through the spec. The timestamp register
> looks exactly like the thing we need there.
>
>> Another interesting thing is that you can atomically read from that
>> register and also get a bit that says whether there was some event,
>> such as a context switch, since the last time you read it that would
>> make your measurement invalid. It might be useful to expose this
>> through a GLSL extension as another set of overloads:
>>
>> uint64_t clockARB(out bool valid); //once we get int64 support
>> uvec2 clock2x32ARB(out bool valid);
>>
> Are you thinking about writing up another extension, or should we just
> wire things internally as someone else does it for us ? Would you have
> any preference how to handle things when a context switch has occurred
> (for the official functions) ?

I was thinking about adding a new extension; AFAIK no one else has
exposed this in an extension before, but at least Intel HW has it. If
by "the official functions" you mean the ones in ARB_shader_clock,
then they should just read the register without worrying about whether
there was a context switch or not. They should still reset the "is
this valid" state though, since that's what the HW does and it makes
sense for cases like the example below.
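
To make the read-and-reset behaviour concrete, here is a small C model
of it. This is only a sketch: struct clock_reg, clock_read() and
clock_read_valid() are hypothetical names, and the single event flag
stands in for whatever the register actually reports, not the real HW
layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a timestamp register that carries an
 * "an event happened since the last read" flag. Not the real HW layout. */
struct clock_reg {
   uint64_t ticks;   /* free-running counter */
   bool event;       /* set on e.g. a context switch, cleared by any read */
};

/* Models the proposed clockARB(out bool valid) overload: returns the
 * counter and reports whether the interval since the previous read was
 * undisturbed. */
static uint64_t
clock_read_valid(struct clock_reg *reg, bool *valid)
{
   assert(reg != NULL && valid != NULL);
   *valid = !reg->event;
   reg->event = false;   /* every read resets the state */
   return reg->ticks;
}

/* Models the plain ARB_shader_clock clockARB(): the flag is ignored, but
 * it is still reset, so a later checked read only covers the interval
 * that started at this read. */
static uint64_t
clock_read(struct clock_reg *reg)
{
   bool ignored;
   return clock_read_valid(reg, &ignored);
}
```

With this model, an event that happened before the first clock_read()
does not invalidate a measurement that starts at that read, which is the
point of having the unchecked functions reset the state too.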

>
>> and a corresponding NIR intrinsic that outputs an extra component
>> that's a boolean (i.e. 0 or ~0). That would help with implementing
>> something like INTEL_DEBUG=shader_time generically with fewer outliers
>> to throw away.
>>
> Exposing it via INTEL_DEBUG will be great, but first I'd stick to getting
> the extension bits in place.

Oh no, I meant replacing it entirely. That is, we'd have something
above Mesa or in core Mesa that inserts code like:

layout(binding = 0, std430) buffer {
   uint time[];
};

layout(binding = 1, offset = 0) uniform atomic_uint idx;

void main() {
   // using uint64_t for brevity even if we don't support it now
   uint64_t start = clockARB();
   ... //the original shader
   bool valid;
   uint64_t end = clockARB(valid);
   if (valid && end > start) {
      time[atomicCounterIncrement(idx)] = uint(end - start);
   }
}

and we could rip out all the code inside i965 to implement
INTEL_DEBUG=shader_time, which is fragile and often in the way of
refactors.
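
For reference, the filtering the inserted code does can be mirrored on
the CPU side. The following C sketch assumes the two halves come from
clock2x32ARB() (first component low 32 bits, second component high 32
bits); struct time_log, pack_clock_2x32() and record_sample() are
made-up names for illustration. The range and capacity checks are
additions that the shader snippet above does not have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Join the two halves of a clock2x32ARB()-style result:
 * lo is the first component, hi the second. */
static uint64_t
pack_clock_2x32(uint32_t lo, uint32_t hi)
{
   return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical CPU-side mirror of the time[] buffer and its counter. */
struct time_log {
   uint32_t *time;    /* storage for elapsed-cycle samples */
   size_t capacity;   /* number of slots in time[] */
   size_t count;      /* plays the role of the atomic counter */
};

/* Record end - start only if the measurement is usable, following the
 * "valid && end > start" test from the shader snippet. Returns true if
 * a sample was stored. */
static bool
record_sample(struct time_log *log, uint64_t start, uint64_t end, bool valid)
{
   assert(log != NULL && log->time != NULL);
   if (!valid || end <= start)
      return false;

   uint64_t delta = end - start;
   /* Extra guards (assumptions, not in the snippet): do not truncate a
    * delta that needs more than 32 bits, do not write past the buffer. */
   if (delta > UINT32_MAX || log->count >= log->capacity)
      return false;

   log->time[log->count++] = (uint32_t)delta;
   return true;
}
```

Dropping invalid samples at record time is what should cut down on the
outliers, since nothing disturbed by a context switch ever reaches the
buffer.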

>
> Thanks
> Emil

