[Mesa-dev] Perfetto CPU/GPU tracing

Lionel Landwerlin lionel.g.landwerlin at intel.com
Sat Feb 13 01:56:03 UTC 2021


On 13/02/2021 03:38, Rob Clark wrote:
> On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
> <lionel.g.landwerlin at intel.com> wrote:
>> We're kind of in the same boat for Intel.
>>
>> Access to GPU perf counters is exclusive to a single process if you want
>> to build a timeline of the work (because preemption etc...).
> ugg, does that mean extensions like AMD_performance_monitor don't
> actually work on intel?


It works, but only a single app can use it at a time.


>
>> The best information we could add from mesa would be a timestamp of when a
>> particular drawcall started.
>> But that's pretty much what timestamp queries are.
>>
>> Were you thinking of particular GPU generated data you don't get from
>> gfx-pps?
> From the looks of it, currently I don't get *any* GPU generated data
> from gfx-pps ;-)


Maybe file a bug? The Intel driver code is here:
https://gitlab.freedesktop.org/Fahien/gfx-pps/-/blob/master/src/gpu/intel/intel_driver.cc


>
> We can ofc sample counters from a separate process as well... I have a
> curses tool (fdperf) which does this.. but running outside of gpu
> cmdstream plus counters losing context across suspend/resume makes it
> less than perfect.


Our counters are global, so to give per-application values we need to
post-process a stream of HW counter snapshots.
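
To make that post-processing concrete, here is a minimal sketch in
Python. It assumes each snapshot carries a timestamp, the ID of the
context that was running up to that snapshot, and the cumulative value
of one global counter. The field names, and the idea that every context
switch produces a snapshot, are assumptions for illustration only and
not a description of the actual HW stream format.

```python
from collections import defaultdict
from typing import NamedTuple


class Snapshot(NamedTuple):
    """One hypothetical HW counter snapshot (field names are illustrative)."""
    timestamp_ns: int
    ctx_id: int    # context that was running on the GPU up to this snapshot
    counter: int   # cumulative value of a single global counter


def per_context_totals(snapshots):
    """Attribute each counter delta to the context that ran in that interval."""
    totals = defaultdict(int)
    for prev, cur in zip(snapshots, snapshots[1:]):
        totals[cur.ctx_id] += cur.counter - prev.counter
    return dict(totals)


stream = [
    Snapshot(0, 0, 1000),    # baseline, nothing attributed yet
    Snapshot(100, 1, 1400),  # context 1 ran from t=0 to t=100
    Snapshot(250, 2, 1500),  # context 2 ran from t=100 to t=250
    Snapshot(400, 1, 2100),  # context 1 ran again
]
print(per_context_totals(stream))
```

The global counter only ever counts up; the per-application numbers
fall out of the deltas between consecutive snapshots.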


>    And something that works the same way as
> AMD_performance_monitor under the hood gives a more precise look at
> which shaders (for ex) are consuming the most cycles.


In our implementation, that precision (in particular knowing when a
drawcall ends) unfortunately comes at the cost of stalling.


>    For cases where
> we can profile a trace, frameretrace and related tools are pretty
> great.. but it would be nice to have similar visibility for actual
> games (which for me, mostly means android games, since so far no
> aarch64 steam store), but also give game developers good tools (or at
> least the same tools that they get with other closed src drivers on
> android).


Sure, but frame analysis is different from live monitoring of the system.

On Intel's HW you don't get the same level of detail in both cases, and
apart from a few timestamps, I think gfx-pps is as good as you're going
to get for live stuff.


-Lionel


>
> BR,
> -R
>
>> Thanks,
>>
>> -Lionel
>>
>>
>> On 13/02/2021 00:12, Alyssa Rosenzweig wrote:
>>> My 2c for Mali/Panfrost --
>>>
>>> For us, capturing GPU perf counters is orthogonal to rendering. It's
>>> expected (e.g. with Arm's tools) to do this from a separate process.
>>> Neither Mesa nor the DDK should require custom instrumentation for the
>>> low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
>>> Perfetto as it is. So for us I don't see the value in modifying Mesa for
>>> tracing.
>>>
>>> On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
>>>> (responding from correct address this time)
>>>>
>>>> On Fri, Feb 12, 2021 at 12:03 PM Mark Janes <mark.a.janes at intel.com> wrote:
>>>>
>>>>> I've recently been using GPUVis to look at trace events.  On Intel
>>>>> platforms, GPUVis incorporates ftrace events from the i915 driver,
>>>>> performance metrics from igt-gpu-tools, and userspace ftrace markers
>>>>> that I locally hack up in Mesa.
>>>>>
>>>> GPUVis is great. I would love to see that data combined with
>>>> userspace events without any need for local hacks. Perfetto provides
>>>> on-demand trace events with lower overhead compared to ftrace, so for
>>>> example it is acceptable to have production trace instrumentation that can
>>>> be captured without dev builds. To do that with ftrace it may require a way
>>>> to enable and disable the ftrace file writes to avoid the overhead when
>>>> tracing is not in use. This is what Android does with systrace/atrace, for
>>>> example, it uses Binder to notify processes about trace sessions. Perfetto
>>>> does that in a more portable way.
>>>>
>>>>
>>>>> It is very easy to compile the GPUVis UI.  Userspace instrumentation
>>>>> requires a single C/C++ header.  You don't have to access an external
>>>>> web service to analyze trace data (a big no-no for devs working on
>>>>> preproduction hardware).
>>>>>
>>>>> Is it possible to build and run the Perfetto UI locally?
>>>> Yes, local UI builds are possible
>>>> <https://github.com/google/perfetto/blob/5ff758df67da94d17734c2e70eb6738c4902953e/ui/README.md>.
>>>> Also confirmed with the perfetto team <https://discord.gg/35ShE3A> that
>>>> trace data is not uploaded unless you use the 'share' feature.
>>>>
>>>>
>>>>>     Can it display
>>>>> arbitrary trace events that are written to
>>>>> /sys/kernel/tracing/trace_marker ?
>>>> Yes, I believe it does support that via linux.ftrace data source
>>>> <https://perfetto.dev/docs/quickstart/linux-tracing>. We use that for
>>>> example to overlay CPU sched data to show what process is on each core
>>>> throughout the timeline. There are many ftrace event types
>>>> <https://github.com/google/perfetto/tree/5ff758df67da94d17734c2e70eb6738c4902953e/protos/perfetto/trace/ftrace>
>>>> in the perfetto protos.
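
For reference, writing a userspace marker is just a write to that file.
A minimal Python sketch follows; the helper name and the message text
are made up for illustration, and whether a given UI displays the
resulting event depends on the trace config, which is not shown here.

```python
import os

TRACE_MARKER = "/sys/kernel/tracing/trace_marker"


def write_marker(text, path=TRACE_MARKER):
    """Write one marker string; return False if the file is unavailable."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError:
        return False  # e.g. tracefs not mounted, or no permission
    try:
        # Emit the whole message in a single write() call.
        os.write(fd, text.encode())
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

Usage would be something like write_marker("mesa: shader compile
begin"), which quietly returns False on systems where the marker file
cannot be opened.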
>>>>
>>>>
>>>>> Can it be extended to show i915 and
>>>>> i915-perf-recorder events?
>>>>>
>>>> It can be extended to consume custom data sources. One way this is done is
>>>> via a bridge daemon, such as traced_probes which is responsible for
>>>> capturing data from ftrace and /proc during a trace session and sending it
>>>> to traced. traced is the main perfetto tracing daemon that notifies all
>>>> trace data sources to start/stop tracing and communicates with user tracing
>>>> requests via the 'perfetto' command.
>>>>
>>>>
>>>>
>>>>> John Bates <jbates at chromium.org> writes:
>>>>>
>>>>>> I recently opened issue 4262
>>>>>> <https://gitlab.freedesktop.org/mesa/mesa/-/issues/4262> to begin the
>>>>>> discussion on integrating perfetto into mesa.
>>>>>>
>>>>>> *Background*
>>>>>>
>>>>>> System-wide tracing is an invaluable tool for developers to find and fix
>>>>>> performance problems. The perfetto project enables a combined view of
>>>>>> trace data from kernel ftrace, GPU driver and various
>>>>>> manually-instrumented tracepoints throughout the application and
>>>>>> system. This helps developers quickly answer questions like:
>>>>>>
>>>>>>      - How long are frames taking?
>>>>>>      - What caused a particular frame drop?
>>>>>>      - Is it CPU bound or GPU bound?
>>>>>>      - Did a CPU core frequency drop cause something to go slower than
>>>>>>      usual?
>>>>>>      - Is something else running that is stealing CPU or GPU time? Could I
>>>>>>      fix that with better thread/context priorities?
>>>>>>      - Are all CPU cores being used effectively? Do I need
>>>>>>      sched_setaffinity to keep my thread on a big or little core?
>>>>>>      - What’s the latency between CPU frame submit and GPU start?
>>>>>>
>>>>>> *What Does Mesa + Perfetto Provide?*
>>>>>>
>>>>>> Mesa is in a unique position to produce GPU trace data for several GPU
>>>>>> vendors without requiring the developer to build and install additional
>>>>>> tools like gfx-pps <https://gitlab.freedesktop.org/Fahien/gfx-pps>.
>>>>>>
>>>>>> The key is making it easy for developers to use. Ideally, perfetto is
>>>>>> eventually available by default in mesa so that if your system has
>>>>>> perfetto traced running, you just need to run perfetto (perhaps along
>>>>>> with setting an environment variable) with the mesa categories to see:
>>>>>>
>>>>>>      - GPU processing timeline events.
>>>>>>      - GPU counters.
>>>>>>      - CPU events for potentially slow functions in mesa like shader
>>>>>>      compiles.
>>>>>>
>>>>>> Example of what this data might look like (with fake GPU events):
>>>>>> [image: percetto-gpu-example.png]
>>>>>>
>>>>>> *Runtime Characteristics*
>>>>>>
>>>>>>      - ~500KB additional binary size. Even with using only the basic
>>>>>>      features of perfetto, it will increase the binary size of mesa by
>>>>>>      about 500KB.
>>>>>>      - Background thread. Perfetto uses a background thread for
>>>>>>      communication with the system tracing daemon (traced) to advertise
>>>>>>      trace data and get notification of trace start/stop.
>>>>>>      - Runtime overhead when disabled is designed to be optimal with one
>>>>>>      predicted branch, typically a few CPU cycles
>>>>>>      <https://perfetto.dev/docs/instrumentation/track-events#performance>
>>>>>>      per event. While enabled, the overhead can be around 1 us per event.
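
The low cost when disabled comes from checking a per-category flag
before doing any other work. A rough Python sketch of that pattern is
below. The Category class and trace_event helper are hypothetical and
are not the perfetto or percetto API; the real SDKs do this in C/C++
with macros, which is where the "one predicted branch" figure applies.

```python
import time


class Category:
    """Hypothetical trace category; `enabled` flips when a session starts or stops."""

    def __init__(self, name):
        self.name = name
        self.enabled = False
        self.events = []


def trace_event(cat, name, make_args=None):
    # Fast path: a single flag check when tracing is off.
    # The argument callback is never invoked in that case.
    if not cat.enabled:
        return
    args = make_args() if make_args else {}
    cat.events.append((time.monotonic_ns(), cat.name, name, args))
```

Passing the arguments as a callback means the cost of building debug
annotations is only paid while a trace session is active.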
>>>>>>
>>>>>> *Integration Challenges*
>>>>>>
>>>>>>      - The perfetto SDK is C++ and designed around macros, lambdas, inline
>>>>>>      templates, etc. There are ongoing discussions on providing an official
>>>>>>      perfetto C API, but it is not yet clear when this will land on the
>>>>>>      perfetto roadmap.
>>>>>>      - The perfetto SDK is an amalgamated .h and .cc that adds up to 100K
>>>>>>      lines of code.
>>>>>>      - Anything that includes perfetto.h takes a long time to compile.
>>>>>>      - The current Perfetto SDK design is incompatible with being a shared
>>>>>>      library behind a C API.
>>>>>>
>>>>>> *Percetto*
>>>>>>
>>>>>> The percetto library <https://github.com/olvaffe/percetto> was recently
>>>>>> implemented to provide an interim C API for perfetto. It provides
>>>>>> efficient support for scoped trace events, multiple categories,
>>>>>> counters, custom timestamps, and debug data annotations. Percetto also
>>>>>> provides some features that are important to mesa, but not available
>>>>>> yet with perfetto SDK:
>>>>>>
>>>>>>      - Trace events from multiple perfetto instances in separate shared
>>>>>>      libraries (like mesa and virglrenderer) show correctly in a single
>>>>>>      process and thread view.
>>>>>>      - Counter tracks and macro API.
>>>>>>
>>>>>> Percetto is missing API for perfetto's GPU DataSource and counter
>>>>>> support, but that feature could be implemented next if it is important
>>>>>> for mesa. With the existing percetto API mesa could present GPU trace
>>>>>> data as named 'slice' events and int64_t counters with custom
>>>>>> timestamps as shown in the image above (based on this sample
>>>>>> <https://github.com/olvaffe/percetto/blob/main/examples/timestamps.c>).
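
As an illustration of presenting GPU work as slices with custom
timestamps, the Python sketch below converts begin/end GPU timestamp
ticks into nanoseconds on the CPU trace clock. The tick frequency, the
single (GPU ticks, CPU ns) sync point, and the tuple layouts are all
assumptions for illustration; it does not use the percetto API.

```python
def gpu_ticks_to_cpu_ns(ticks, sync_ticks, sync_cpu_ns, freq_hz):
    """Map a GPU timestamp onto the CPU clock using one (GPU, CPU) sync point."""
    return sync_cpu_ns + (ticks - sync_ticks) * 1_000_000_000 // freq_hz


def make_slices(queries, sync_ticks, sync_cpu_ns, freq_hz):
    """Turn (name, begin_ticks, end_ticks) tuples into (name, start_ns, duration_ns)."""
    slices = []
    for name, begin, end in queries:
        start = gpu_ticks_to_cpu_ns(begin, sync_ticks, sync_cpu_ns, freq_hz)
        stop = gpu_ticks_to_cpu_ns(end, sync_ticks, sync_cpu_ns, freq_hz)
        slices.append((name, start, stop - start))
    return slices


# Hypothetical 19.2 MHz GPU timestamp clock, synced at GPU tick 0 = 5 ms CPU time.
print(make_slices([("draw", 1920, 3840)], 0, 5_000_000, 19_200_000))
```

A real implementation would also have to deal with clock drift and
timestamp counter wraparound, which this sketch ignores.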
>>>>>>
>>>>>> *Mesa Integration Alternatives*
>>>>>>
>>>>>> Note: we have some pressing needs for performance analysis in Chrome OS,
>>>>>> so I'm intentionally leaving out the alternative of waiting for an
>>>>>> official perfetto C API. Of course, once that C API is available it
>>>>>> would become an option to migrate to it from any of the alternatives
>>>>>> below.
>>>>>>
>>>>>> Ordered by difficulty with easiest first:
>>>>>>
>>>>>>      1. Statically link with percetto as an optional external dependency
>>>>>>      (virglrenderer now has this approach
>>>>>>      <https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/480>).
>>>>>>         - Pros: API already supports most common tracing needs. Tested and
>>>>>>         used by an increasing number of CrOS components.
>>>>>>         - Cons: External dependency for optional mesa build option.
>>>>>>      2. Embed Perfetto SDK + a Percetto fork/copy.
>>>>>>         - Pros: API already supports most common tracing needs. No added
>>>>>>         external dependency for mesa.
>>>>>>         - Cons: Percetto code divergence, bug fixes need to land in two
>>>>>>         trees.
>>>>>>      3. Embed Perfetto SDK + custom C wrapper.
>>>>>>         - Pros: Tailored API for mesa's needs.
>>>>>>         - Cons: Nontrivial development efforts and maintenance.
>>>>>>      4. Generate C stubs for the Perfetto protobuf and reimplement the
>>>>>>      Perfetto SDK in C.
>>>>>>         - Pros: Tailored API for mesa's needs. Possible smaller binary
>>>>>>         impact from simpler implementation.
>>>>>>         - Cons: Significant development efforts and maintenance.
>>>>>>
>>>>>> Regardless of the integration direction, I expect we would disable
>>>>>> perfetto in the default build for now to minimize disruption.
>>>>>>
>>>>>> I like #1, because there are some nontrivial subtleties to the C wrapper
>>>>>> that provide both API conveniences and runtime performance that would
>>>>>> need to be reimplemented or maintained with the other options. I will
>>>>>> also volunteer to do #1 or #2, but I'm not sure I have time for #3 or
>>>>>> #4 :D.
>>>>>>
>>>>>> Any other thoughts on how best to integrate perfetto into mesa?
>>>>>>
>>>>>> -jb
>>>>>> _______________________________________________
>>>>>> mesa-dev mailing list
>>>>>> mesa-dev at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev



