[Mesa-dev] Perfetto CPU/GPU tracing

Thu Feb 18 20:07:48 UTC 2021

On 18/02/2021 20:26, Tamminen, Eero T wrote:
> Hi,
> 
> (This isn't anymore that related to Mesa, but maybe it's still of
> interest.)
> 
> On Thu, 2021-02-18 at 16:40 +0100, Primiano Tucci wrote:
> 
>> On 18/02/2021 14:35, Tamminen, Eero T wrote:
> [...]
>>> It doesn't require executable code to be writable from user-space,
>>> library code can remain read-only because kernel can toggle relevant
>>> page writable for uprobe breakpoint setup and back.
>>
>> The problem is not who rewrites the .text pages (although, yes, I agree
>> that the kernel doing this is better than userspace doing it). The
>> problem is:
>>
>> 1. Losing the ability to verify the integrity of system executables.
>> tell if some malware/rootkit did alter them or uprobes did. Effectively
>> you lose the ability to verify the full chain of bootloader -> system
>> image -> file integrity.
> 
> Why you would lose it?
> 
> Integrity checks will succeed when there are no trace points enabled,
> and trace points should be enabled only when you start tracing, so you
> know what is causing integrity check failures (especially when they
> start passing again once you disable tracepoint

If you do this (suppressing integrity checks while tracing is enabled),
the message out there becomes: "if you write malware, the very first
thing you should do is enable tracing, so any system integrity check
will be suppressed" :)

Things like uprobes (i.e. anything that can dynamically alter the
execution flow of system processes) are typically available only on
engineering setups, where you have control of the device / kernel /
security settings (Yama, selinux or any other security module), not on
production devices.
I understand that the situation for most (all?) Linux-based distros is
different, as you can just sudo. But on many other embedded OSes - at
least Google ones like CrOS and Android - the security model is way
stricter.
We could argue that this is bad / undesirable / too draconian, but it is
not something that any of us has the power to change. At some point each
platform decides where it wants to be in the spectrum between "easy to
hack" and "secure for the user". The CrOS model is: you can hack as much
as you want, but you first need to re-flash it in dev-mode.

>> 2. In general, a mechanism that allows dynamic rewriting of code is a
>> wide attack surface, not welcome on production devices (for the same
>> very unlikely to fly for non-dev images IMHO. Many system processes
>> contain too sensitive information like cookie jar, oauth2 tokens etc.
> 
> Isn't there any kind of dev-mode which would be required to enable
> things that are normally disallowed?

That requires following steps that are non-trivial for non-tech-savvy
users and, more importantly, wiping the device (CrOS calls this
"power-washing") [1].
We can't ask users to reflash their device just to give us a trace when
they are experiencing problems. Many of those problems can't be
reproduced by engineers because they depend on some peculiar state the
user is in. A recent example (not related to Mesa): some users were
experiencing an extremely unresponsive (Chrome) UI. After looking at
traces, engineers figured out that the root cause (and hence the repro)
was: "you need to have a (chrome) tab whose title is long enough to
cause ellipsis and that also has an emoji in the left-most visible part.
The emoji causes invalidation of the cached font measurement (this is
the bug), which causes every UI draw to be awfully slow."
For problems like this (which are very frequent) we really need to ask 
users to give us traces. And that needs to be really a one-click thing 
for them or they will not be able to help us.

[1] https://www.chromium.org/chromium-os/chromiumos-design-docs/developer-mode
> 
> (like kernel modifying RO mapped user-space process memory pages)
> 
> 
>>
> [...]
>>> Yes, if you need more context, or handle really frequent events,
>>> static
>>> breakpoints are a better choice.
>>>
>>>
>>> In case of more frequent events, on Linux one might consider using
>>> some
>>> BPF program to process dynamic tracepoint data so that much smaller
>>> amount needs to be transferred to user-space.  But I'm not sure
>>> whether
>>> support for attaching BPF to tracepoints is in upstream Linux kernel
>>> yet.
>>
>> eBPF, which you can use in recent kernels with tracepoints, solves
>> different problem. It solves e.g., (1) dynamic filtering or (2)
>> computing aggregations from hi-freq events. It doesn't solve problems
>> like "I want to see all scheduling events and all frame-related
>> userspace instrumentation points. But given that sched events are so
>> hi-traffic I want to put them in a separate buffer, so they don't
>> clobber all the rest". Turning scheduling events into a histogram
>> (something you can do with eBPF+tracepoints) doesn't really solve cases
>> where you want to follow the full scheduling block/wake chain while some
>> userspace events taking unexpectedly long.
> 
> You could e.g. filter out all sched events except ones for the process
> you're interested about.  That should already provide huge reduction in
> amount of data, for use-cases where scheduling of rest of processes is
> of less interest.

Yeah, but in many cases you don't know upfront which sched events you
are interested in until you see the trace. On modern embedded OSes,
where everything, even fetching a font or playing a notification sound,
requires some IPC with various services, it's very hard to tell upfront
what the critical path is.
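
To make that concrete, here is a minimal sketch in Python. The Wakeup
record, the task names and the timestamps are all made up for
illustration; this is not a real trace format. It walks the wakeup chain
backwards from a task and shows that pre-filtering sched events to the
one process you care about silently drops the service that was actually
on the critical path.

```python
from collections import namedtuple

# Hypothetical, simplified record: at time `ts`, `waker` woke up `wakee`.
Wakeup = namedtuple("Wakeup", "ts waker wakee")


def wake_chain(events, target, end_ts):
    """Walk backwards from `target`: who woke it, who woke the waker, etc."""
    chain = []
    task, ts = target, end_ts
    seen = set()  # toy guard: stop once a task is visited a second time
    while task not in seen:
        seen.add(task)
        # Latest wakeup of `task` that happened at or before `ts`.
        prior = [e for e in events if e.wakee == task and e.ts <= ts]
        if not prior:
            break
        last = max(prior, key=lambda e: e.ts)
        chain.append(last)
        task, ts = last.waker, last.ts
    return chain


# Made-up scenario: the app asks a font service, which waits on storage.
events = [
    Wakeup(10, "app", "font_service"),
    Wakeup(12, "font_service", "storage_daemon"),
    Wakeup(40, "storage_daemon", "font_service"),
    Wakeup(45, "font_service", "app"),
]

# Chain computed from the unfiltered trace.
full = wake_chain(events, "app", 50)

# Same walk after keeping only the events that mention "app".
filtered = [e for e in events if "app" in (e.waker, e.wakee)]
partial = wake_chain(filtered, "app", 50)

print([e.waker for e in full])
print([e.waker for e in partial])
```

With the unfiltered events the chain reaches storage_daemon; with the
pre-filtered ones it never shows up, even though that is where the time
went.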

> 
> However, I think high frequency kernel tracing is a different use-case
> from user-space tracing, which requires its own tooling [1] (and just
> few user-space trace points to provide context for traced kernel
> activity).

I disagree. In my (mostly Android-related) experience, what engineers
need is the union of kernel (specifically scheduling) tracing AND
userspace tracing **on the same timeline**. Userspace tracing tells when
(on the timeline) something that was important for the user (or for the
app lifecycle) happened / took too much time. Kernel tracing helps to
understand the real reasons why. This is especially true for cases of
lock contention or priority inversion, where the kernel traces can
explain why things didn't happen in time, which task (and eventually
callstack, via perf_event_open) signalled the mutex that we blocked on,
and so on.
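
As a rough illustration of what "on the same timeline" buys you, here is
a small Python sketch. The Slice and Sched records, the names and the
16 ms budget are hypothetical, and it assumes both sources are
timestamped from the same clock. It interleaves userspace slices with
scheduling events and then pulls out the kernel events that fall inside
a slow userspace slice.

```python
import heapq
from collections import namedtuple

# Hypothetical, simplified records; real traces carry far more detail.
Slice = namedtuple("Slice", "start end name")   # userspace instrumentation
Sched = namedtuple("Sched", "ts task state")    # kernel scheduling event


def merged_timeline(slices, sched):
    """Interleave both sources by timestamp (inputs already time-sorted)."""
    user = ((s.start, "user", s.name) for s in slices)
    kern = ((e.ts, "kernel", f"{e.task} -> {e.state}") for e in sched)
    return list(heapq.merge(user, kern))


def sched_during(slice_, sched):
    """Kernel events whose timestamp falls inside the userspace slice."""
    return [e for e in sched if slice_.start <= e.ts <= slice_.end]


# Made-up data, timestamps in milliseconds.
slices = [Slice(100, 108, "DrawFrame"), Slice(116, 170, "DrawFrame")]
sched = [
    Sched(104, "ui_thread", "running"),
    Sched(120, "ui_thread", "blocked"),
    Sched(165, "ui_thread", "running"),
]

timeline = merged_timeline(slices, sched)

# Userspace tells us *which* frame was slow (over a 16 ms budget here)...
slow = [s for s in slices if s.end - s.start > 16]
# ...and the kernel events inside that frame hint at *why*.
reasons = sched_during(slow[0], sched)

for entry in timeline:
    print(entry)
print(reasons)
```

Either source alone gives half the picture: the slices say the second
frame was slow, the sched events say the thread sat blocked for most of
it.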

> 
> 
> 	- Eero
> 
> [1] In corporate setting I would expect this kind of latency
> investigations to be actually HW assisted, otherwise tracing itself
> disturbs the system too much.  Ultimately it could be using instruction
> branch tracing to catch *everything*

HW-assisted tracing via LBR is definitely an extremely interesting power
tool. Whether it's a must-have or a nice-to-have really depends on the
classes of problems one needs to investigate.
For system-architecture issues (interaction between N processes across
IPC, or across VMs) that level of refinement (minimal overhead) is
typically not required. For micro-architecture problems (CPU
pipeline-related, cache efficiency, branch prediction hit ratio and the
like) it is.

 > , as both ARM and x86 have HW support for that.

Not really, the situation for ARM is more complicated. IIRC, to this day
the only LBR-equivalent on ARM requires the Embedded Trace Macrocell
(ETM) hw. But ETM is very expensive in terms of silicon area and is
typically present only on pre-production / testing devices. I am not
aware of any production ARM-based SoC that ships ETM.

> 
> (Instruction branch tracing doesn't include context, but that can be
> injected separately to the data stream.  Because it catches everything,
> one can infer some of the context from the trace itself too.  I don't
> think there's any good Open Source post-processing / visualization tools
> for such data though.)
>