[Intel-gfx] [PATCH 2/2] drm/i915/tracepoints: Remove DRM_I915_LOW_LEVEL_TRACEPOINTS Kconfig option
Chris Wilson
chris at chris-wilson.co.uk
Tue Jun 26 11:48:51 UTC 2018
Quoting Tvrtko Ursulin (2018-06-26 12:24:51)
>
> On 26/06/2018 11:55, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-26 11:46:51)
> >>
> >> On 25/06/2018 21:02, Chris Wilson wrote:
> >>> If we know what is wanted, can we define that better in terms of
> >>> dma_fence and leave lowlevel for debugging (or think of how we achieve
> >>> the same with generic bpf? kprobes)? Hmm, I wonder how far we can push
> >>> that.
> >>
> >> What is wanted is, for instance, to be able to take trace.pl on any
> >> kernel anywhere and have it deduce/draw the exact metrics/timeline of
> >> command submission for a workload.
> >>
> >> At the moment, without low level tracepoints and without the
> >> intel_engine_notify tweak, how close it can get is workload dependent.
> >
> > Interjecting with what dma-fence already has (or we could use); not sure
> > how well userspace can actually map it to their timelines.
> >>
> >> So a set of tracepoints to allow drawing the timeline:
> >>
> >> 1. request_queue (or _add)
> > dma_fence_init
> >
> >> 2. request_submit
> >
> >> 3. intel_engine_notify
> > For obvious reasons, no match in dma_fence.
> >
> >> 4. request_in
> > dma_fence_emit
> >
> >> 5. request_out
> > dma_fence_signal (similar, not quite, we would have to force irq
> > signaling).
>
> Yes, not quite the same due to the potential time shift between the user
> interrupt and the dma_fence_signal call via different paths.
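For reference, everything on the dma_fence side of that mapping is one
shared event class keyed on context/seqno, roughly along these lines
(paraphrased from memory of include/trace/events/dma_fence.h, so treat the
exact fields and layout as approximate rather than authoritative):

	/*
	 * Sketch of the shared event class behind the dma_fence_*
	 * tracepoints.  Every event -- init, emit, signaled, etc. --
	 * reports the same tuple, so a timeline tool has to correlate
	 * requests on (context, seqno) and nothing else.
	 */
	DECLARE_EVENT_CLASS(dma_fence,
		TP_PROTO(struct dma_fence *fence),
		TP_ARGS(fence),

		TP_STRUCT__entry(
			__string(driver, fence->ops->get_driver_name(fence))
			__string(timeline, fence->ops->get_timeline_name(fence))
			__field(unsigned int, context)
			__field(unsigned int, seqno)
		),

		TP_fast_assign(
			__assign_str(driver, fence->ops->get_driver_name(fence));
			__assign_str(timeline, fence->ops->get_timeline_name(fence));
			__entry->context = fence->context;
			__entry->seqno = fence->seqno;
		),

		TP_printk("driver=%s timeline=%s context=%u seqno=%u",
			  __get_str(driver), __get_str(timeline),
			  __entry->context, __entry->seqno)
	);

	DEFINE_EVENT(dma_fence, dma_fence_init,
		TP_PROTO(struct dma_fence *fence), TP_ARGS(fence));
	DEFINE_EVENT(dma_fence, dma_fence_emit,
		TP_PROTO(struct dma_fence *fence), TP_ARGS(fence));
	DEFINE_EVENT(dma_fence, dma_fence_signaled,
		TP_PROTO(struct dma_fence *fence), TP_ARGS(fence));

i.e. there is no notion of engines or submission ports in there, only
abstract timelines identified by the (context, seqno) pair.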
>
> >
> >> With this set the above is possible and we don't need a lot of work to
> >> get there.
> >
> > From a brief glance we are missing a dma_fence_queue as a request_submit
> > replacement.
> >
> > So the next question is what information do we get from our tracepoints
> > (or more precisely, what do you use) that we lack in dma_fence?
>
> Port=%u and preemption (completed=%u) come immediately to mind. A way to
> tie them to engines would be nice, or it is all abstract timelines.
>
> Going in this direction sounds like a long detour to get where we almost
> are. I suspect you are valuing the benefit of it being generic, and hence
> a parsing tool could be cross-driver. But you can also just punt the
> "abstractising" into the parsing tool.
It's just that this is about the third time this has been raised in the
last couple of weeks, with the other two requests being from a generic
tooling pov (Eric Anholt for gnome-shell tweaking, and someone else
looking for a gpuvis-like tool). So it seems like there is interest, even
if I doubt that it'll help answer any questions beyond what you can just
extract from looking at userspace. (Imo, the only people these tracepoints
are useful for are people writing patches for the driver. For everyone
else, you can just observe system behaviour and optimise your code for
your workload. Otoh, can one trust a black box, argh.)
To have a second set of nearly equivalent tracepoints, we need a strong
justification for why we couldn't just use or extend the generic set.
Plus, I feel a lot more comfortable exporting a set of generic tracepoints
than ones where we may be leaking more knowledge of the HW than we can
reasonably expect to support for the indefinite future.
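To make "extend the generic set" a little more concrete: the kind of
addition I'd imagine lives in the generic header rather than anywhere
i915-specific. Purely a sketch -- none of this exists, and the event
names and the port field below are made up just to illustrate the shape:

	/*
	 * Hypothetical only: a queued-for-execution event as the generic
	 * stand-in for i915_request_submit, reusing the existing
	 * dma_fence event class.
	 */
	DEFINE_EVENT(dma_fence, dma_fence_queue,
		TP_PROTO(struct dma_fence *fence),
		TP_ARGS(fence)
	);

	/*
	 * Hypothetical only: an execution-start event carrying the
	 * scheduler detail (submission port, preemption count) that the
	 * plain context/seqno tuple cannot express.
	 */
	TRACE_EVENT(dma_fence_execute,
		TP_PROTO(struct dma_fence *fence, unsigned int port),
		TP_ARGS(fence, port),

		TP_STRUCT__entry(
			__field(u64, context)
			__field(unsigned int, seqno)
			__field(unsigned int, port)
		),

		TP_fast_assign(
			__entry->context = fence->context;
			__entry->seqno = fence->seqno;
			__entry->port = port;
		),

		TP_printk("context=%llu seqno=%u port=%u",
			  (unsigned long long)__entry->context,
			  __entry->seqno, __entry->port)
	);

Something like the first would give request_submit a generic equivalent;
the port/preemption detail you mention is the part that doesn't map
cleanly onto the fence abstraction and would need justifying separately.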
> >> And with the Virtual Engine it will become more interesting to have
> >> this. So if we had a bug report saying load balancing is not working
> >> well, we could just say "please run it via trace.pl --trace and attach
> >> the perf script output". That way we could easily see whether or not it
> >> is a problem in userspace behaviour or elsewhere.
> >
> > And there I was wanting a script to capture the workload so that we
> > could replay it and dissect it. :-p
>
> Depends on what level you want that. Perf script output from the above
> tracepoints would do on one level. If you wanted a higher level of capture
> to re-exercise load balancing then it wouldn't be completely enough, or at
> least a lot of guesswork would be needed.
It all depends on what level you want to optimise, is the way I look at
it. For the userspace driver, you capture the client->driver userspace API
(e.g. cairo-trace, apitrace). But for optimising scheduling layout, we
just need a workload descriptor like wsim -- with perhaps the only tweaks
being the ability to define latency/throughput metrics relevant to that
workload and to integrate with a pseudo display server. The challenge as
I see it is being able to convince the user that it is a useful diagnosis
step and being able to generate a reasonable wsim automatically.
-Chris