[Intel-gfx] Design of a GPU profiling debug interface
Jesse Barnes
jbarnes at virtuousgeek.org
Tue Nov 9 18:15:54 CET 2010
On Sat, 30 Oct 2010 14:04:11 +0100
Peter Clifton <pcjc2 at cam.ac.uk> wrote:
> I think I'll need some help with this. I'm by no means a kernel
> programmer, so I'm feeling my way in the dark with this.
>
> I want to design an interface so I can synchronise my GPU idle flags
> polling with batchbuffer execution. At a high level, I'm imagining doing
> something like this in my application (or mesa). (Hand-wavy pseudocode.)
>
> void
> expose_event_handler (void)
> {
>     /* Trace GPU idle flags around the first frame only. */
>     static bool one_shot_trace = true;
>
>     if (one_shot_trace)
>         mesa_debug_i915_trace_idle (true);
>
>     /* RENDERING COMMANDS IN HERE */
>     SwapBuffers ();
>
>     if (one_shot_trace)
>         mesa_debug_i915_trace_idle (false);
>
>     one_shot_trace = false;
> }
>
>
> I was imagining adding a flag to the EXECBUFFER2 IOCTL, or perhaps
> adding a new EXECBUFFER3 IOCTL (which I'm playing with locally now).
> Basically I just want to flag execbuffers which I'm interested in seeing
> profiling data for.
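A flag on execbuffer2 seems like the simplest starting point. On the
userspace side it might look something like this (untested sketch; the
I915_EXEC_PROFILE name and bit position are invented, everything else
is the existing execbuffer2 interface):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include "i915_drm.h"

/* Hypothetical flag -- name and bit made up for illustration. */
#define I915_EXEC_PROFILE	(1 << 13)

static int
exec_batch(int fd, struct drm_i915_gem_exec_object2 *objects,
	   unsigned int count, unsigned int batch_len, bool profile)
{
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objects;
	execbuf.buffer_count = count;
	execbuf.batch_len = batch_len;
	execbuf.flags = I915_EXEC_RENDER;
	if (profile)
		execbuf.flags |= I915_EXEC_PROFILE;

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}

That would avoid adding a whole new EXECBUFFER3 ioctl, assuming
there's a spare flag bit to claim.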
>
> In order to get really high-resolution profiling, it would be
> advantageous to confine it to the time-period of interest otherwise the
> data rate is too high. I guesstimated about 10 MB/s for a binary
> representation of the data I'm currently polling in user-space. More
> spatial resolution would be nice too, so this could increase.
Would be very cool to be able to correlate the data with specific batches...
> I think I have a vague idea how to do the GPU and logging parts, even if
> I end up having to start the polling before the batchbuffer starts
> executing.
>
> What I've got little / no clue about is how to manage allocation of the
> memory to store the results in.
>
> Should userspace (mesa?) be passing buffers for the kernel to return
> profiling data in? Then retrieving it somehow when it "knows" the
> batchbuffer is finished? This will probably require over-allocating,
> based on a guesstimate of the memory needed to log the given batchbuffer.
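If you go the user-supplied-buffer route, the ioctl payload might look
roughly like this (pure sketch, every name here is invented):

/* Hypothetical -- just to illustrate the shape of the interface. */
struct drm_i915_gem_profile_read {
	__u64 data_ptr;	/* userspace buffer to copy samples into */
	__u32 size;	/* in: buffer size in bytes, out: bytes written */
	__u32 seqno;	/* batch the caller wants samples for */
};

The kernel would wait for that seqno to retire, copy out as much as
fits, and write back the full size needed, so userspace can tell when
its guess was too small and retry with a bigger buffer.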
>
> What about exporting via debugfs? Assuming the above code fragment, we
> could leave the last "frame" of polled data available, with the data
> being overwritten when the next request to start logging comes in.
> (That would perhaps require some kind of sequence number assigned if we
> have multiple batches which come under the same request... or a separate
> IOCTL to turn on / off logging).
There's also relay (formerly relayfs), which is made for high-bandwidth
kernel->user communication. I'm not sure whether it will make this any
easier, but there's documentation in the kernel tree under
Documentation/filesystems/relay.txt.
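The kernel-side boilerplate for relay is pretty small; roughly (sketch
only, assuming a debugfs directory to hang the channel files off):

#include <linux/relay.h>
#include <linux/debugfs.h>
#include <linux/errno.h>

/* Expose each per-cpu relay buffer as a debugfs file. */
static struct dentry *
create_buf_file_handler(const char *filename, struct dentry *parent,
			int mode, struct rchan_buf *buf, int *is_global)
{
	return debugfs_create_file(filename, mode, parent, buf,
				   &relay_file_operations);
}

static int remove_buf_file_handler(struct dentry *dentry)
{
	debugfs_remove(dentry);
	return 0;
}

static struct rchan_callbacks profile_relay_callbacks = {
	.create_buf_file = create_buf_file_handler,
	.remove_buf_file = remove_buf_file_handler,
};

static struct rchan *profile_chan;

static int i915_profile_relay_init(struct dentry *debugfs_root)
{
	/* 64 sub-buffers of 64KB each gives the reader some slack. */
	profile_chan = relay_open("i915_profile", debugfs_root,
				  64 * 1024, 64,
				  &profile_relay_callbacks, NULL);
	return profile_chan ? 0 : -ENOMEM;
}

Logging a sample is then just relay_write(profile_chan, &sample,
sizeof(sample)), and relay takes care of the buffering and the
read/mmap side of the files.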
A ring buffer with the last N timestamps might also be a good way of
exposing things. Having more than one entry available means that if
userspace didn't get scheduled at the right time it would still have a
good chance of getting all the data it missed since the last read.
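Roughly what I mean (names invented, sketch only):

#include <linux/types.h>

/* One record per poll; the seqno ties it back to a batch. */
struct idle_sample {
	u32 seqno;	/* request the GPU was executing */
	u32 idle_flags;	/* snapshot of the unit idle bits */
	u64 ns;		/* timestamp */
};

#define N_SAMPLES 1024	/* power of two */

struct idle_ring {
	struct idle_sample samples[N_SAMPLES];
	u64 head;	/* total samples ever written */
};

static void idle_ring_add(struct idle_ring *ring,
			  const struct idle_sample *s)
{
	ring->samples[ring->head & (N_SAMPLES - 1)] = *s;
	ring->head++;
}

Userspace remembers the head value from its previous read; if the new
head is more than N_SAMPLES ahead it knows it lost samples, otherwise
it can copy everything it missed.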
>
> Also... I'm not sure how the locking would work if userspace is reading
> out the debugfs file whilst another frame is being executed. (We'd
> probably need a secondary logging buffer allocated in that case.)
The kernel implementation of the read() side of the file could do some
locking to prevent new data from corrupting a read in progress.
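E.g. something along these lines, assuming the ring sketched above and
a mutex shared with the sampling code:

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/uaccess.h>

static DEFINE_MUTEX(idle_log_mutex);	/* also taken by the sampler */

static ssize_t
idle_log_read(struct file *file, char __user *ubuf, size_t count,
	      loff_t *ppos)
{
	struct idle_ring *ring = file->private_data;
	ssize_t ret;

	/* Keep the sampler out while we copy, so it can't overwrite
	 * entries in the middle of a read. */
	mutex_lock(&idle_log_mutex);
	ret = simple_read_from_buffer(ubuf, count, ppos, ring->samples,
				      sizeof(ring->samples));
	mutex_unlock(&idle_log_mutex);

	return ret;
}

That works if the sampler runs in process context (a polling kthread
or workqueue); if samples were ever logged from interrupt context it
would need to be a spinlock instead.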
--
Jesse Barnes, Intel Open Source Technology Center