[Mesa-dev] [PATCH 00/22] RFC: Batchbuffer Logger for Intel GPU

Wed Sep 27 11:37:47 UTC 2017

Hi,

 If we just want to send the trace data to the kernel, I can do that very easily: just make a GEM BO consisting of dword pairs of (TraceCallID, BatchbufferOffset). That will be a small buffer and, together with the apitrace file, will give complete data.
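
As a rough illustration of the buffer contents described above, here is a minimal sketch that packs (TraceCallID, BatchbufferOffset) pairs as 32-bit dwords. The little-endian byte order, the function names, and the sample values are assumptions made for the example; the thread does not specify them, and this only builds the bytes that would be copied into the GEM BO.

```python
import struct

def pack_trace_markers(markers):
    """Pack (trace_call_id, batch_offset) pairs as two 32-bit dwords each.

    Little-endian layout is an assumption for this sketch.
    """
    blob = bytearray()
    for call_id, offset in markers:
        blob += struct.pack("<II", call_id, offset)
    return bytes(blob)

def unpack_trace_markers(blob):
    """Recover the list of (trace_call_id, batch_offset) pairs."""
    return list(struct.iter_unpack("<II", blob))

# Hypothetical markers: apitrace call ID and the byte offset in the
# batchbuffer where that call's commands begin.
markers = [(1042, 0), (1043, 96), (1051, 640)]
blob = pack_trace_markers(markers)
print(len(blob), unpack_trace_markers(blob))
```

Each marker costs 8 bytes, so even a batch with thousands of GL calls stays in the tens of kilobytes.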

 I could probably make such a dedicated tool quite quickly, or add that functionality to the logger.

-Kevin

-----Original Message-----
From: Chris Wilson [mailto:chris at chris-wilson.co.uk] 
Sent: Wednesday, September 27, 2017 1:21 PM
To: Rogovin, Kevin <kevin.rogovin at intel.com>; mesa-dev at lists.freedesktop.org
Subject: RE: [Mesa-dev] [PATCH 00/22] RFC: Batchbuffer Logger for Intel GPU

Quoting Rogovin, Kevin (2017-09-27 07:53:29)
> Hi,
> 
>  Right now the logger walks the batchbuffer just after the kernel returns from the ioctl, updating its internal view of the GPU state as it walks and emitting the data to the log file. The log for a single batchbuffer is (essentially) just a list of call IDs from the apitrace, together with "where in the batchbuffer" each call started.
> 
>  I confess that I had not realized the potential application for using something like this to help diagnose GPU hangs! I think it is a really good idea. What I could do is the following (and it is not terribly hard to do):
> 
>    1. -BEFORE- issuing the ioctl, the logger walks just the api markers in the log of the batchbuffer, makes a new GEM BO filled with apitrace data (call ID, and maybe GL function data), and modifies the ioctl to have an extra buffer.

Yes. With the current intel_batchbuffer.c this should be relatively easy (I suggest you limit yourself to recent kernels for that simplification); see EXEC_BATCH_FIRST and remember to mark the trace bo as EXEC_OBJECT_CAPTURE.
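
To make the suggestion above concrete, here is a plain-data sketch of an execbuf object list with the batch first and the trace BO appended with the capture flag. It issues no ioctl. The flag names and values are as I recall them from the kernel's i915_drm.h uapi header (where the batch-first flag is spelled I915_EXEC_BATCH_FIRST), so check them against your headers. The dictionaries, handles, and function name are hypothetical and only model the ordering.

```python
# Flag values recalled from i915_drm.h; treat them as assumptions to verify.
I915_EXEC_BATCH_FIRST = 1 << 18   # execbuffer2 flag: batch is object 0
EXEC_OBJECT_CAPTURE = 1 << 7      # per-object flag: include BO in error state

def build_exec_list(batch_handle, other_handles, trace_handle):
    """Model the object list: batch first, then the other BOs, then the trace BO."""
    objects = [{"handle": batch_handle, "flags": 0}]
    objects += [{"handle": h, "flags": 0} for h in other_handles]
    # The trace BO is marked for capture so it shows up in the error state.
    objects.append({"handle": trace_handle, "flags": EXEC_OBJECT_CAPTURE})
    return {"objects": objects, "flags": I915_EXEC_BATCH_FIRST}

# Hypothetical GEM handles.
request = build_exec_list(batch_handle=10, other_handles=[11, 12], trace_handle=99)
print([o["handle"] for o in request["objects"]])
```

With the batch pinned to index 0, appending the trace BO at the end does not shift the index of any existing object.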

>   2. -AFTER- the ioctl returns, emit the log data (as now) and delete the GEM BO. In order to read the GPU state more accurately, I need to walk the log and update the GPU state after the ioctl (mostly out of paranoia about values copied from BOs to pipeline registers).

Up to you, but my paranoia goes the other way. Once the ioctl returns, the hw is indeed using that memory, so I have less trust in it. If you need to tie the relocated pointers to the trace, I would also emit relocations into the trace. For post-mortem GPU hang debugging, I would want the execbuf to be complete before the ioctl, rather than relying on post-processing.

> What would happen is that, if a batchbuffer made the GPU hang, you would then know all the GL commands (trace IDs from the API trace) that put commands into that batchbuffer. Then one could go back to the apitrace of the troublesome application and have a much better starting place to debug.
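
A small sketch of how such a marker list could be used after a hang: given markers sorted by start offset, find the call whose commands cover a given batch offset. It assumes the hang report provides an offset into the batchbuffer, which goes slightly beyond what is described above; the function name and sample data are hypothetical.

```python
import bisect

def call_at_offset(markers, offset):
    """Return the trace call ID whose commands cover `offset`.

    `markers` is a list of (call_id, start_offset) sorted by start_offset.
    Returns None if `offset` lies before the first marker.
    """
    starts = [start for _, start in markers]
    i = bisect.bisect_right(starts, offset) - 1
    return markers[i][0] if i >= 0 else None

# Hypothetical markers for one batchbuffer.
markers = [(1042, 0), (1043, 96), (1051, 640)]
print(call_at_offset(markers, 100))
```

The returned call ID is then the place to start looking in the apitrace file.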

Yup. As time goes on, I hope this becomes a more complete flight-recorder, so that we don't have to rely on referencing back to a separate trace to work out the interesting calls. My goal is that you can give one instruction (that doesn't require any additional dependencies, so can just be LD_PRELOAD=i965-fdr.so, or better a script installed in mesa-utils?) to a bug reporter and that will then capture enough information.

> We could also do something evil looking and put another modification on apitrace where it can have a list of call trace ranges where it inserts glFinish after each call. Those glFinish()'s will then force the ioctl of the exact troublesome draw call without needing to tell i965 to flush after each draw call.
> 
> Just to make sure, you want the "apitrace" data (call ID list, maybe function name) in a GEM BO? Which GEM BO should it be in the list so that the kernel debug code knows which one to dump? I would guess that if the batchbuffer is the first buffer, then it would be the last buffer; otherwise, if the batchbuffer is the last one, I guess it would be the one just before, but that might screw up reloc data if any of the relocs in the batchbuffer refer to itself. I can also emit the data to a file and close the file before the ioctl and, if the ioctl returns, delete said file (assuming a GPU hang always stops the process, a hang would then leave behind a file).
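
The glFinish idea quoted above (insert a glFinish after each call inside selected trace ranges) can be sketched as a filter over the replayed call stream. This is not apitrace's actual mechanism or API. The call list, the inclusive range format, and the function names are assumptions made purely for illustration.

```python
def needs_finish(call_id, ranges):
    """True if call_id falls inside any inclusive (first, last) range."""
    return any(first <= call_id <= last for first, last in ranges)

def replay_with_finishes(calls, ranges):
    """Return the call names to replay, with glFinish after each selected call."""
    out = []
    for call_id, name in calls:
        out.append(name)
        if needs_finish(call_id, ranges):
            out.append("glFinish")
    return out

# Hypothetical trace excerpt: (call ID, GL function name).
calls = [(1, "glClear"), (2, "glDrawArrays"), (3, "glDrawElements"), (4, "glFlush")]
print(replay_with_finishes(calls, [(2, 3)]))
```

Narrowing the ranges to the suspect calls isolates each draw in its own submission without flushing after every draw in the whole trace.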

My vision is that you would attach all "files" to the execbuf, but then again I'm focusing on fdr and not debugging of new features. So long as we are talking about a few megabytes of trace data that isn't too bad. Then we don't have to fiddle around with extra files to find the ones corresponding to the hang, as they will be recorded in the error state. The contents I leave up to you :) (I figure it is a snowball, once a tracing mechanism exists for capturing GPU hangs, there'll be lots of suggestions! One is probably just to capture the aub annotations alongside the batch. Hmm, that might be a good one for me to try just so I can flesh out the fdr mechanism...)

-Chris