[Freedreno] [RFC 0/4] drm/msm: GPU crash state

Rob Clark robdclark at gmail.com
Fri Jan 5 22:51:24 UTC 2018


On Fri, Jan 5, 2018 at 5:11 PM, Jordan Crouse <jcrouse at codeaurora.org> wrote:
> On Fri, Jan 05, 2018 at 06:32:22PM +0000, Chris Wilson wrote:
>> Quoting Jordan Crouse (2018-01-05 18:00:17)
>> > This is a request for comment on code to store and dump the GPU state
>> > after a hang, with inspiration from the very good i915 GPU error state and
>> > the binary GPU snapshot in the downstream kernel.
>> >
>> > The goal is to store and provide enough information to debug software
>> > and hardware issues on the Adreno hardware in a semi human-readable
>> > format that can also be parsed by scripts.
>> >
>> > The goal for this request for comment is to get some consensus
>> > about the format and work through some of the technical issues.
>>
>> My biggest regret for i915/error is that we didn't adopt a sensible file
>> format and organically grew it from dmesg-style logging. This is quite a
>> hindrance when it comes to trying to improve the capture whilst
>> maintaining compatibility with the existing tools. Switching to json/yaml
>> at this point won't be too difficult to spot the change in format, just a
>> large chunk of technical debt to pay off. So I would recommend you pick
>> an adaptable, human-readable file format for ease of tool development.
>
> This is a really great suggestion. The downstream qcom kernel uses a strictly
> binary format which is also problematic for other reasons. I like the idea of
> having something standard and extensible while remaining human readable without
> tools.
>
>> The second important feature for capturing error state is to include as
>> much user information as possible. You want to be able to identify which
>> library generated the hang in a post-mortem dump from a user in 6-12
>> months' time, and just as importantly, why the library did what it did. I
>> like the idea of userspace being able to attach buffers that are
>> included in the error state (supplied as auxiliary information to the
>> guilty command stream) to provide a flight-data-recorder from the user's
>> pov. So design your interface with a view to extending to include blobs.
>
> I love the ascii85 and compression stuff that i915 does and that would fit in
> well with a nice file format as well.

I guess I should dig out from under a pile of snow and unread email
and read your patches.. but if you don't already, including at least
the full cmdline of the process that triggered the hang would be
hugely useful.  This was a big improvement when I added it to
hangcheck dbg msgs and hangrd (ie. otherwise 90% of piglit crashes
just showed up as "shader_runner" or something equally generic)..

Maybe it is possible to capture more of /proc/$pid/* ?  I guess that
is a good area where code could be shared across drivers (ie. generic
sections for /proc/$pid/cmdline and /proc/$pid/maps and whatever else,
plus driver specific sections for various
register/cmdstream/debug-state dumps)
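
To make the "generic section" idea a bit more concrete, here is a
rough userspace sketch in Python.  It only relies on the fact that
/proc/$pid/cmdline is a NUL-separated argument list; the section
layout and the names parse_cmdline, read_cmdline and
format_process_section are made up for illustration and are not a
proposal for the actual format (the kernel side would of course look
different).

```python
def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    if not raw:
        return []  # e.g. kernel threads have an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last argument
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\0")]


def read_cmdline(pid):
    """Read the full command line of a live process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return parse_cmdline(f.read())


def format_process_section(pid, argv):
    """Emit a hypothetical driver-independent 'process' section."""
    lines = ["process:", f"  pid: {pid}", "  cmdline:"]
    lines += [f"    - {arg!r}" for arg in argv]
    return "\n".join(lines)


if __name__ == "__main__":
    raw = b"shader_runner\0tests/foo.shader_test\0-auto\0"
    print(format_process_section(1234, parse_cmdline(raw)))
```

With the full argument list in the dump, a generic "shader_runner"
entry becomes traceable to the specific test that was running.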

Perhaps buffer-snapshots could be shared too (at least as far as file
format).. I guess usefulness to other drivers depends on how much
cmdstream has pointers to other useful stuff (like more cmdstream or
shaders).  This is probably the sorta thing that should be tunable
since sometimes capturing every bo ref'd by the submit could be a bit
much.
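
As a sketch of what a tunable, text-friendly buffer snapshot could
look like, the Python below compresses a buffer with zlib and encodes
it as ascii85, skipping buffers above a size limit.  The names
encode_bo, decode_bo and max_bytes are hypothetical, and the output of
Python's base64.a85encode is not claimed to be byte-compatible with
what i915 emits; this only illustrates the compress-then-encode idea.

```python
import base64
import zlib


def encode_bo(data, max_bytes=1 << 20):
    """Return compressed ascii85 text for a buffer, or None if it is too big.

    max_bytes is the (hypothetical) tunable that limits how much gets captured.
    """
    if len(data) > max_bytes:
        return None
    return base64.a85encode(zlib.compress(data)).decode("ascii")


def decode_bo(text):
    """Inverse of encode_bo, as a parsing tool would use it."""
    return zlib.decompress(base64.a85decode(text))


if __name__ == "__main__":
    cmdstream = bytes(4096)  # stand-in for a mostly-zero buffer
    text = encode_bo(cmdstream)
    print(len(cmdstream), "bytes ->", len(text), "chars")
    assert decode_bo(text) == cmdstream
```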

I'd definitely be open to extending my cmdstream parsing tools to a
new format..  the "rd" format was pretty organically grown (although
the type/length/value binary format gave enough extensibility that I
didn't need to throw it away yet).. but I could definitely see the
value in something more human readable yet still parsable to feed into
cmdstream/register parsing.
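
For anyone not familiar with why type/length/value holds up well as a
format grows, here is a minimal Python sketch.  The header layout
(two little-endian u32 fields) and the section type numbers are
assumptions made for the example and are not meant as a description
of the real "rd" layout.

```python
import struct

HEADER = struct.Struct("<II")  # assumed layout: u32 type, u32 payload length


def pack_section(sect_type, payload):
    """Serialize one section as header + payload."""
    return HEADER.pack(sect_type, len(payload)) + payload


def iter_sections(blob):
    """Yield (type, payload) pairs, using the length field to find the next one."""
    offset = 0
    while offset < len(blob):
        if offset + HEADER.size > len(blob):
            raise ValueError("truncated section header")
        sect_type, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        if offset + length > len(blob):
            raise ValueError("truncated section payload")
        yield sect_type, blob[offset:offset + length]
        offset += length


if __name__ == "__main__":
    KNOWN = {1: "cmdline", 2: "registers"}  # hypothetical type numbers
    blob = (pack_section(1, b"shader_runner")
            + pack_section(99, b"added by a newer writer")
            + pack_section(2, b"\x00\x01\x02\x03"))
    for sect_type, payload in iter_sections(blob):
        if sect_type in KNOWN:
            print(KNOWN[sect_type], payload)
```

Because every section carries its own length, an older parser can skip
type 99 without understanding it, which is the extensibility referred
to above.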

BR,
-R


>> It would be interesting to have a common file format... While
>> interpreting the data is going to be highly specific to a gpu/driver, the
>> data itself will be similar between drivers. If we had a common file
>> format, we could extend something like mesa/intel/aubinator_error_decode
>> and throw in a bunch of xml descriptors for the different gpus. Just a
>> thought...
>
> I'm definitely open to this. There is never anything wrong with improved
> debugging for everybody.
>
> Thanks,
> Jordan
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project

