[Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Thu Dec 23 18:54:32 UTC 2021
Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:
1. the delay between the context reset notification (when the vmas are copied
and when we verify we had received a guc err capture dump) may be subjectively
large enough and not tethered that the guc-id may have already been re-assigned.
2. I was really looking for some kind of unique context handle to print out that could
be correlated (by user inspecting the dump) back to a unique app or process or
context-id but cant find such a param in struct intel_context.
As part of further reviewing the end to end flows and possible error scenarios, there
also may potentially be a mismatch between "which context was reset by guc at time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time later.
(Recap: First, guc notifies capture event, second, guc notifies context reset during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
Third step is when user triggers the debugfs to dump which is when we parse the error
capture logs.)
As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
for the very first context that was reset. And with guc-submission, that would always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
logs are large enough to hold data for multiple dumps).
The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
but i felt it warranted a good explanation. Apologies for the inbox noise.
...alan
On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> Thanks again for the detailed review here.
> Will fix all the rest on next rev.
> One special response for this one:
>
>
> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > + if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > + GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > + eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > + eng = guc_lookup_engine(guc, engineclass, eng_inst);
> > > + if (eng) {
> > > + GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> > > + } else {
> > > + PRINT(&i915->drm, ebuf, " i915-Eng-Lookup Fail!\n");
> > > + }
> > > + ce = guc_context_lookup(guc, data.guc_ctx_id);
> >
> > You are going to need to reference count the 'ce' here. See
> > intel_guc_context_reset_process_msg for an example.
> >
>
> Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
> development. Sorry about that i just totally missed it from my todo-notes.
>
> ...alan
More information about the Intel-gfx
mailing list