[Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification.

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Wed Dec 8 18:22:17 UTC 2021


After chatting offline with Matt, it became apparent that
I somehow missed the fact that the CTB processing handler
already runs from a work queue. That said, Matt is correct: I
don't need to create another work queue to extract the capture
log into the interim store. That would also eliminate the race
condition (when the context-reset notification comes
in later via the same queue).
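
In other words, the G2H handler can just do the extraction inline.
An untested sketch, trimmed to the hunk quoted below (only the comment
is new; everything else is as in the RFC):

	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
		drm_warn(&guc_to_gt(guc)->i915->drm,
			 "G2H-Error capture no space\n");

	/*
	 * We are already running from the CTB processing worker, so the
	 * capture log can be pulled into the interim store synchronously
	 * here instead of via a separate worker.
	 */
	intel_guc_capture_store_snapshot(guc);

	return 0;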

Thanks again Matt.

....alan


On Tue, 2021-12-07 at 21:15 -0800, Alan Previn Teres Alexis wrote:
> Thanks, Matt, for reviewing. Responses to your questions below;
> will fix the rest in the next rev.
>  
> > > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> > >  		return -EPROTO;
> > >  	}
> > >  
> > > -	status = msg[0];
> > > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> > >  
> > > -	/* Add extraction of error capture dump */
> > > +	intel_guc_capture_store_snapshot(guc);
> > 
> > This is done in a different worker, right? How does this not race with an
> > engine reset notification that does an error capture (e.g. the error
> > capture is done before we read out the info from the GuC)?
> > 
> I believe the GuC error-state-capture notification event comes right before
> context resets, not engine resets. Also, the i915_gpu_coredump subsystem
> gathers HW state in response to a context hanging and does so prior to
> the HW reset, so the engine-reset notification doesn't play a role here.
> However, since the context-reset notification is expected to come right after
> the error-state-capture notification, your argument is valid in this case:
> a race condition can exist when the context-reset event leads to a call into
> the i915_gpu_coredump subsystem (which in turn gets a pointer to
> the intel_guc_capture module). Even there, though, no actual parsing is done
> yet - I am currently waiting for the actual debugfs call before parsing any
> of the data. As a fix, how about I add a flush_work at the point where that
> parsing happens, even later?
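
Something along these lines on the debugfs/print side. This is just a
rough sketch; the work-item and function names are placeholders, not
necessarily what this series will use:

	static void guc_capture_print_interim_store(struct intel_guc *guc,
						    struct drm_printer *p)
	{
		/*
		 * Make sure any pending extraction of the GuC error-capture
		 * log into the interim store has finished before parsing
		 * starts.
		 */
		flush_work(&guc->capture_store_work);

		/* ... walk the interim-store bytestream and print it via p ... */
	}
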
> 
> 
> > As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> > memory so I don't think we need a worker here.
> > 
> Yes, I am not doing any allocation, but the worker function does take the
> capture_store's mutex (to ensure we don't trample on the circular-buffer pointers
> of the interim store, i.e. the one between the guc-log-buffer and the error-capture
> reporting). I am not sure spin_lock_irqsave would be safe here: if we had
> back-to-back error captures, we wouldn't want to be spinning that
> long if, coincidentally, the reporting side is actively parsing the bytestream
> output of the same interim buffer.
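
For reference, the snapshot worker looks roughly like this (a sketch
with placeholder struct/field names, not the exact code in the series):

	static void guc_capture_store_snapshot_work(struct work_struct *w)
	{
		struct intel_guc *guc =
			container_of(w, struct intel_guc, capture_store_work);

		/*
		 * Sleeping lock on purpose: the reporting side may hold it
		 * for a while when parsing the interim-store bytestream, so
		 * spinning here (spin_lock_irqsave) across back-to-back
		 * captures could get ugly.
		 */
		mutex_lock(&guc->capture_store.lock);
		/*
		 * ... copy from the guc-log-buffer error-capture region into
		 * the circular interim store and update its pointers ...
		 */
		mutex_unlock(&guc->capture_store.lock);
	}
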
> 
> After thinking a bit more, a lockless solution is possible: I could double-
> buffer the interim-store circular buffer, so that when the event comes in, I flip
> to the next "free" interim store (one that isn't filled with pending logs waiting
> to be read, or being read). If there is no available interim store (i.e.
> we've already had two error-state captures that have yet to be read/flushed), then
> we just drop the output on the floor. (This last part would be in line with the
> current execlist model... unless something changed there in the last two weeks.)
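
Roughly what I mean by double-buffering, as an illustration only (names
are made up, and a real version would need proper ordering/atomics on
the busy flag):

	#define GUC_NUM_INTERIM_STORES 2

	struct guc_interim_store {
		bool busy; /* holds pending logs waiting to be read, or is being read */
		/* ... circular-buffer pointers and backing storage ... */
	};

	static struct guc_interim_store *
	guc_capture_flip_to_free_store(struct guc_interim_store stores[])
	{
		int i;

		for (i = 0; i < GUC_NUM_INTERIM_STORES; i++) {
			if (!stores[i].busy) {
				stores[i].busy = true;
				return &stores[i];
			}
		}

		/*
		 * Both stores are still waiting to be read/flushed: drop this
		 * capture on the floor, in line with the current execlist
		 * behaviour.
		 */
		return NULL;
	}
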
> 
> However, the simplest solution within this series today would be to flush_work
> much later, at the time debugfs calls in to get the error-capture report
> (since that would be the last chance to resolve the possible race condition).
> I could call the flush_work earlier, at the context-reset notification, but that
> too would be an irq handler path.
> 
> > Matt
> > 
> > >  
> > >  	return 0;
> > >  }
> > > -- 
> > > 2.25.1
> > > 


