[Intel-gfx] [PATCH v5 07/10] drm/i915/guc: Extract GuC error capture lists on G2H notification.

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Fri Feb 25 07:21:19 UTC 2022


Folks, a brief update: over the last few weeks of internal reviews,
testing and debug, another redesign has been implemented for this
patch (the extraction of error capture information). When back-to-back
error capture notifications arrive (as part of multiple dependent
engine resets) and a forced full GT reset comes in at the same time
(from the heartbeat, the user-forced reset debugfs or the igt_core
cleanup), the GT reset code takes the reset lock and later calls
guc_submission_reset_prepare, which flushes the CT worker queue that
processes G2H events. intel_guc_capture then gets called to extract
new error capture data from the guc-log buffer while we are in the
midst of a reset, and this causes lockdep issues (the memory shrinker
vs the reset locks). Checking uc->reset_in_progress is racy. So the
extraction code (this patch) needs to be modified to never allocate
memory for the output 'engine-instance-capture' node. The redesign is
complete: a pool of blank nodes is allocated up front and re-used
through the life of the driver. That will be part of the next rev.
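
To illustrate the direction of that redesign, here is a rough sketch
(pool size, struct contents and helper names are placeholders, not the
final patch code):

#include <stdbool.h>
#include <stddef.h>

#define CAPTURE_NODE_POOL_SIZE 32  /* placeholder sizing */

struct capture_node {
    bool in_use;
    /* extracted engine-instance register data would live here */
};

/* Allocated once at driver init so the G2H extraction path never has
 * to allocate memory while a GT reset may be in flight. */
static struct capture_node capture_node_pool[CAPTURE_NODE_POOL_SIZE];

/* Claim a blank node from the pool; NULL if the pool is exhausted. */
static struct capture_node *capture_node_get(void)
{
    size_t i;

    for (i = 0; i < CAPTURE_NODE_POOL_SIZE; i++) {
        if (!capture_node_pool[i].in_use) {
            capture_node_pool[i].in_use = true;
            return &capture_node_pool[i];
        }
    }
    return NULL; /* drop the capture rather than allocate here */
}

/* Return a node to the pool once its contents have been consumed. */
static void capture_node_put(struct capture_node *node)
{
    node->in_use = false;
}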

..alan

On 2/17/2022 11:21 AM, Umesh Nerlige Ramappa wrote:
> On Sun, Feb 13, 2022 at 11:47:00AM -0800, Teres Alexis, Alan Previn 
> wrote:
>> Thanks Umesh for reviewing the patch.
>> Am fixing all the rest but a couple of comments.
>> Responses to the latter and other questions below:
>>
>> ...alan
>>
>>> > +enum intel_guc_state_capture_event_status {
>>> > +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_SUCCESS = 0x0,
>>> > +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE = 0x1,
>>> > +};
>>> > +
>>> > +#define INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK 0x1
>>>
>>> MASK is not needed. See below
>>
>> Alan: Oh wait, actually the mask for the capture status is 0x000000FF
>> (the value above is a typo). I'll fix the mask above but shall not
>> change the code below, because the upper 24 bits of the first dword
>> of this msg are not defined.
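
For clarity, a minimal sketch of what that masking looks like (the
helper name below is illustrative, not the patch code):

#include <stdint.h>

/* The capture status occupies the low 8 bits of the first dword of
 * the G2H message; the upper 24 bits are undefined and must be
 * masked off before comparing against the status enum values. */
#define CAPTURE_EVENT_STATUS_MASK 0x000000FFu

static uint32_t capture_event_status(const uint32_t *msg)
{
    return msg[0] & CAPTURE_EVENT_STATUS_MASK;
}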
>>
>> ...
>>
>>
>>> > +static int guc_capture_buf_cnt(struct __guc_capture_bufstate *buf)
>>> > +{
>>> > +   if (buf->rd == buf->wr)
>>> > +           return 0;
>>> > +   if (buf->wr > buf->rd)
>>> > +           return (buf->wr - buf->rd);
>>> > +   return (buf->size - buf->rd) + buf->wr;
>>> > +}
>>>
>>> Is this a circular buffer shared between GuC and kmd? Since the size is
>>> a power of 2, the above function is simply:
>>>
>> Alan: no, this is not a circular buffer, so I'll keep the above
>> version.
>>> static u32 guc_capture_buf_count(struct __guc_capture_bufstate *buf)
>>> {
>>>       return (buf->wr - buf->rd) & (buf->size - 1);
>>> }
>>>
>>
>> ...
>>
>>> > +static int
>>> > +guc_capture_log_remove_dw(struct intel_guc *guc, struct __guc_capture_bufstate *buf,
>>> > +                          u32 *dw)
>>> > +{
>>> > +   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
>>> > +   int tries = 2;
>>> > +   int avail = 0;
>>> > +   u32 *src_data;
>>> > +
>>> > +   if (!guc_capture_buf_cnt(buf))
>>> > +           return 0;
>>> > +
>>> > +   while (tries--) {
>>> > +           avail = guc_capture_buf_cnt_to_end(buf);
>>>
>>> Shouldn't this be avail = guc_capture_buf_cnt(buf)?
>>>
>>
>> Alan: The "guc_capture_log_get_[foo]" functions only call the above
>> guc_capture_log_remove_dw when there isn't sufficient space to
>> copy out an entire structure from the space between the read pointer
>> and the end of the subregion (before the wrap-around). Those functions
>> then populate the structure dword by dword by calling the above func.
>> (NOTE: the buffer and all error capture output structs are dword
>> aligned.) That's why the above function tries twice and resets
>> buf->rd = 0 if we find no space left at the end of the subregion
>> (i.e. we need to wrap around) - which can only be detected by calling
>> "guc_capture_buf_cnt_to_end".
>>
>> ...
>>
>>> > +
>>> > +   /* Bookkeeping stuff */
>>> > +   guc->log_state[GUC_CAPTURE_LOG_BUFFER].flush += log_buf_state_local.flush_to_file;
>>> > +   new_overflow = intel_guc_check_log_buf_overflow(guc,
>>> > +                    &guc->log_state[GUC_CAPTURE_LOG_BUFFER],
>>> > +                    full_count);
>>>
>>> I am not sure how the overflow logic works here and whether it is
>>> applicable to the error capture buffer. Is the guc log buffer one big
>>> buffer where the error capture is just a portion of that buffer? If so,
>>> is the wrap-around applicable to just the error-capture buffer or to the
>>> whole buffer?
>>>
>> Alan: Yes, the guc log buffer is one big log buffer, but there are 3
>> independent subregions within it that are populated with different
>> content and are used in different ways and at different times. Each
>> guc-log subregion (general-logs, crash-dump and error-capture) has its
>> own read and write pointers.
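
As a conceptual sketch of that layout (illustrative names only, not the
actual guc_log_buffer_state definition):

#include <stdint.h>

enum guc_log_subregion_id {
    SUBREGION_GENERAL_LOGS,
    SUBREGION_CRASH_DUMP,
    SUBREGION_ERROR_CAPTURE,
    SUBREGION_COUNT,
};

struct guc_log_subregion {
    uint32_t offset;  /* start of this subregion within the big buffer */
    uint32_t size;    /* size of this subregion */
    uint32_t rd;      /* this subregion's read pointer */
    uint32_t wr;      /* this subregion's write pointer */
};

struct guc_log_buffer {
    uint8_t *vaddr;                                 /* one big buffer */
    struct guc_log_subregion sub[SUBREGION_COUNT];  /* per-subregion state */
};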
>
> got it. I would also put this one detail in the commit message since 
> it's not quickly inferred.
>
>>
>>
>>> Also what is the wrap_offset field in struct guc_log_buffer_state?
>>
>> Alan: This is the byte offset of the location in the subregion that is
>> the 1st byte after the last valid entry written by the GuC firmware
>> before a wraparound was done. This can leave a tiny hole at the end of
>> the subregion, which allows better cacheline alignment when flushing
>> entries into the subregion. However, the error-capture subregion is
>> dword aligned and all of the output structures used for error-capture
>> are also dword aligned, so this can never happen for the error-capture
>> subregion.
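
A conceptual sketch of how a reader could honour wrap_offset (again
illustrative only; as noted, the dword alignment means this hole never
exists for the error-capture subregion):

#include <stdint.h>

struct subregion_read_state {
    uint32_t size;         /* full size of the subregion */
    uint32_t wrap_offset;  /* 1st byte after the last valid entry */
    uint32_t rd;           /* read pointer */
    uint32_t wr;           /* write pointer */
};

/* Bytes readable before the reader must wrap back to offset 0.
 * If the writer has already wrapped (wr < rd), the hole between
 * wrap_offset and size has to be skipped. */
static uint32_t readable_to_end(const struct subregion_read_state *s)
{
    uint32_t end = (s->wr < s->rd) ? s->wrap_offset : s->wr;

    return end - s->rd;
}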
>>
> Makes sense, thanks for clarifying.
>
> Umesh
>>
>>>

