[PATCH v9 4/4] drm/xe/guc: Extract GuC capture lists to register snapshot

Wed Jun 19 20:17:53 UTC 2024

See my comments below.

Regards,
Zhanjun

On 2024-06-13 7:26 p.m., Teres Alexis, Alan Previn wrote:
> alan: Hi there Zhanjun, as per offline conversation, i was surprised to
> see you squash the "Guc-error-capture-extraction" together with the
> "Plumb GuC-capture into dev coredump" patch but now realize i made a
> regrettable typo in my rev8 review comments - when i mentioned about
> posisbly squashing pre-allocate-nodes with node-extraction i said
> "patch 4 and patch 6" but what i meant was "patch 4 and patch 5".
> I'm terribly sorry about this mistake. Since we know that you
> have to redo this patch anyways based on offline conversation to
> resolve race between drm-tdr vs guc-context-reset, thanks for agreeing
> to keep extraction separate again. And ofc we don't need pre-allocating
> of nodes now.
will put patch 4 back
> 
> On Thu, 2024-06-06 at 17:07 -0700, Zhanjun Dong wrote:
>> Upon the G2H Notify-Err-Capture event, parse through the
>> GuC Log Buffer (error-capture-subregion) and generate one or
>> more capture-nodes. A single node represents a single "engine-
>> instance-capture-dump" and contains at least 3 register lists:
>> global, engine-class and engine-instance. An internal link
>> list is maintained to store one or more nodes.
>> Because the link-list node generation happen before the call
>> to devcoredump, duplicate global and engine-class register
>> lists for each engine-instance register dump if we find
>> dependent-engine resets in a engine-capture-group.
>> When xe_devcoredump calls into snapshot_from_capture_engine,

...

>> +
>> +static inline u16 xe_engine_class_to_guc_capture_class(enum
>> xe_engine_class class)
>> +{
>> +       return
>> xe_guc_class_to_capture_class(xe_guc_class_to_capture_class(class));
>> +}
>> +
>>   static inline struct xe_gt *guc_to_gt(struct xe_guc *guc)
>>   {
>>          return container_of(guc, struct xe_gt, uc.guc);
>>
> 
> alan:snip.
> 
>> +static void
>> +__guc_capture_create_prealloc_nodes(struct xe_guc *guc)
>> +{
>> +       struct __guc_capture_parsed_output *node = NULL;
>> +       int i;
>> +
>> +       for (i = 0; i < PREALLOC_NODES_MAX_COUNT; ++i) {
>> +               node = guc_capture_alloc_one_node(guc);
>> +               if (!node) {
>> +                       xe_gt_warn(guc_to_gt(guc), "Register capture
>> pre-alloc-cache failure\n");
>> +                       /* dont free the priors, use what we got and
>> cleanup at shutdown */
>> +                       return;
>> +               }
>> +               guc_capture_add_node_to_cachelist(guc->capture,
>> node);
>> +       }
>> +}
>> +
> btw, why are we still using this pre-allocated-nodes framework and
> functions? i thought we decided we didn't need it for the case of Xe?
> (since we'll serialize the extraction-flow vs any reset-flow vs the
> devcoredump-printing  by putting them on the same workqueue). In that
> case we can remove the prealocated nodes' cachelist extraction code
> requiring new ndes can dynamically by calling
> guc_capture_alloc_one_node directly
Current code is still using the pre-alloced node, I will try dynamic 
alloc later. Another concern is the GFP_ATOMIC may alloc from emergency 
pools, it might not be the better choice for debug purposed feature like 
this.

> 
> alan:snip. same comment above for below hnk.
>> +static void
>> +guc_capture_create_prealloc_nodes(struct xe_guc *guc)
>> +{
>> +       /* skip if we've already done the pre-alloc */
>

...

>>   /**
>>    * xe_hw_engine_snapshot_capture - Take a quick snapshot of the HW
>> Engine.
>>    * @hwe: Xe HW Engine.
>> @@ -839,8 +899,12 @@ struct xe_hw_engine_snapshot *
>>   xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe)
>>   {
>>          struct xe_hw_engine_snapshot *snapshot;
>> +       struct xe_gt *gt = hwe->gt;
>> +       struct xe_device *xe = gt_to_xe(gt);
>> +       struct xe_guc *guc = &gt->uc.guc;
>>
> alan:snip
> 
>> +       /* Check GuC settings, job is set and capture outlist not
>> empty,
>> +        * otherwise take it from engine
>> +        */
>> +       if (xe_device_uc_enabled(xe) && xe->wedged.mode >= 1 &&
>> +           !list_empty(&guc->capture->outlist) && xe-
>>> devcoredump.job)
>> +               xe_hw_engine_find_and_copy_guc_capture_snapshot(hwe,
>> snapshot);
>> +       else
>> +               xe_hw_engine_snapshot_from_hw(hwe, snapshot);
> 
> alan: in the above if-else check, we are assuming that guc's capture
> list not being empty means we have the captured node for this
> devcoredump triger... however, in larger scale platforms we could have
> multiple captured nodes from multiple engines so we should not
> be relying on just whether its empty or not, rather the check should
> call guc capture to find matching node based on lrc, guc-id, engnine
> etc (i.e. basically call
> xe_hw_engine_find_and_copy_guc_capture_snapshot directly to see if we
> have a dump.)
Good point, I will add the capture ready for job functions in next rev.

> 
> alan:snip.
> 
> I'll review the rest of this patch after rev9 is out since there are
> other changes incoming as per my first review comments on top.