[PATCH v9 4/4] drm/xe/guc: Extract GuC capture lists to register snapshot
Dong, Zhanjun
zhanjun.dong at intel.com
Wed Jun 19 20:17:53 UTC 2024
See my comments below.
Regards,
Zhanjun
On 2024-06-13 7:26 p.m., Teres Alexis, Alan Previn wrote:
> alan: Hi there Zhanjun, as per offline conversation, i was surprised to
> see you squash the "Guc-error-capture-extraction" together with the
> "Plumb GuC-capture into dev coredump" patch but now realize i made a
> regrettable typo in my rev8 review comments - when i mentioned about
> posisbly squashing pre-allocate-nodes with node-extraction i said
> "patch 4 and patch 6" but what i meant was "patch 4 and patch 5".
> I'm terribly sorry about this mistake. Since we know that you
> have to redo this patch anyways based on offline conversation to
> resolve race between drm-tdr vs guc-context-reset, thanks for agreeing
> to keep extraction separate again. And ofc we don't need pre-allocating
> of nodes now.
will put patch 4 back
>
> On Thu, 2024-06-06 at 17:07 -0700, Zhanjun Dong wrote:
>> Upon the G2H Notify-Err-Capture event, parse through the
>> GuC Log Buffer (error-capture-subregion) and generate one or
>> more capture-nodes. A single node represents a single "engine-
>> instance-capture-dump" and contains at least 3 register lists:
>> global, engine-class and engine-instance. An internal link
>> list is maintained to store one or more nodes.
>> Because the link-list node generation happen before the call
>> to devcoredump, duplicate global and engine-class register
>> lists for each engine-instance register dump if we find
>> dependent-engine resets in a engine-capture-group.
>> When xe_devcoredump calls into snapshot_from_capture_engine,
...
>> +
>> +static inline u16 xe_engine_class_to_guc_capture_class(enum
>> xe_engine_class class)
>> +{
>> + return
>> xe_guc_class_to_capture_class(xe_guc_class_to_capture_class(class));
>> +}
>> +
>> static inline struct xe_gt *guc_to_gt(struct xe_guc *guc)
>> {
>> return container_of(guc, struct xe_gt, uc.guc);
>>
>
> alan:snip.
>
>> +static void
>> +__guc_capture_create_prealloc_nodes(struct xe_guc *guc)
>> +{
>> + struct __guc_capture_parsed_output *node = NULL;
>> + int i;
>> +
>> + for (i = 0; i < PREALLOC_NODES_MAX_COUNT; ++i) {
>> + node = guc_capture_alloc_one_node(guc);
>> + if (!node) {
>> + xe_gt_warn(guc_to_gt(guc), "Register capture
>> pre-alloc-cache failure\n");
>> + /* dont free the priors, use what we got and
>> cleanup at shutdown */
>> + return;
>> + }
>> + guc_capture_add_node_to_cachelist(guc->capture,
>> node);
>> + }
>> +}
>> +
> btw, why are we still using this pre-allocated-nodes framework and
> functions? i thought we decided we didn't need it for the case of Xe?
> (since we'll serialize the extraction-flow vs any reset-flow vs the
> devcoredump-printing by putting them on the same workqueue). In that
> case we can remove the prealocated nodes' cachelist extraction code
> requiring new ndes can dynamically by calling
> guc_capture_alloc_one_node directly
Current code is still using the pre-alloced node, I will try dynamic
alloc later. Another concern is the GFP_ATOMIC may alloc from emergency
pools, it might not be the better choice for debug purposed feature like
this.
>
> alan:snip. same comment above for below hnk.
>> +static void
>> +guc_capture_create_prealloc_nodes(struct xe_guc *guc)
>> +{
>> + /* skip if we've already done the pre-alloc */
>
...
>> /**
>> * xe_hw_engine_snapshot_capture - Take a quick snapshot of the HW
>> Engine.
>> * @hwe: Xe HW Engine.
>> @@ -839,8 +899,12 @@ struct xe_hw_engine_snapshot *
>> xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe)
>> {
>> struct xe_hw_engine_snapshot *snapshot;
>> + struct xe_gt *gt = hwe->gt;
>> + struct xe_device *xe = gt_to_xe(gt);
>> + struct xe_guc *guc = >->uc.guc;
>>
> alan:snip
>
>> + /* Check GuC settings, job is set and capture outlist not
>> empty,
>> + * otherwise take it from engine
>> + */
>> + if (xe_device_uc_enabled(xe) && xe->wedged.mode >= 1 &&
>> + !list_empty(&guc->capture->outlist) && xe-
>>> devcoredump.job)
>> + xe_hw_engine_find_and_copy_guc_capture_snapshot(hwe,
>> snapshot);
>> + else
>> + xe_hw_engine_snapshot_from_hw(hwe, snapshot);
>
> alan: in the above if-else check, we are assuming that guc's capture
> list not being empty means we have the captured node for this
> devcoredump triger... however, in larger scale platforms we could have
> multiple captured nodes from multiple engines so we should not
> be relying on just whether its empty or not, rather the check should
> call guc capture to find matching node based on lrc, guc-id, engnine
> etc (i.e. basically call
> xe_hw_engine_find_and_copy_guc_capture_snapshot directly to see if we
> have a dump.)
Good point, I will add the capture ready for job functions in next rev.
>
> alan:snip.
>
> I'll review the rest of this patch after rev9 is out since there are
> other changes incoming as per my first review comments on top.
More information about the Intel-xe
mailing list