[PATCH v6 2/6] drm/xe/guc: Don't store capture nodes in xe_devcoredump_snapshot
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Wed Feb 12 19:25:45 UTC 2025
alan: I will respin this rev with the changes mentioned below - thanks Zhanjun for the time in reviewing this. :)
On Mon, 2025-02-10 at 18:41 -0500, Dong, Zhanjun wrote:
> See my comments inline below.
>
> Regards,
> Zhanjun Dong
>
> On 2025-01-28 1:36 p.m., Alan Previn wrote:
> > GuC-Err-Capture should not be storing register snapshot
> > nodes directly inside of the top level xe_devcoredump_snapshot
> > structure that it doesn't control. Furthermore, that is
> > not right from a driver subsystem layering perspective.
> >
> > Instead, when a matching GuC-Err-Capture register snapshot is
> > available, store it into xe_hw_engine_snapshot structure.
> >
> > To ensure the manual snapshot can be retrieved and released
> > like the firmware-reported snapshot nodes, replace xe_engine_manual_capture
> > with xe_guc_capture_snapshot_store_manual_job (which generates
> > and stores the manual GuC-Err-Capture register snapshot
> > within its internal outlist).
> >
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_devcoredump.c | 3 -
> > drivers/gpu/drm/xe/xe_devcoredump_types.h | 6 -
> > drivers/gpu/drm/xe/xe_guc_capture.c | 153 ++++++++++------------
> > drivers/gpu/drm/xe/xe_guc_capture.h | 9 +-
> > drivers/gpu/drm/xe/xe_guc_submit.c | 12 +-
> > drivers/gpu/drm/xe/xe_hw_engine.c | 32 ++---
> > drivers/gpu/drm/xe/xe_hw_engine_types.h | 8 ++
> > 7 files changed, 102 insertions(+), 121 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > index 39fe485d2085..006041997550 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > @@ -149,9 +149,6 @@ static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> > xe_guc_ct_snapshot_free(ss->guc.ct);
> > ss->guc.ct = NULL;
> >
> > - xe_guc_capture_put_matched_nodes(&ss->gt->uc.guc);
> > - ss->matched_node = NULL;
> > -
> Sounds like this removes the capture_put_matched_nodes call from
> devcoredump_snapshot_free and instead frees the node via
> xe_hw_engine_snapshot_free; that's fine.
alan: that's right - from a driver layering perspective, the node retrieved from guc-err-cap
is a subset of the xe_hw_engine_snapshot, not of the top-level devcore-dump.
Also, as per the series header, guc-err-capture shall not reach up into structures it doesn't
own to hold the matching node handle - i.e. guc-err-cap won't modify
xe_hw_engine_snapshot; instead xe_hw_engine shall retrieve the matching node from guc-err-cap
and store it in the xe_hw_engine_snapshot which it allocated in the first place.
That's why the free has moved.
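Roughly the shape I'm aiming for in the respin (sketch only - the matched_node
field name, the extra guc parameter and the put helper below are my assumptions
of how it will look, not final code):

        struct xe_hw_engine_snapshot {
                /* ... existing fields (name, class/instance, reg dumps) ... */
                /*
                 * @matched_node: GuC-Err-Capture register node for this engine,
                 * owned (and therefore freed) by the hw-engine snapshot, not by
                 * the top-level devcoredump snapshot
                 */
                struct xe_guc_capture_snapshot *matched_node;
        };

        void xe_hw_engine_snapshot_free(struct xe_hw_engine_snapshot *snapshot,
                                        struct xe_guc *guc)
        {
                if (!snapshot)
                        return;
                /* hand the node back to guc-err-cap's pool before freeing */
                if (snapshot->matched_node)
                        /* hypothetical put helper; exact name TBD in the respin */
                        xe_guc_capture_put_snapshot(guc, snapshot->matched_node);
                kfree(snapshot);
        }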
>
> > xe_guc_exec_queue_snapshot_free(ss->ge);
> > ss->ge = NULL;
> >
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > index c94ce21043a8..28486ed93314 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > @@ -53,12 +53,6 @@ struct xe_devcoredump_snapshot {
> > struct xe_hw_engine_snapshot *hwe[XE_NUM_HW_ENGINES];
> > /** @job: Snapshot of job state */
> > struct xe_sched_job_snapshot *job;
> > - /**
> > - * @matched_node: The matched capture node for timedout job
> > - * this single-node tracker works because devcoredump will always only
> > - * produce one hw-engine capture per devcoredump event
> > - */
> > - struct xe_guc_capture_snapshot *matched_node;
> > /** @vm: Snapshot of VM state */
> > struct xe_vm_snapshot *vm;
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_capture.c b/drivers/gpu/drm/xe/xe_guc_capture.c
> > index e04c87739267..f118e8dd0ecb 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_capture.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> > @@ -1532,35 +1532,18 @@ read_reg_to_node(struct xe_hw_engine *hwe, const struct __guc_mmio_reg_descr_gro
> > }
> > }
> >
> > -/**
> > - * xe_engine_manual_capture - Take a manual engine snapshot from engine.
> > - * @hwe: Xe HW Engine.
> > - * @snapshot: The engine snapshot
> > - *
> > - * Take engine snapshot from engine read.
> > - *
> > - * Returns: None
> > - */
> > -void
> > -xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot *snapshot)
> > +static struct xe_guc_capture_snapshot *
> > +guc_capture_get_manual_snapshot(struct xe_guc *guc, struct xe_hw_engine *hwe)
> > {
> > - struct xe_gt *gt = hwe->gt;
> > - struct xe_device *xe = gt_to_xe(gt);
> > - struct xe_guc *guc = &gt->uc.guc;
> > - struct xe_devcoredump *devcoredump = &xe->devcoredump;
> > + struct xe_gt *gt = guc_to_gt(guc);
> > enum guc_capture_list_class_type capture_class;
> > const struct __guc_mmio_reg_descr_group *list;
> > struct xe_guc_capture_snapshot *new;
> > enum guc_state_capture_type type;
> > - u16 guc_id = 0;
> > - u32 lrca = 0;
> > -
> > - if (IS_SRIOV_VF(xe))
> > - return;
>
> Sounds like xe_engine_manual_capture is split into 2 functions, but
> this SRIOV check was lost in the split. Not really an issue since the
> caller checks it, but it looks unbalanced after the split. Do you think
> that will be an issue? I mean, later on, when maintaining this code,
> people might forget that the caller needs to check this condition.
alan: yeah - you have a point; this function is further reused in patch
5, so I should add back the SRIOV check above which I lost.
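Something like this at the top of the split-out helper (sketch of the respin,
built from the hunk above):

        static struct xe_guc_capture_snapshot *
        guc_capture_get_manual_snapshot(struct xe_guc *guc, struct xe_hw_engine *hwe)
        {
                /* re-add the check lost in the split: no GuC-Err-Capture on VFs */
                if (IS_SRIOV_VF(guc_to_xe(guc)))
                        return NULL;

                /* ... existing prealloc-node + register read-out flow ... */
        }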
>
> >
> > new = guc_capture_get_prealloc_node(guc);
> > if (!new)
> > - return;
> > + return NULL;
> >
> > capture_class = xe_engine_class_to_guc_capture_class(hwe->class);
> > for (type = GUC_STATE_CAPTURE_TYPE_GLOBAL; type < GUC_STATE_CAPTURE_TYPE_MAX; type++) {
> > @@ -1594,26 +1577,64 @@ xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot
> > }
> > }
> >
> > - if (devcoredump && devcoredump->captured) {
> > - struct xe_guc_submit_exec_queue_snapshot *ge = devcoredump->snapshot.ge;
> > + new->eng_class = xe_engine_class_to_guc_class(hwe->class);
> > + new->eng_inst = hwe->instance;
> >
> > - if (ge) {
> > - guc_id = ge->guc.id;
> > - if (ge->lrc[0])
> > - lrca = ge->lrc[0]->context_desc;
> > - }
> The removed code handled the case where the devcoredump was already
> captured: we could take guc_id and lrca from the queue snapshot, no
> matter whether q was killed or not. I wonder whether there might be
> some corner case between the new and old implementations. To be
> clarified.
alan: As per the offline conversation, I wanted to create two separate functions
for retrieving a guc-err-cap manual snapshot:
1. an early manual snapshot associated with a job (see the function "xe_guc_capture_snapshot_store_manual_job" below)
- this will always have a valid queue and is only called as part of the drm-timedout-job sequence.
2. a raw/late manual snapshot not associated with any job (see patch #5's "xe_guc_capture_snapshot_manual_hwe")
- this will not have a valid queue and is to be called by xe_hw_engine when no match and no queue are
available, which is what happens when it is called as part of the debugfs or gt-reset sequence.
As per the offline conversation, it's better if I add this info into this patch's comments.
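For clarity, the two entry points will look roughly like this (the patch #5
signature is my assumption based on its name; only the first one is in this
patch):

        /* 1. early, job-associated: drm-timedout-job path, q is always valid */
        void xe_guc_capture_snapshot_store_manual_job(struct xe_guc *guc,
                                                      struct xe_exec_queue *q);

        /* 2. raw/late, jobless: debugfs / gt-reset path, no queue available */
        void xe_guc_capture_snapshot_manual_hwe(struct xe_guc *guc,
                                                struct xe_hw_engine *hwe);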
> > + return new;
> > +}
> > +
> > +/**
> > + * xe_guc_capture_snapshot_store_manual_job - Generate and store a manual engine register dump
> > + * @guc: Target GuC for manual capture
> > + * @q: Associated xe_exec_queue to simulate a manual capture on its behalf.
> > + *
> > + * Generate a manual GuC-Error-Capture snapshot of engine instance + engine class registers
> > + * for the engine of the given exec queue. Stores this node in internal outlist for future
> > + * retrieval with the ability to match up against the same queue.
> > + *
> > + * Returns: None
> > + */
> > +void
> > +xe_guc_capture_snapshot_store_manual_job(struct xe_guc *guc, struct xe_exec_queue *q)
> > +{
> > + struct xe_guc_capture_snapshot *new;
> > + struct xe_gt *gt = guc_to_gt(guc);
> > + struct xe_hw_engine *hwe;
> > + enum xe_hw_engine_id id;
> > +
> > + /* we don't support GuC-Error-Capture, including manual captures on VFs */
> > + if (IS_SRIOV_VF(guc_to_xe(guc)))
> > + return;
> > +
> > + if (!q) {
> > + xe_gt_warn(gt, "Manual GuC Error capture requested with invalid job\n");
> Same "depends on q being alive" issue.
> Also, is that a warning?
alan: yes - I wanted to be sure we don't have incorrect plumbing wired up (across the layers, flows and events).
In v7 I don't see any CI issues from this, and in fact none of the suspected regressions have anything to do with
the series, so it looks like we are good. However, I do see your concern in the bigger picture of
racy drm-timeout vs guc-exec-quanta vs the user doing ctrl-c, so I'll change this to drm_dbg.
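i.e. something like this (sketch; assuming the xe_gt_dbg wrapper is the right
demotion target here):

        if (!q) {
                /*
                 * demoted from xe_gt_warn: this can legitimately race with a
                 * user ctrl-c or the guc-exec-quanta vs drm-timeout paths
                 */
                xe_gt_dbg(gt, "Manual GuC Error capture requested with invalid job\n");
                return;
        }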
> > + return;
> > }
> >
> > - new->eng_class = xe_engine_class_to_guc_class(hwe->class);
> > - new->eng_inst = hwe->instance;
> > - new->guc_id = guc_id;
> > - new->lrca = lrca;
> > + /* Find hwe for the queue */
> > + for_each_hw_engine(hwe, gt, id) {
> > + if (hwe != q->hwe)
> > + continue;
> > + break;
> > + }
> > + if (hwe != q->hwe) {
> > + xe_gt_warn(gt, "Manual GuC Error capture failed to find matching engine\n");
> Not found - is that a warning or debug info?
alan: I shall change this to drm_dbg as per the above.
> > + return;
> > + }
> > +
> > + new = guc_capture_get_manual_snapshot(guc, hwe);
> > + if (!new)
> > + return;
> > +
> > + new->guc_id = q->guc->id;
> > + new->lrca = xe_lrc_ggtt_addr(q->lrc[0]);
> Same "depends on q being alive" issue.
alan: as per the offline conversation and my response above to your comment
about the removal of the following hunk:
- if (ge) {
- guc_id = ge->guc.id;
- if (ge->lrc[0])
- lrca = ge->lrc[0]->context_desc;
- }
... in patch 5 we separate out a function just to handle jobless snapshots.
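Roughly like this (sketch of the patch #5 shape; the body is my assumption,
reusing the helpers from this patch):

        /* jobless manual capture, e.g. for the debugfs / gt-reset paths */
        void xe_guc_capture_snapshot_manual_hwe(struct xe_guc *guc,
                                                struct xe_hw_engine *hwe)
        {
                struct xe_guc_capture_snapshot *new;

                new = guc_capture_get_manual_snapshot(guc, hwe);
                if (!new)
                        return;

                /* no queue here, so guc_id/lrca stay at their defaults */
                new->is_partial = 0;
                new->locked = 1;
                new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;
                guc_capture_add_node_to_outlist(guc->capture, new);
        }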
> > new->is_partial = 0;
> > new->locked = 1;
> > new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;
> >
> > guc_capture_add_node_to_outlist(guc->capture, new);
> > - devcoredump->snapshot.matched_node = new;
> > +
> > + return;
> > }
alan: snip
> >
> > @@ -1893,51 +1902,23 @@ xe_engine_snapshot_capture_for_queue(struct xe_exec_queue *q)
> > coredump->snapshot.hwe[id] = NULL;
> > continue;
> > }
> > -
> > - if (!coredump->snapshot.hwe[id]) {
> > - coredump->snapshot.hwe[id] =
> > - xe_hw_engine_snapshot_capture(hwe, q);
> > - } else {
> > - struct xe_guc_capture_snapshot *new;
> > -
> > - new = xe_guc_capture_get_matching_and_lock(q);
> > - if (new) {
> > - struct xe_guc *guc = &q->gt->uc.guc;
> > -
> > - /*
> > - * If we are in here, it means we found a fresh
> > - * GuC-err-capture node for this engine after
> > - * previously failing to find a match in the
> > - * early part of guc_exec_queue_timedout_job.
> > - * Thus we must free the manually captured node
> > - */
> > - guc_capture_free_outlist_node(guc->capture,
> > - coredump->snapshot.matched_node);
> > - coredump->snapshot.matched_node = new;
> > - }
> > - }
> > -
> > - break;
> > + coredump->snapshot.hwe[id] = xe_hw_engine_snapshot_capture(hwe, q);
> Now xe_engine_snapshot_capture_for_queue is fully separated from manual
> capture, so the above logic is no longer needed. Nice.
alan: I like the cleaner dedicated helpers :)