[PATCH v8 6/6] drm/xe/guc: Plumb GuC-capture into dev coredump

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Sat May 11 20:25:05 UTC 2024


On Mon, 2024-05-06 at 18:47 -0700, Zhanjun Dong wrote:
> Add xe_hw_engine_snapshot_from_capture to take snapshot from capture
> node list.
alan: maybe better to describe this more accurately, such as:
"Add xe_hw_engine_snapshot_from_capture to find the matching
guc-error-capture node that was extracted beforehand, and
take a snapshot of the registers from that node"


> Add data struct to map register to a snapshot field, although all
> field is mapped now, which means the offset could be optimized out,
> while in the future, depends on system configuration, the field might
> not be consecutive, keep the offset is reserved for future.
alan: this description needs more clarity. it only makes sense
AFTER understanding the code, which i feel isn't good enough for the
purpose of a commit message. If i may offer a suggestion ->

"Repurpose the 'regs' substruct from 'struct xe_hw_engine_snapshot',
giving it its own type along with additional helper structs
(__reg_map_descr / __reg_map_descr_64) for the purpose of enabling
an immediate lookup of a specific register's storage within the regs
struct based on its location. Although one could argue this is not
needed because the current register mmio-address-offsets are sequential
within the hw mmio-bar, this won't scale for class-registers
or global-registers that are spaced out"

alan:snip
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_capture.c b/drivers/gpu/drm/xe/xe_guc_capture.c
> index 42aae4d99514..959d318f8a6f 100644
> --- a/drivers/gpu/drm/xe/xe_guc_capture.c
> +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> @@ -37,28 +37,33 @@
>   *       from the engine-mmio-base
>   */
>  #define COMMON_XELP_BASE_GLOBAL \
> -       { FORCEWAKE_GT,             0,      0, "FORCEWAKE" }
> +       { FORCEWAKE_GT,                         0,      0}
alan: took me a while to realize we don't actually seem to be
printing out the FORCEWAKE_GT register value (via devcoredump), nor any
global or class registers from the xe_guc_capture lists.
See my additional comments with details further down at function
'guc_capture_find_ecode'.
alan:snip 
>  #define COMMON_BASE_ENGINE_INSTANCE \
> -       { RING_ESR(0),              0,      0, "ESR" }, \
> -       { RING_EMR(0),              0,      0, "EMR" }, \
> -       { RING_EIR(0),              0,      0, "EIR" }, \
alan:snip
> +       { RING_ESR(0),                          0,      0}, \
> +       { RING_EMR(0),                          0,      0}, \
> +       { RING_EIR(0),                          0,      0}, \
alan:snip
>  /* XE_LP Global */
>  static const struct __guc_mmio_reg_descr xe_lp_global_regs[] = {
> @@ -207,7 +212,6 @@ static void __fill_ext_reg(struct __guc_mmio_reg_descr *ext,
>         ext->reg = XE_REG(extlist->reg.__reg.addr);
>         ext->flags = FIELD_PREP(GUC_REGSET_STEERING_GROUP, slice_id);
>         ext->flags |= FIELD_PREP(GUC_REGSET_STEERING_INSTANCE, subslice_id);
> -       ext->regname = extlist->name;
alan: as per offline conversation, we never use "regname" anywhere
in this series (unlike i915) because the regs structure in xe_hw_engine.c
provides that. Here we are rightfully removing that redundancy.
That said, why don't we simply not introduce regname in Patch1 to
begin with? As it stands, it gets initialized, never used, and then thrown
away in this patch, which i think breaks linux kernel patching rules.
alan:snip

> +static void
> +guc_capture_free_list(struct drm_device *drm, struct list_head  *list)
> +{
> +       struct __guc_capture_parsed_output *n, *ntmp;
> +
> +       list_for_each_entry_safe(n, ntmp, list, link)
> +               guc_capture_delete_one_node(drm, n);
alan: i think we are not realizing the proper usage of the pre-allocated
cachelist of nodes. In the big picture:
1. At init, create a pool of nodes for use - we keep this in "cachelist"
   (each node being big enough to hold a capture-snapshot from guc).
2. At g2h-capture-notify time, we extract guc-reported capture-snapshots
   from the guc-capture ring buffer. For each snapshot we extract,
   we grab a free node from "cachelist" and then move it into "outlist".
3. Later, during devcoredump generation, we find the matching node in
   "outlist" and copy all of its register values into xe_hw_engine's
   snapshot-regs to send to drm_printer.
4. At that point the node can be discarded. However, guc_capture_free_list
   above is calling guc_capture_delete_one_node, which frees the allocation.
   This is wrong; instead we should be removing the node from "outlist" and
   putting it back into "cachelist" after a memset. Else we'll quickly find
   ourselves running out of the pre-allocated node list, defeating its purpose
   (i.e. finding ourselves having to allocate memory mid-reset). See where
   "guc_capture_add_node_to_cachelist" is called from in i915 for what i mean;
   a rough sketch follows below. (Of course, as per my comment on a prior
   patch, we do need to verify with xe's core-mm folks whether xe even still
   has the same requirement of not doing mem allocs mid-reset.)
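
alan: sketch of the recycle path i mean (helper and field names here are
assumptions mirroring i915's guc_capture_add_node_to_cachelist flow, so the
exact xe names may differ):

static void guc_capture_add_node_to_cachelist(struct xe_guc *guc,
                                              struct __guc_capture_parsed_output *node)
{
        /* remove from "outlist" instead of freeing it ... */
        list_del(&node->link);
        /* ... re-init the node contents, keeping its pre-allocated reg buffers ... */
        guc_capture_init_node(guc, node);
        /* ... and return it to the free pool, ready for the next g2h capture */
        list_add_tail(&node->link, &guc->capture->cachelist);
}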
alan:snip

> @@ -959,6 +960,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>         struct xe_sched_job *tmp_job;
>         struct xe_exec_queue *q = job->q;
>         struct xe_gpu_scheduler *sched = &q->guc->sched;
> +       struct xe_gt *gt = q->gt;
alan: why this unrelated change above? let's just focus on guc error capture;
we can do cleanups like this in a standalone patch - assuming it's even
required.
alan:snip

> +struct __reg_map_descr capture_engine_reg[] = {
> +       {offsetof(struct snap_shot_regs, ring_hwstam),  "HWSTAM",       RING_HWSTAM(0)  },
> +       {offsetof(struct snap_shot_regs, ring_hws_pga), "RING_HWS_PGA", RING_HWS_PGA(0) },
> +       {offsetof(struct snap_shot_regs, ring_start),   "RING_START",   RING_START(0)   },
> +       {offsetof(struct snap_shot_regs, ring_head),    "RING_HEAD",    RING_HEAD(0)    },
> +       {offsetof(struct snap_shot_regs, ring_tail),    "RING_TAIL",    RING_TAIL(0)    },
> +       {offsetof(struct snap_shot_regs, ring_ctl),     "RING_CTL",     RING_CTL(0)     },
> +       {offsetof(struct snap_shot_regs, ring_mi_mode), "RING_MI_MODE", RING_MI_MODE(0) },
> +       {offsetof(struct snap_shot_regs, ring_mode),    "RING_MODE",    RING_MODE(0)    },
> +       {offsetof(struct snap_shot_regs, ring_imr),     "RING_IMR",     RING_IMR(0)     },
> +       {offsetof(struct snap_shot_regs, ring_esr),     "RING_ESR",     RING_ESR(0)     },
> +       {offsetof(struct snap_shot_regs, ring_emr),     "RING_EMR",     RING_EMR(0)     },
> +       {offsetof(struct snap_shot_regs, ring_eir),     "RING_EIR",     RING_EIR(0)     },
> +       {offsetof(struct snap_shot_regs, ipehr),        "IPEHR",        RING_IPEHR(0)   },
> +       {offsetof(struct snap_shot_regs, rcu_mode),     "RCU_MODE",     RCU_MODE        },
> +};
alan: so as per the lengthy offline conversation we had, we really don't like
the idea of keeping a separate list in snap_shot_regs in xe_hw_engine for the
drm_printer use along with another list in guc-error-capture, where the former
is a catch-all bucket of all registers for any given capture case on any
engine while the latter is organized differently, as a list of lists with
global, class and instance groupings per gen.

However, considering we have a much more critical need to address now, let's
capture a separate offline JIRA task for a future patch to resolve the
"flat-catch-all" layout of snap_shot_regs vs the "organized category of lists"
in guc-error-capture, so we could share the same static table instances across
them.

For now, let's solve the immediate problem: GuC is the default submission
mechanism on xe and yet we don't have a coherent way to report hw register
dumps during an engine reset triggered by GuC.


> +static void cp_reg_to_snapshot(u32 offset, u32 value, struct snap_shot_regs *regs)
> +{
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(capture_engine_reg); i++)
> +               if (offset == capture_engine_reg[i].reg.addr) {
> +                       u32 *field = (u32 *)((uintptr_t)regs + capture_engine_reg[i].dst_offset);
> +                       *field = value;
> +                       return;
> +               }
alan:snip
> +}
> +
> +static void guc_capture_find_ecode(struct __guc_capture_parsed_output *node,
> +                                  struct xe_hw_engine_snapshot *snapshot)
> +{
alan:snip
> +       reginfo = node->reginfo + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE;
alan: so it looks like we only ever call cp_reg_to_snapshot for
"LIST_TYPE_ENGINE_INSTANCE". However, as per our offline conversation, this is
incorrect... we should be looking through all register-type lists (see the
sketch right after the quoted function below).
> +       regs = reginfo->regs;
> +       for (i = 0; i < reginfo->num_regs; i++)
> +               cp_reg_to_snapshot(regs[i].offset, regs[i].value, &snapshot->reg);
> +}
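
alan: i.e. something along these lines (rough sketch only; i'm assuming the
GUC_CAPTURE_LIST_TYPE_* enum, GUC_CAPTURE_LIST_TYPE_MAX and the per-node
reginfo[] layout carry over unchanged from the i915 guc-capture ABI):

static void guc_capture_find_ecode(struct __guc_capture_parsed_output *node,
                                   struct xe_hw_engine_snapshot *snapshot)
{
        struct gcap_reg_list_info *reginfo;
        struct guc_mmio_reg *regs;
        int type, i;

        /* walk the global, engine-class and engine-instance lists, not just instance */
        for (type = 0; type < GUC_CAPTURE_LIST_TYPE_MAX; type++) {
                reginfo = node->reginfo + type;
                regs = reginfo->regs;
                for (i = 0; i < reginfo->num_regs; i++)
                        cp_reg_to_snapshot(regs[i].offset, regs[i].value,
                                           &snapshot->reg);
        }
}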
alan:snip
> +void
> +xe_hw_engine_snapshot_from_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot *snapshot)
> +{
> +       struct xe_gt *gt = hwe->gt;
> +       struct xe_guc *guc = &gt->uc.guc;
> +       struct __guc_capture_parsed_output *n, *ntmp;
> +
> +       if (list_empty(&guc->capture->outlist))
> +               return xe_hw_engine_snapshot_from_engine(hwe, snapshot);
alan: This looks partially wrong: we cannot be manually snapshotting engine registers after the
fact when we ARE actually using guc submission with guc error capture. I think we
can check if (the recently merged) "wedged mode is >= 2" (i.e. GuC will NOT initiate
engine resets) and, only if so, do the manual xe_hw_engine_snapshot_from_engine.
Else, if 'list is empty' AND 'xe->wedged.mode < 2', print a warning. That might
indicate we have a bug in the guc-err-capture extraction code or we have leaked away
our pre-allocated cachelist of nodes. Btw, if you added the above code path because you
DID actually see us running out of pre-allocated nodes, then i have already commented
above at guc_capture_free_list that we are not using the "cachelist<->outlist"
concept properly, which would explain why we ran out.
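
alan: i.e. roughly like this (sketch only; i'm assuming the wedged-mode field
is reachable as xe->wedged.mode per the recently merged series):

        struct xe_device *xe = gt_to_xe(gt);

        if (list_empty(&guc->capture->outlist)) {
                /* GuC won't initiate engine resets, so a manual mmio snapshot is legit */
                if (xe->wedged.mode >= 2)
                        return xe_hw_engine_snapshot_from_engine(hwe, snapshot);
                /* otherwise an empty outlist likely means a capture bug or leaked nodes */
                drm_warn(&xe->drm, "GuC error-capture outlist unexpectedly empty\n");
                return;
        }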
> +
> +       /*
> +        * Look for a matching GuC reported error capture node from
> +        * the internal output link-list based on engine class and instance.
> +        */
> +       list_for_each_entry_safe(n, ntmp, &guc->capture->outlist, link) {
> +               u32 hwe_guc_class = xe_engine_class_to_guc_class(hwe->class);
> +
> +               if (n->eng_class == hwe_guc_class && n->eng_inst == hwe->instance) {
alan: just matching engine class and instance alone? i don't think that is sufficient;
we should also be matching against the LRCA and guc-id to be accurate, so we don't end up
mixing capture snapshots across different back-to-back captures that came
from different contexts (see the sketch after the quoted function below).
> +                       guc_capture_find_ecode(n, snapshot);
> +                       list_del(&n->link);
> +                       return;
> +               }
> +       }
> +}
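
alan: something along these lines is what i had in mind (sketch only; the
guc_id/lrca fields on the node and the corresponding values plumbed through
to this function are assumptions - they would need to be carried over from
the timed-out job/context):

                if (n->eng_class == hwe_guc_class && n->eng_inst == hwe->instance &&
                    n->guc_id == snapshot_guc_id && n->lrca == snapshot_lrca) {
                        guc_capture_find_ecode(n, snapshot);
                        list_del(&n->link);
                        return;
                }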

alan: i must say that from a code readability perspective the function names don't seem
to tally with what they are doing.
Current flow in this patch:

xe_hw_engine_snapshot_from_capture (find matching node from outlist)
   |---> guc_capture_find_ecode (loop through capture reglist)
           |---> cp_reg_to_snapshot (copy reg-value into snapshot_regs position)

so maybe something where the names match the code inside them:
xe_hw_engine_find_and_copy_guc_capture_snapshot (find matching node from outlist)
   |---> guc_capture_parse_reglist (loop through capture reglist)
           |---> cp_reg_to_snapshot (copy reg-value into snapshot_regs position)

alan: additionally, the above code location means that xe_hw_engine.c
needs to be aware of guc-capture internals (like what outlist is and
what its node contents are). However, I'm not sure that is an
allowed practice in xe's inter-subsystem access design/guidance. I would
have thought it better to keep guc-capture self-contained and move the above
3 functions into guc-capture, exporting only the first function. This would
mean guc-capture needs to be aware of the snapshot struct's usage. I'm
unfamiliar with the design rules, so let's take this offline with others.
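
alan: in other words, keep the guc-capture internals private and export just a
single entry point from xe_guc_capture.h, something like (the name is only a
placeholder):

void xe_guc_capture_snapshot_hw_engine(struct xe_guc *guc,
                                       struct xe_hw_engine *hwe,
                                       struct xe_hw_engine_snapshot *snapshot);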

alan:snip
> +void
> +xe_hw_engine_snapshot_from_engine(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot *snapshot)
> +{
> +       int i;
> +
> +       /* Skip RCU_MODE, the last print */
> +       for (i = 0; i < ARRAY_SIZE(capture_engine_reg) - 1; i++) {
> +               u32 *field = (u32 *)((uintptr_t)&snapshot->reg + capture_engine_reg[i].dst_offset);
> +               *field = hw_engine_mmio_read32(hwe, capture_engine_reg[i].reg);
> +       }
> +       for (i = 0; i < ARRAY_SIZE(capture_engine_reg_64); i++) {
> +               u64 *field = (u64 *)((uintptr_t)&snapshot->reg +
> +                                     capture_engine_reg_64[i].dst_offset);
> +               *field = hw_engine_mmio_read32(hwe, capture_engine_reg_64[i].reg_lo) |
> +                        (u64)hw_engine_mmio_read32(hwe, capture_engine_reg_64[i].reg_hi) << 32;
> +       }
> +       if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
> +               snapshot->reg.rcu_mode = xe_mmio_read32(hwe->gt, RCU_MODE);
alan: this register affects both the compute and render classes, so this might need fixing (sketch below).
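alan: i.e. (sketch):

        if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE ||
            snapshot->hwe->class == XE_ENGINE_CLASS_RENDER)
                snapshot->reg.rcu_mode = xe_mmio_read32(hwe->gt, RCU_MODE);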
> +
> +       xe_hw_engine_snapshot_instdone_capture(hwe, snapshot);
> +}
> +

alan:snip
> @@ -990,29 +1117,25 @@ void xe_hw_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
>                    snapshot->logical_instance);
>         drm_printf(p, "\tForcewake: domain 0x%x, ref %d\n",
>                    snapshot->forcewake.domain, snapshot->forcewake.ref);
> 

> +       for (i = 0;
> +            /* Skip RCU_MODE, will be processed later */
> +            i < ARRAY_SIZE(capture_engine_reg) - 1;
> +            i++) {
> +               u32 *field = (u32 *)((uintptr_t)&snapshot->reg + capture_engine_reg[i].dst_offset);
> +
> +               drm_printf(p, "\t%s: 0x%08x\n", capture_engine_reg[i].regname, *field);
> +       }
> +
> +       for (i = 0; i < ARRAY_SIZE(capture_engine_reg_64); i++) {
> +               u64 *field = (u64 *)((uintptr_t)&snapshot->reg +
> +                                    capture_engine_reg_64[i].dst_offset);
> +
> +               drm_printf(p, "\t%s: 0x%016llx\n", capture_engine_reg_64[i].regname, *field);
> +       }
>         xe_hw_engine_snapshot_instdone_print(snapshot, p);
>  
> +       /* Last RCU_MODE print */
alan: as mentioned in my other comment in this patch (and as per our
offline conversation), i feel like rcu mode really should also be
captured by guc. But i also realize now that maybe reading this
register directly is okay if it never has any status bits
that change depending on workload and can never change at runtime ...
OR ... if GuC doesn't allow capturing it (which i doubt).
Let me know if i got this wrong.

>         if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
alan: based on the current hw spec, this register has shared impact
across both the CCS and RCS engine classes, so this check should really
be "if render or compute class" (same fix as the sketch further up).
>                 drm_printf(p, "\tRCU_MODE: 0x%08x\n",
>                            snapshot->reg.rcu_mode);
> diff --git a/drivers/gpu/drm/xe/xe_hw_engine.h b/drivers/gpu/drm/xe/xe_hw_engine.h
> index 71968ee2f600..937ce20ea8de 100644
> --- a/drivers/gpu/drm/xe/xe_hw_engine.h
> +++ b/drivers/gpu/drm/xe/xe_hw_engine.h
> @@ -62,6 +62,10 @@ void xe_hw_engine_print(struct xe_hw_engine *hwe, struct drm_printer *p);
>  void xe_hw_engine_setup_default_lrc_state(struct xe_hw_engine *hwe);
>  
>  bool xe_hw_engine_is_reserved(struct xe_hw_engine *hwe);
> +void xe_hw_engine_snapshot_from_engine(struct xe_hw_engine *hwe,
> +                                      struct xe_hw_engine_snapshot *snapshot);
> +void xe_hw_engine_snapshot_from_capture(struct xe_hw_engine *hwe,
> +                                       struct xe_hw_engine_snapshot *snapshot);
>  static inline bool xe_hw_engine_is_valid(struct xe_hw_engine *hwe)
>  {
>         return hwe->name;
> diff --git a/drivers/gpu/drm/xe/xe_hw_engine_types.h b/drivers/gpu/drm/xe/xe_hw_engine_types.h
> index 9f9755e31b9f..4433c0c8ffc2 100644
> --- a/drivers/gpu/drm/xe/xe_hw_engine_types.h
> +++ b/drivers/gpu/drm/xe/xe_hw_engine_types.h
> @@ -150,6 +150,99 @@ struct xe_hw_engine {
>         struct xe_hw_engine_class_intf *eclass;
>  };
>  
> +/**
> + * struct __reg_map_descr - Mapping table, defines a 32 bit register and corresponding data field
> + *
> + * Contains the defines of a 32 bit register, and the offset in the capture snapshot.
> + */
> +struct __reg_map_descr {
> +       /** @dst_offset: Offset in snap_shot_regs structure */
alan: can we not name it "dst_offset", since that sounds too much like what
we use for register or memory-bar offsets, while this is more like a
"position_in_snapshot"? My comment applies to "__reg_map_descr_64" as well
(naming sketch after the quoted struct below).
> +       u32 dst_offset;
> +       /** @regname: Name of register */
> +       const char *regname;
> +       /** @reg: Hardware register */
> +       struct xe_reg reg;
> +};
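alan: e.g. (naming suggestion only, everything else unchanged):

struct __reg_map_descr {
        /** @position_in_snapshot: byte offset of the destination field within struct snap_shot_regs */
        u32 position_in_snapshot;
        /** @regname: Name of register */
        const char *regname;
        /** @reg: Hardware register */
        struct xe_reg reg;
};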
alan:snip
> +/**
> + * struct xe_hw_engine_snapshot - Hardware engine snapshot
> + *
> + * Contains the snapshot of useful hardware engine info and registers.
> + */
> +struct snap_shot_regs {
alan: i believe "snapshot" should be one word, if we want to align with
dictionary spelling (hope i didn't get this wrong?).
alan:snip
> +       u32 ipehr;
> +       /** @rcu_mode: RCU_MODE */
> +       u32 rcu_mode;
alan: this really should be a class or global register, but i
guess it might still match up and get found from guc-capture
if it's located in the "engine instance" register category.
However, as discussed offline, let's continue to make the snapshot
struct a flat catch-all reg-list and, correspondingly, in
guc_capture_find_ecode above we should go through all register
list types when copying register values into the snapshot struct.
That said, i notice you don't have the forcewake register here.
> +       struct {
> +               /** @reg.instdone.ring: RING_INSTDONE */
> +               u32 ring;
> +               /** @reg.instdone.slice_common: SC_INSTDONE */
> 
alan:snip
> +               u32 *row;
> +               /** @reg.instdone.geom_svg: INSTDONE_GEOM_SVGUNIT */
> +               u32 *geom_svg;
> +       } instdone;
alan: nit: i believe, given the purpose of this substruct separation,
this substruct should be called "steered_regs", not "instdone" (in case
future steered registers happen to be named without "INSTDONE" in them).
However, since this naming is unrelated to this series, i'll leave it as a nit.

alan:snip
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> index cd8a2fba5438..922cd1a8753f 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job.c
> +++ b/drivers/gpu/drm/xe/xe_sched_job.c
alan:snip
> @@ -191,10 +193,11 @@ void xe_sched_job_destroy(struct kref *ref)
>  {
>         struct xe_sched_job *job =
>                 container_of(ref, struct xe_sched_job, refcount);
> +       struct xe_device *xe = job_to_xe(job);
>  
>         if (unlikely(job->q->flags & EXEC_QUEUE_FLAG_KERNEL))
> -               xe_pm_runtime_put(job_to_xe(job));
> -       xe_exec_queue_put(job->q);
> +               xe_pm_runtime_put(xe);
alan: sorry, i'm confused - why is this code changing?
it seems unrelated to guc-error-capture to me. If we are NOT doing the
xe_exec_queue_put here (so that certain information like lrca + guc-id
+ engine-instance/class is "held" until we can print the dev-core-dump),
then we should be relocating that call to somewhere after that sequence.
So this is either an unnecessary change ... OR ... we have missed
the "put" that should still go somewhere else and are potentially leaking
now? let me know if i am misunderstanding this patch.

> 

> +
>         dma_fence_put(job->fence);
>         drm_sched_job_cleanup(&job->drm);
>         job_free(job);


