[PATCH v7 1/7] drm/xe/guc: Update GuC ADS size for error capture

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Thu Apr 18 08:19:29 UTC 2024


On Wed, 2024-03-27 at 13:40 -0700, Zhanjun Dong wrote:
> Update GuC ADS size allocation to include space for
> the lists of error state capture register descriptors.
> 
> Then, populate GuC ADS with the lists of registers we want
> GuC to report back to host on engine reset events. This list
> should include global, engine-class and engine-instance
> registers for every engine-class type on the current hardware.
> 
alan:snip

> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
> index df2bffb7e220..abc0866bf22c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
alan:snip
>  
>  static size_t guc_ads_capture_size(struct xe_guc_ads *ads)
>  {
> -       /* FIXME: Allocate a proper capture list */
> -       return PAGE_ALIGN(PAGE_SIZE);
> +       return PAGE_ALIGN(ads->capture_size);
alan: nit: i believe the functions that calculate the size already
does the PAGE_ALIGN-ment, so this is a redundant piece of code. I
was thinking if we might want to test for alignment here - but again
seems redundant since the functions that generate the sizes are
not taking any external inputs.


> @@ -260,6 +263,34 @@ static size_t calculate_golden_lrc_size(struct xe_guc_ads *ads)
>         return total_size;
>  }
>  
> +static size_t calculate_capture_worst_size(struct xe_guc_ads *ads)
alan: since this file has so many parts of ADS, may i suggest we rename this to "calculate_guc_regs_capture_worst_size"?
> +{
> +       struct xe_guc *guc = ads_to_guc(ads);
> +       size_t total_size, class_size, instance_size, global_size;
> +       int i, j;
> +
> +       /* Early calcuate the capture size, to reserve capture size before guc init finished,
> +        * as engine mask is not ready, the calculate here is the worst case size
> +        */
alan: if i may propose some rephrasing of the above comment:
"this function calculates the worst case register lists size by
including all posible engines classes. It is called during the
first of a two-phase GuC (and ADS-population) initialization
sequence, that is, during the pre-hwconfig phase before we have
the exact engine fusing info."

> +       total_size = PAGE_SIZE; /* Pad a page in front for empty lists */
> +       for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> +               for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
> +                       class_size = 0;
> +                       instance_size = 0;
> +                       xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +                                                  j, &class_size);
> 
alan: if for whatever reason xe_guc_capture_getlistsize fails, i see
code paths in that function that will not initialize the last param (class_size), in which case total_size would contain
garbage.
Best to check the return value to ensure valid size was returned...
OR... initialize class_size, instance_size, global_size above.
I would prefer the former since a worst case calculation should really
get valid sizes for all engine classes and global lise, even if it means an empty list which is already designed into
the guc error
capture subsystem's register-list tables.

> +                       xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +                                                  j, &instance_size);
> +                       total_size += class_size + instance_size;
> +               }
> +               global_size = 0;
> +               xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &global_size);
> +               total_size += PAGE_ALIGN(global_size);
alan: nit: i could be wrong, but thought we dont need the
PAGE_ALIGN across the two capture-index types... I thought
we only need it for the final size. I.e. we only need to
PAGE_ALIGN on the final return below. Lets sync offline and
refer to the fw spec.
> +       }
> +
> +       return total_size;
> +}
> 
alan:snip
> @@ -302,9 +334,11 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
>         xe_gt_assert(gt, ads->bo);
>  
>         ads->golden_lrc_size = calculate_golden_lrc_size(ads);
> +       ads->capture_size = 0; /* Clear capture_size before run guc_capture_prep_lists */
alan: above line appears to look totally redundant when a reader
looks at the very next line after. I wonder if its better
to modify guc_capture_prep_lists so that is doesn't check
ads->capture_size and always assumes ads's info_map is valid
(since guc_capture_prep_lists is only called post_hwconfig,
thus, we should already have valid engine masks ... and if we
don't then we should add an error message in guc_capture_prep_lists).
> +       ads->capture_size = guc_capture_prep_lists(ads);
>         ads->regset_size = calculate_regset_size(gt);
>  
alan:snip

> -static void guc_capture_list_init(struct xe_guc_ads *ads)
> +static u32 guc_get_capture_engine_mask(struct xe_gt *gt, struct iosys_map *info_map,
> +                                      u32 capture_class)
> +{
> +       struct xe_device *xe = gt_to_xe(gt);
> +       u32 mask;
> +
> +       switch (capture_class) {
> +       case GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE:
> +               mask = info_map_read(xe, info_map, engine_enabled_masks[GUC_RENDER_CLASS]);
> +               mask |= info_map_read(xe, info_map, engine_enabled_masks[GUC_COMPUTE_CLASS]);
> +               break;
> +
alan: nit: Dont need a new line above after the break.
Same thing for the rest of the cases after this one.


alan:snip

> +static int guc_capture_prep_lists(struct xe_guc_ads *ads)
>  {
> +       struct xe_guc *guc = ads_to_guc(ads);
> +       struct xe_gt *gt = ads_to_gt(ads);
> +       u32 ads_ggtt, capture_offset, null_ggtt, total_size = 0;
> +       struct iosys_map info_map;
> +       bool ads_is_mapped;
> +       size_t size = 0;
> +       void *ptr;
>         int i, j;
> -       u32 addr = xe_bo_ggtt_addr(ads->bo) + guc_ads_capture_offset(ads);
>  
> -       /* FIXME: Populate a proper capture list */
> +       ads_is_mapped = ads->capture_size != 0;
> +       if (ads_is_mapped) {
alan: as per above comment, we might have no reason to use "ads_is_mapped" by checking "ads->capture_size" and just fail
if ads_ggtt is not valid. Let's discuss offline as i am under
the impression this function is only getting called when we
have valid hwconfig info.

alan:snip


>         for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> -               for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
> -                       ads_blob_write(ads, ads.capture_instance[i][j], addr);
> -                       ads_blob_write(ads, ads.capture_class[i][j], addr);
> +               bool write_empty_list;
> +
> +               for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
alan: Actually, "GUC_LAST_ENGINE_CLASS" is part of a list of
definitions used by GuC for information that is NOT related to
register-capture. This other #define list separates out the values for
render-vs-compute. In the register-capture case, GuC interface spec
uses a single value for render-or-compute. So I would propose that
farther down in this patch where we first add the enum for
"GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE", we should also add
GUC_CAPTURE_MAX_ENGINE_CLASS as 5. This actually looks like a bug
that may still exist in i915 ... it should have been fixed when John
Harrison fixed the ~'using wrong #defines for guc capture engine
classes' a long time back. However, the bug doesnt manifest anything
right now, with the curent definitions we have today, they both end
up being '5'.

> +                       u32 engine_mask = guc_get_capture_engine_mask(gt, &info_map, j);
> +                       /* null list if we dont have said engine or list */
> +                       if (!engine_mask) {
> +                               ads_blob_write(ads, ads.capture_class[i][j], null_ggtt);
> +                               ads_blob_write(ads, ads.capture_instance[i][j], null_ggtt);
> +                               continue;
> +                       }
> +                       /********************************************************/
> +                       /*** engine exists: start with engine-class registers ***/
> +                       /********************************************************/
alan:snip

> new file mode 100644
> index 000000000000..bc6b682998e2
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> @@ -0,0 +1,300 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021-2022 Intel Corporation
alan:snip..

> +#include "xe_hw_engine_types.h"
alan:nit: xe_hw_engine_types shoiuld come after xe_g*** (alphabetical)
> +#include "xe_gt.h"
> +#include "xe_gt_printk.h"
> +#include "xe_guc.h"
> +#include "xe_guc_capture.h"
> +#include "xe_guc_capture_fwif.h"
> +#include "xe_guc_ct.h"
> +
> +#include "xe_guc_log.h"
> +#include "xe_gt_mcr.h"
> +#include "xe_guc_submit.h"
> +#include "xe_macros.h"
> +#include "xe_map.h"
alan: any reason why above 5 headers are grouped separately?...
why not group together with the group before it (but will also
need alphabetical reshuffling either way).



alan:snip

> +static const struct __guc_mmio_reg_descr_group *
> +guc_capture_get_device_reglist(struct xe_guc *guc)
> +{
> +       //FIXME: add register list
> +       return NULL;
alan: i am not sure if its acceptable to define all these functions that help manage the
register list and extended list when we have not defined what these lists could look like.
I am familiar with the code from i915 but i think another reviewer may have a hard time
trying to understand what reglists[i] could look like without any example of it getting
instanced with populated values. i think it might be better to move the ext-list code into
the 2nd patch but bring the static register table from patch #2. This was how the series was
organized in i915 during its review. Lets connect offline and check with other folks to
be sure this is needed or if its okay. Of course if we do this, then every patch chunk
that accesses "extlists" would also need to move from everywhere else in this patch to #2.

alan:snip


> +int
> +xe_guc_capture_getlist(struct xe_guc *guc, u32 owner, u32 type, u32 classid, void **outptr)
> +{
> 
alan:snip
> +
> +       caplist = kzalloc(size, GFP_KERNEL);
alan:switch to the drmm_kzalloc so its managed.
> +       if (!caplist) {
> +               xe_gt_dbg(guc_to_gt(guc), "Failed to alloc cached register capture list");
> +               return -ENOMEM;
> +       }
> +
> +       /* populate capture list header */
> +       tmp = caplist;
> +       num_regs = guc_cap_list_num_regs(guc->capture, owner, type, classid);
> +       listnode = (struct guc_debug_capture_list *)tmp;
> +       listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, (u32)num_regs);
> +
> +       /* populate list of register descriptor */
> +       tmp += sizeof(struct guc_debug_capture_list);
> +       guc_capture_list_init(guc, owner, type, classid, (struct guc_mmio_reg *)tmp, num_regs);
> +
> +       /* cache this list */
> +       cache->is_valid = true;
> +       cache->ptr = caplist;
> +       cache->size = size;
> +       cache->status = 0;
> +
> +       *outptr = caplist;
> +
> +       return 0;
> +}
> +
> +int
> +xe_guc_capture_getnullheader(struct xe_guc *guc, void **outptr, size_t *size)
> +{
alan:snip


> +
> +       null_header = kzalloc(tmp, GFP_KERNEL);
alan:switch to the drmm_kzalloc so its managed.
> +       if (!null_header) {
> +               xe_gt_dbg(guc_to_gt(guc), "Failed to alloc cached register capture null list");
> +               return -ENOMEM;
> +       }
> +
> +       gc->ads_null_cache = null_header;
> +       *outptr = null_header;
> +       *size = tmp;
> +
> +       return 0;
> +}
> +
> +int xe_guc_capture_init(struct xe_guc *guc)
> +{
> +       guc->capture = kzalloc(sizeof(*guc->capture), GFP_KERNEL);
alan:switch to the drmm_kzalloc so its managed.
> +       if (!guc->capture)
> +               return -ENOMEM;
> +
> +       guc->capture->reglists = guc_capture_get_device_reglist(guc);
> +
> +       INIT_LIST_HEAD(&guc->capture->outlist);
alan: Since parsing and extration of guc produced register dump is only introduced in
patch #5, the typical rule is we shouldnt have any mention of this variable until patch #5
(note even in the main xe_guc_capture struct definition). However, lets check offline with
committers if this is a strict enforcement considering or not... (since this is all
FW-interaction code, identical to i915 and the merge will always come together).
I suspect we'll have to fix it though.

> +       INIT_LIST_HEAD(&guc->capture->cachelist);
> +
> +       return 0;
> +}

alan: Please move xe_guc_capture_destroy from patch #6 to this patch.
It doesnt match up to have the list allocations in this patch and
then have them destroyed in patch #6. Note however that with
drmm_kzalloc, i am not sure u need a destroy but we can at
least have it set 'guc->capture = NULL;' to catch any
post-destruction accesses.

> diff --git a/drivers/gpu/drm/xe/xe_guc_capture.h b/drivers/gpu/drm/xe/xe_guc_capture.h
> new file mode 100644
> index 000000000000..a16dcbe87af0
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_capture.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021-2021 Intel Corporation
> + */
> +
> +#ifndef _XE_GUC_CAPTURE_H
> +#define _XE_GUC_CAPTURE_H
> +
> +#include <linux/types.h>
> +#include "xe_exec_queue_types.h"
alan:why do we need this header?
alan:snip


> diff --git a/drivers/gpu/drm/xe/xe_guc_capture_fwif.h b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
> new file mode 100644
> index 000000000000..4bb94ac1ff48
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
> @@ -0,0 +1,177 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
alan: nit: for each structure definition in this file, we should try to format
the comments/documentation around it to follow the proper linux kernel documentation style.

alan:snip

> +/*
> + * struct guc_debug_capture_list_header / struct guc_debug_capture_list
> + *
> + * As part of ADS registration, these header structures (followed by
> + * an array of 'struct guc_mmio_reg' entries) are used to register with
> + * GuC microkernel the list of registers we want it to dump out prior
> + * to a engine reset.
> + */
> +struct guc_debug_capture_list_header {
> +       u32 info;
> +#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0)
> +} __packed;
> +
> +struct guc_debug_capture_list {
> +       struct guc_debug_capture_list_header header;
> +       struct guc_mmio_reg regs[];
> +} __packed;
> +
> +/*
> + * struct __guc_mmio_reg_descr / struct __guc_mmio_reg_descr_group
> + *
> + * xe_guc_capture module uses these structures to maintain static
> + * tables (per unique platform) that consists of lists of registers
> + * (offsets, names, flags,...) that are used at the ADS regisration
> + * time as well as during runtime processing and reporting of error-
> + * capture states generated by GuC just prior to engine reset events.
> + */
> +struct __guc_mmio_reg_descr {
> +       struct xe_reg reg;
> +       u32 flags;
> +       u32 mask;
> +       const char *regname;
> +};
> +
> +struct __guc_mmio_reg_descr_group {
> +       const struct __guc_mmio_reg_descr *list;
> +       u32 num_regs;
> +       u32 owner; /* see enum guc_capture_owner */
> +       u32 type; /* see enum guc_capture_type */
> +       u32 engine; /* as per MAX_ENGINE_CLASS */
> +       struct __guc_mmio_reg_descr *extlist; /* only used for steered registers */
> +};
> +
> +/*
> + * struct guc_state_capture_header_t / struct guc_state_capture_t /
> + * guc_state_capture_group_header_t / guc_state_capture_group_t
> + *
> + * Prior to resetting engines that have hung or faulted, GuC microkernel
> + * reports the engine error-state (register values that was read) by
> + * logging them into the shared GuC log buffer using these hierarchy
> + * of structures.
> + */
> +struct guc_state_capture_header_t {
> +       u32 owner;
> +#define CAP_HDR_CAPTURE_VFID GENMASK(7, 0)
> +       u32 info;
> +#define CAP_HDR_CAPTURE_TYPE GENMASK(3, 0) /* see enum guc_capture_type */
> +#define CAP_HDR_ENGINE_CLASS GENMASK(7, 4) /* see GUC_MAX_ENGINE_CLASSES */
> +#define CAP_HDR_ENGINE_INSTANCE GENMASK(11, 8)
> +       u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
> +       u32 guc_id; /* if type-instance, context index of hung context, else set to ~0 */
> +       u32 num_mmios;
> +#define CAP_HDR_NUM_MMIOS GENMASK(9, 0)
> +} __packed;
> +
> +struct guc_state_capture_t {
> +       struct guc_state_capture_header_t header;
> +       struct guc_mmio_reg mmio_entries[];
> +} __packed;
> +
> +enum guc_capture_group_types {
> +       GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
> +       GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
> +       GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
> +};
> +
> +struct guc_state_capture_group_header_t {
> +       u32 owner;
alan: If i read the the spec correctly, capture_group_header
also has VFID field at GENMASK(7,0).
> +       u32 info;
> +#define CAP_GRP_HDR_NUM_CAPTURES GENMASK(7, 0)
> +#define CAP_GRP_HDR_CAPTURE_TYPE GENMASK(15, 8) /* guc_capture_group_types */
> +} __packed;
> +
> +/* this is the top level structure where an error-capture dump starts */
> +struct guc_state_capture_group_t {
> +       struct guc_state_capture_group_header_t grp_header;
> +       struct guc_state_capture_t capture_entries[];
> +} __packed;
> +
> 
alan:snip

> diff --git a/drivers/gpu/drm/xe/xe_guc_fwif.h b/drivers/gpu/drm/xe/xe_guc_fwif.h
> index 5474025271e3..ffd3171de65d 100644
> --- a/drivers/gpu/drm/xe/xe_guc_fwif.h
> +++ b/drivers/gpu/drm/xe/xe_guc_fwif.h
> 
> @@ -166,6 +167,8 @@ struct guc_mmio_reg {
>  #define GUC_REGSET_MASKED              BIT(0)
>  #define GUC_REGSET_MASKED_WITH_VALUE   BIT(2)
>  #define GUC_REGSET_RESTORE_ONLY                BIT(3)
> +#define GUC_REGSET_STEERING_GROUP       GENMASK(15, 12)
alan: I've never noticed this before but i think the spec may have changed since the original work because
STEERING_GROUP is now 16..12.
Lets sync offline.


alan:snip

> +/* Class indecies for capture_class and capture_instance arrays */
> +enum {
> +       GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE = 0,
> +       GUC_CAPTURE_LIST_CLASS_VIDEO = 1,
> +       GUC_CAPTURE_LIST_CLASS_VIDEOENHANCE = 2,
> +       GUC_CAPTURE_LIST_CLASS_BLITTER = 3,
> +       GUC_CAPTURE_LIST_CLASS_GSC_OTHER = 4,
alan: as mentioned earlier, this is where we need a
GUC_CAPTURE_LIST_CLASS_MAX = 5.
> +};
> +
>  /* GuC Additional Data Struct */




More information about the Intel-xe mailing list