[PATCH v7 1/7] drm/xe/guc: Update GuC ADS size for error capture

Dong, Zhanjun zhanjun.dong at intel.com
Fri Apr 19 16:31:56 UTC 2024


See my comments inline below.

Regards,
Zhanjun

On 2024-04-18 4:19 a.m., Teres Alexis, Alan Previn wrote:
> On Wed, 2024-03-27 at 13:40 -0700, Zhanjun Dong wrote:
>> Update GuC ADS size allocation to include space for
>> the lists of error state capture register descriptors.
>>
>> Then, populate GuC ADS with the lists of registers we want
>> GuC to report back to host on engine reset events. This list
>> should include global, engine-class and engine-instance
>> registers for every engine-class type on the current hardware.
>>
> alan:snip
> 
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
>> index df2bffb7e220..abc0866bf22c 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ads.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> alan:snip
>>   
>>   static size_t guc_ads_capture_size(struct xe_guc_ads *ads)
>>   {
>> -       /* FIXME: Allocate a proper capture list */
>> -       return PAGE_ALIGN(PAGE_SIZE);
>> +       return PAGE_ALIGN(ads->capture_size);
> alan: nit: i believe the functions that calculate the size already
> does the PAGE_ALIGN-ment, so this is a redundant piece of code. I
> was thinking if we might want to test for alignment here - but again
> seems redundant since the functions that generate the sizes are
> not taking any external inputs.

Yes, redundant PAGE_ALIGN, will be change to be for total size only.
> 
> 
>> @@ -260,6 +263,34 @@ static size_t calculate_golden_lrc_size(struct xe_guc_ads *ads)
>>          return total_size;
>>   }
>>   
>> +static size_t calculate_capture_worst_size(struct xe_guc_ads *ads)
> alan: since this file has so many parts of ADS, may i suggest we rename this to "calculate_guc_regs_capture_worst_size"?
sure

>> +{
>> +       struct xe_guc *guc = ads_to_guc(ads);
>> +       size_t total_size, class_size, instance_size, global_size;
>> +       int i, j;
>> +
>> +       /* Early calcuate the capture size, to reserve capture size before guc init finished,
>> +        * as engine mask is not ready, the calculate here is the worst case size
>> +        */
> alan: if i may propose some rephrasing of the above comment:
> "this function calculates the worst case register lists size by
> including all posible engines classes. It is called during the
> first of a two-phase GuC (and ADS-population) initialization
> sequence, that is, during the pre-hwconfig phase before we have
> the exact engine fusing info."sure

> 
>> +       total_size = PAGE_SIZE; /* Pad a page in front for empty lists */
>> +       for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
>> +               for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
>> +                       class_size = 0;
>> +                       instance_size = 0;
>> +                       xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
>> +                                                  j, &class_size);
>>
> alan: if for whatever reason xe_guc_capture_getlistsize fails, i see
> code paths in that function that will not initialize the last param (class_size), in which case total_size would contain
> garbage.
> Best to check the return value to ensure valid size was returned...
> OR... initialize class_size, instance_size, global_size above.
> I would prefer the former since a worst case calculation should really
> get valid sizes for all engine classes and global lise, even if it means an empty list which is already designed into
> the guc error
> capture subsystem's register-list tables.
will fix it
> 
>> +                       xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
>> +                                                  j, &instance_size);
>> +                       total_size += class_size + instance_size;
>> +               }
>> +               global_size = 0;
>> +               xe_guc_capture_getlistsize(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &global_size);
>> +               total_size += PAGE_ALIGN(global_size);
> alan: nit: i could be wrong, but thought we dont need the
> PAGE_ALIGN across the two capture-index types... I thought
> we only need it for the final size. I.e. we only need to
> PAGE_ALIGN on the final return below. Lets sync offline and
> refer to the fw spec.I think the page align for total size should works, let me try
>> +       }
>> +
>> +       return total_size;
>> +}
>>
> alan:snip
>> @@ -302,9 +334,11 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
>>          xe_gt_assert(gt, ads->bo);
>>   
>>          ads->golden_lrc_size = calculate_golden_lrc_size(ads);
>> +       ads->capture_size = 0; /* Clear capture_size before run guc_capture_prep_lists */
> alan: above line appears to look totally redundant when a reader
> looks at the very next line after. I wonder if its better
> to modify guc_capture_prep_lists so that is doesn't check
> ads->capture_size and always assumes ads's info_map is valid
> (since guc_capture_prep_lists is only called post_hwconfig,
> thus, we should already have valid engine masks ... and if we
> don't then we should add an error message in guc_capture_prep_lists).
Let me reconsider this part.

>> +       ads->capture_size = guc_capture_prep_lists(ads);
>>          ads->regset_size = calculate_regset_size(gt);
>>   
> alan:snip
> 
>> -static void guc_capture_list_init(struct xe_guc_ads *ads)
>> +static u32 guc_get_capture_engine_mask(struct xe_gt *gt, struct iosys_map *info_map,
>> +                                      u32 capture_class)
>> +{
>> +       struct xe_device *xe = gt_to_xe(gt);
>> +       u32 mask;
>> +
>> +       switch (capture_class) {
>> +       case GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE:
>> +               mask = info_map_read(xe, info_map, engine_enabled_masks[GUC_RENDER_CLASS]);
>> +               mask |= info_map_read(xe, info_map, engine_enabled_masks[GUC_COMPUTE_CLASS]);
>> +               break;
>> +
> alan: nit: Dont need a new line above after the break.
> Same thing for the rest of the cases after this one.
> 
> 
> alan:snip
> 
>> +static int guc_capture_prep_lists(struct xe_guc_ads *ads)
>>   {
>> +       struct xe_guc *guc = ads_to_guc(ads);
>> +       struct xe_gt *gt = ads_to_gt(ads);
>> +       u32 ads_ggtt, capture_offset, null_ggtt, total_size = 0;
>> +       struct iosys_map info_map;
>> +       bool ads_is_mapped;
>> +       size_t size = 0;
>> +       void *ptr;
>>          int i, j;
>> -       u32 addr = xe_bo_ggtt_addr(ads->bo) + guc_ads_capture_offset(ads);
>>   
>> -       /* FIXME: Populate a proper capture list */
>> +       ads_is_mapped = ads->capture_size != 0;
>> +       if (ads_is_mapped) {
> alan: as per above comment, we might have no reason to use "ads_is_mapped" by checking "ads->capture_size" and just fail
> if ads_ggtt is not valid. Let's discuss offline as i am under
> the impression this function is only getting called when we
> have valid hwconfig info.
> 
> alan:snip
let's reconsider this part.

> 
> 
>>          for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
>> -               for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
>> -                       ads_blob_write(ads, ads.capture_instance[i][j], addr);
>> -                       ads_blob_write(ads, ads.capture_class[i][j], addr);
>> +               bool write_empty_list;
>> +
>> +               for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
> alan: Actually, "GUC_LAST_ENGINE_CLASS" is part of a list of
> definitions used by GuC for information that is NOT related to
> register-capture. This other #define list separates out the values for
> render-vs-compute. In the register-capture case, GuC interface spec
> uses a single value for render-or-compute. So I would propose that
> farther down in this patch where we first add the enum for
> "GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE", we should also add
> GUC_CAPTURE_MAX_ENGINE_CLASS as 5. This actually looks like a bug
> that may still exist in i915 ... it should have been fixed when John
> Harrison fixed the ~'using wrong #defines for guc capture engine
> classes' a long time back. However, the bug doesnt manifest anything
> right now, with the curent definitions we have today, they both end
> up being '5'.
Will add GUC_CAPTURE_MAX_ENGINE_CLASS
> 
>> +                       u32 engine_mask = guc_get_capture_engine_mask(gt, &info_map, j);
>> +                       /* null list if we dont have said engine or list */
>> +                       if (!engine_mask) {
>> +                               ads_blob_write(ads, ads.capture_class[i][j], null_ggtt);
>> +                               ads_blob_write(ads, ads.capture_instance[i][j], null_ggtt);
>> +                               continue;
>> +                       }
>> +                       /********************************************************/
>> +                       /*** engine exists: start with engine-class registers ***/
>> +                       /********************************************************/
> alan:snip
> 
>> new file mode 100644
>> index 000000000000..bc6b682998e2
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
>> @@ -0,0 +1,300 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2021-2022 Intel Corporation
> alan:snip..
> 
>> +#include "xe_hw_engine_types.h"
> alan:nit: xe_hw_engine_types shoiuld come after xe_g*** (alphabetical)
To be updated
>> +#include "xe_gt.h"
>> +#include "xe_gt_printk.h"
>> +#include "xe_guc.h"
>> +#include "xe_guc_capture.h"
>> +#include "xe_guc_capture_fwif.h"
>> +#include "xe_guc_ct.h"
>> +
>> +#include "xe_guc_log.h"
>> +#include "xe_gt_mcr.h"
>> +#include "xe_guc_submit.h"
>> +#include "xe_macros.h"
>> +#include "xe_map.h"
> alan: any reason why above 5 headers are grouped separately?...
> why not group together with the group before it (but will also
> need alphabetical reshuffling either way).
To be fixed
> 
> 
> 
> alan:snip
> 
>> +static const struct __guc_mmio_reg_descr_group *
>> +guc_capture_get_device_reglist(struct xe_guc *guc)
>> +{
>> +       //FIXME: add register list
>> +       return NULL;
> alan: i am not sure if its acceptable to define all these functions that help manage the
> register list and extended list when we have not defined what these lists could look like.
> I am familiar with the code from i915 but i think another reviewer may have a hard time
> trying to understand what reglists[i] could look like without any example of it getting
> instanced with populated values. i think it might be better to move the ext-list code into
> the 2nd patch but bring the static register table from patch #2. This was how the series was
> organized in i915 during its review. Lets connect offline and check with other folks to
> be sure this is needed or if its okay. Of course if we do this, then every patch chunk
> that accesses "extlists" would also need to move from everywhere else in this patch to #2.
> 
> alan:snip
> 
I'm trying to make this patch smaller, I think I can add few register in 
this patch to let reviewer know what it looks like, and add most in next 
patch.

> 
>> +int
>> +xe_guc_capture_getlist(struct xe_guc *guc, u32 owner, u32 type, u32 classid, void **outptr)
>> +{
>>
> alan:snip
>> +
>> +       caplist = kzalloc(size, GFP_KERNEL);
> alan:switch to the drmm_kzalloc so its managed.
sure, will do. No more free.

>> +       if (!caplist) {
>> +               xe_gt_dbg(guc_to_gt(guc), "Failed to alloc cached register capture list");
>> +               return -ENOMEM;
>> +       }
>> +
>> +       /* populate capture list header */
>> +       tmp = caplist;
>> +       num_regs = guc_cap_list_num_regs(guc->capture, owner, type, classid);
>> +       listnode = (struct guc_debug_capture_list *)tmp;
>> +       listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, (u32)num_regs);
>> +
>> +       /* populate list of register descriptor */
>> +       tmp += sizeof(struct guc_debug_capture_list);
>> +       guc_capture_list_init(guc, owner, type, classid, (struct guc_mmio_reg *)tmp, num_regs);
>> +
>> +       /* cache this list */
>> +       cache->is_valid = true;
>> +       cache->ptr = caplist;
>> +       cache->size = size;
>> +       cache->status = 0;
>> +
>> +       *outptr = caplist;
>> +
>> +       return 0;
>> +}
>> +
>> +int
>> +xe_guc_capture_getnullheader(struct xe_guc *guc, void **outptr, size_t *size)
>> +{
> alan:snip
> 
> 
>> +
>> +       null_header = kzalloc(tmp, GFP_KERNEL);
> alan:switch to the drmm_kzalloc so its managed.
>> +       if (!null_header) {
>> +               xe_gt_dbg(guc_to_gt(guc), "Failed to alloc cached register capture null list");
>> +               return -ENOMEM;
>> +       }
>> +
>> +       gc->ads_null_cache = null_header;
>> +       *outptr = null_header;
>> +       *size = tmp;
>> +
>> +       return 0;
>> +}
>> +
>> +int xe_guc_capture_init(struct xe_guc *guc)
>> +{
>> +       guc->capture = kzalloc(sizeof(*guc->capture), GFP_KERNEL);
> alan:switch to the drmm_kzalloc so its managed.
>> +       if (!guc->capture)
>> +               return -ENOMEM;
>> +
>> +       guc->capture->reglists = guc_capture_get_device_reglist(guc);
>> +
>> +       INIT_LIST_HEAD(&guc->capture->outlist);
> alan: Since parsing and extration of guc produced register dump is only introduced in
> patch #5, the typical rule is we shouldnt have any mention of this variable until patch #5
> (note even in the main xe_guc_capture struct definition). However, lets check offline with
> committers if this is a strict enforcement considering or not... (since this is all
> FW-interaction code, identical to i915 and the merge will always come together).
> I suspect we'll have to fix it though.
Will reorgize this part.
> 
>> +       INIT_LIST_HEAD(&guc->capture->cachelist);
>> +
>> +       return 0;
>> +}
> 
> alan: Please move xe_guc_capture_destroy from patch #6 to this patch.
> It doesnt match up to have the list allocations in this patch and
> then have them destroyed in patch #6. Note however that with
> drmm_kzalloc, i am not sure u need a destroy but we can at
> least have it set 'guc->capture = NULL;' to catch any
> post-destruction accesses.
Will reorgize this part.
> 
>> diff --git a/drivers/gpu/drm/xe/xe_guc_capture.h b/drivers/gpu/drm/xe/xe_guc_capture.h
>> new file mode 100644
>> index 000000000000..a16dcbe87af0
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_guc_capture.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2021-2021 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_GUC_CAPTURE_H
>> +#define _XE_GUC_CAPTURE_H
>> +
>> +#include <linux/types.h>
>> +#include "xe_exec_queue_types.h"
> alan:why do we need this header?
> alan:snip
To be removed
> 
> 
>> diff --git a/drivers/gpu/drm/xe/xe_guc_capture_fwif.h b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
>> new file mode 100644
>> index 000000000000..4bb94ac1ff48
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
>> @@ -0,0 +1,177 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
> alan: nit: for each structure definition in this file, we should try to format
> the comments/documentation around it to follow the proper linux kernel documentation style.
> 
Will reformat
> alan:snip
> 
>> +/*
>> + * struct guc_debug_capture_list_header / struct guc_debug_capture_list
>> + *
>> + * As part of ADS registration, these header structures (followed by
>> + * an array of 'struct guc_mmio_reg' entries) are used to register with
>> + * GuC microkernel the list of registers we want it to dump out prior
>> + * to a engine reset.
>> + */
>> +struct guc_debug_capture_list_header {
>> +       u32 info;
>> +#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0)
>> +} __packed;
>> +
>> +struct guc_debug_capture_list {
>> +       struct guc_debug_capture_list_header header;
>> +       struct guc_mmio_reg regs[];
>> +} __packed;
>> +
>> +/*
>> + * struct __guc_mmio_reg_descr / struct __guc_mmio_reg_descr_group
>> + *
>> + * xe_guc_capture module uses these structures to maintain static
>> + * tables (per unique platform) that consists of lists of registers
>> + * (offsets, names, flags,...) that are used at the ADS regisration
>> + * time as well as during runtime processing and reporting of error-
>> + * capture states generated by GuC just prior to engine reset events.
>> + */
>> +struct __guc_mmio_reg_descr {
>> +       struct xe_reg reg;
>> +       u32 flags;
>> +       u32 mask;
>> +       const char *regname;
>> +};
>> +
>> +struct __guc_mmio_reg_descr_group {
>> +       const struct __guc_mmio_reg_descr *list;
>> +       u32 num_regs;
>> +       u32 owner; /* see enum guc_capture_owner */
>> +       u32 type; /* see enum guc_capture_type */
>> +       u32 engine; /* as per MAX_ENGINE_CLASS */
>> +       struct __guc_mmio_reg_descr *extlist; /* only used for steered registers */
>> +};
>> +
>> +/*
>> + * struct guc_state_capture_header_t / struct guc_state_capture_t /
>> + * guc_state_capture_group_header_t / guc_state_capture_group_t
>> + *
>> + * Prior to resetting engines that have hung or faulted, GuC microkernel
>> + * reports the engine error-state (register values that was read) by
>> + * logging them into the shared GuC log buffer using these hierarchy
>> + * of structures.
>> + */
>> +struct guc_state_capture_header_t {
>> +       u32 owner;
>> +#define CAP_HDR_CAPTURE_VFID GENMASK(7, 0)
>> +       u32 info;
>> +#define CAP_HDR_CAPTURE_TYPE GENMASK(3, 0) /* see enum guc_capture_type */
>> +#define CAP_HDR_ENGINE_CLASS GENMASK(7, 4) /* see GUC_MAX_ENGINE_CLASSES */
>> +#define CAP_HDR_ENGINE_INSTANCE GENMASK(11, 8)
>> +       u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
>> +       u32 guc_id; /* if type-instance, context index of hung context, else set to ~0 */
>> +       u32 num_mmios;
>> +#define CAP_HDR_NUM_MMIOS GENMASK(9, 0)
>> +} __packed;
>> +
>> +struct guc_state_capture_t {
>> +       struct guc_state_capture_header_t header;
>> +       struct guc_mmio_reg mmio_entries[];
>> +} __packed;
>> +
>> +enum guc_capture_group_types {
>> +       GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
>> +       GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
>> +       GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
>> +};
>> +
>> +struct guc_state_capture_group_header_t {
>> +       u32 owner;
> alan: If i read the the spec correctly, capture_group_header
> also has VFID field at GENMASK(7,0).
To be verified
>> +       u32 info;
>> +#define CAP_GRP_HDR_NUM_CAPTURES GENMASK(7, 0)
>> +#define CAP_GRP_HDR_CAPTURE_TYPE GENMASK(15, 8) /* guc_capture_group_types */
>> +} __packed;
>> +
>> +/* this is the top level structure where an error-capture dump starts */
>> +struct guc_state_capture_group_t {
>> +       struct guc_state_capture_group_header_t grp_header;
>> +       struct guc_state_capture_t capture_entries[];
>> +} __packed;
>> +
>>
> alan:snip
> 
>> diff --git a/drivers/gpu/drm/xe/xe_guc_fwif.h b/drivers/gpu/drm/xe/xe_guc_fwif.h
>> index 5474025271e3..ffd3171de65d 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_fwif.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_fwif.h
>>
>> @@ -166,6 +167,8 @@ struct guc_mmio_reg {
>>   #define GUC_REGSET_MASKED              BIT(0)
>>   #define GUC_REGSET_MASKED_WITH_VALUE   BIT(2)
>>   #define GUC_REGSET_RESTORE_ONLY                BIT(3)
>> +#define GUC_REGSET_STEERING_GROUP       GENMASK(15, 12)
> alan: I've never noticed this before but i think the spec may have changed since the original work because
> STEERING_GROUP is now 16..12.
> Lets sync offline.
> 
To be verified
> 
> alan:snip
> 
>> +/* Class indecies for capture_class and capture_instance arrays */
>> +enum {
>> +       GUC_CAPTURE_LIST_CLASS_RENDER_COMPUTE = 0,
>> +       GUC_CAPTURE_LIST_CLASS_VIDEO = 1,
>> +       GUC_CAPTURE_LIST_CLASS_VIDEOENHANCE = 2,
>> +       GUC_CAPTURE_LIST_CLASS_BLITTER = 3,
>> +       GUC_CAPTURE_LIST_CLASS_GSC_OTHER = 4,
> alan: as mentioned earlier, this is where we need a
> GUC_CAPTURE_LIST_CLASS_MAX = 5.
Will do
>> +};
>> +
>>   /* GuC Additional Data Struct */
> 
> 


More information about the Intel-xe mailing list