[PATCH v8 1/6] drm/xe/guc: Prepare GuC register list and update ADS size for error capture
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Fri May 10 18:43:14 UTC 2024
On Mon, 2024-05-06 at 18:47 -0700, Zhanjun Dong wrote:
> Add referenced registers defines and list of registers.
> Update GuC ADS size allocation to include space for
> the lists of error state capture register descriptors.
>
>
alan:snip
> +int xe_calculate_guc_regs_capture_worst_size(struct xe_guc *guc)
> +{
> + size_t total_size, class_size, instance_size, global_size;
> + int i, j;
> +
> + /* This function calculates the worst case register lists size by
> + * including all possible engines classes. It is called during the
> + * first of a two-phase GuC (and ADS-population) initialization
> + * sequence, that is, during the pre-hwconfig phase before we have
> + * the exact engine fusing info.
> + */
> + total_size = PAGE_SIZE; /* Pad a page in front for empty lists */
> + for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> + for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
alan: so in rev6, i mentioned in one of the other hunks that in guc interface,
the 'max engine class' definition used by guc-error-capture register-list-arrays
does not follow the definition of engine-class enum of above (which is used
for submission and other interfaces with GuC). I notice you fixed this on
another hunk but missed it above. I should have made it clear this change is
required for ALL cases where we are looping through the engine classes' register
lists used in guc- error-capture.
> + if (xe_guc_capture_getlistsize(guc, i,
> + GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> + j, &class_size) < 0)
> + class_size = 0;
> + if (xe_guc_capture_getlistsize(guc, i,
> + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> + j, &instance_size) < 0)
>
alan:snip
> #define MAX_GOLDEN_LRC_SIZE (SZ_4K * 64)
>
> int xe_guc_ads_init(struct xe_guc_ads *ads)
> @@ -398,6 +443,7 @@ int xe_guc_ads_init(struct xe_guc_ads *ads)
> struct xe_bo *bo;
>
> ads->golden_lrc_size = calculate_golden_lrc_size(ads);
> + ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
> ads->regset_size = calculate_regset_size(gt);
> ads->ads_waklv_size = calculate_waklv_size(ads);
>
> @@ -431,9 +477,10 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
> xe_gt_assert(gt, ads->bo);
>
> ads->golden_lrc_size = calculate_golden_lrc_size(ads);
> + ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
alan: nit: we had an offline conversation that despite post_hw_config knowing
exact engine class list, we simpliify and improve execution time by using
worst-case-size in both pre and post hwconfig when deciding on size. Would be nice to
add a comment on that? (so folks dont assume its a bug)
alan:snip
> ads->regset_size = calculate_regset_size(gt);
>
> - xe_gt_assert(gt, ads->golden_lrc_size +
> + xe_gt_assert(gt, ads->golden_lrc_size + ads->capture_size +
> (ads->regset_size - prev_regset_size) <=
> MAX_GOLDEN_LRC_SIZE);
alan: i missed this in rev6 - why are we using ads->capture_size to verify MAX_GOLDEN_LRC_SIZE is
sufficient? i dont believe this macro is related to guc-error-capture reg list.
alan:snip
> +static int guc_capture_prep_lists(struct xe_guc_ads *ads)
> +{
> + struct xe_guc *guc = ads_to_guc(ads);
> + struct xe_gt *gt = ads_to_gt(ads);
> + u32 ads_ggtt, capture_offset, null_ggtt, total_size = 0;
> + struct iosys_map info_map;
> + size_t size = 0;
> + void *ptr;
> int i, j;
alan:snip
> + /********************************************************/
> + /*** engine exists: start with engine-class registers ***/
> + /********************************************************/
> + write_empty_list = true; /* starting assumption is an empty list */
> + size = 0;
> + if (!xe_guc_capture_getlistsize(guc, i,
> + GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> + j, &size)) {
> + if (total_size + size > ads->capture_size)
> + xe_gt_dbg(gt, "Capture size overflow :%lu vs %d\n",
> + total_size + size, ads->capture_size);
alan: i think premerge hooks are failing due to format specifiers - please do fix that.
alan:snip
> + if (ads->capture_size != PAGE_ALIGN(total_size))
> + xe_gt_info(gt, "ADS capture alloc size changed from %d to %d\n",
> + ads->capture_size, PAGE_ALIGN(total_size));
> + return PAGE_ALIGN(total_size);
> }
alan:snip
> +#define COMMON_XELP_BASE_GLOBAL \
> + { FORCEWAKE_GT, 0, 0, "FORCEWAKE" }
> +
> +#define COMMON_BASE_ENGINE_INSTANCE \
> + { RING_ESR(0), 0, 0, "ESR" }, \
> + { RING_EMR(0), 0, 0, "EMR" }, \
> + { RING_EIR(0), 0, 0, "EIR" }, \
> + { RING_EXECLIST_STATUS_HI(0), 0, 0, "RING_EXECLIST_STATUS_HI" }, \
> + { RING_EXECLIST_STATUS_LO(0), 0, 0, "RING_EXECLIST_STATUS_LO" }, \
> + { RING_DMA_FADD(0), 0, 0, "RING_DMA_FADD_LDW" }, \
> + { RING_DMA_FADD_UDW(0), 0, 0, "RING_DMA_FADD_UDW" }, \
> + { RING_IPEHR(0), 0, 0, "IPEHR" }, \
> + { RING_BBADDR(0), 0, 0, "RING_BBADDR_LOW32" }, \
> + { RING_BBADDR_UDW(0), 0, 0, "RING_BBADDR_UP32" }, \
> + { RING_ACTHD(0), 0, 0, "ACTHD_LDW" }, \
> + { RING_ACTHD_UDW(0), 0, 0, "ACTHD_UDW" }, \
> + { RING_START(0), 0, 0, "START" }, \
> + { RING_HEAD(0), 0, 0, "HEAD" }, \
> + { RING_TAIL(0), 0, 0, "TAIL" }, \
> + { RING_CTL(0), 0, 0, "CTL" }, \
> + { RING_MI_MODE(0), 0, 0, "MODE" }, \
> + { RING_HWS_PGA(0), 0, 0, "HWS" }, \
> + { RING_MODE(0), 0, 0, "GFX_MODE" }
alan: I notice that "struct __guc_mmio_reg_descr.regname" is introduced here in
Patch-#1 and used in the population of the static reglist table above but then
never ever referenced in the entire series. In fact in Patch #6, its removed.
NOTE: this is different from i915 where this variable was used in the reporting
patch of i915's series but it looks like from this xe series version, we are
creating a shared "register->offset->name" lookup table list in xe_hw_engine.c
for both guc and execlist in Patch 6.
I believe this flirts dangerously close to breaking linux patch rules. A variable/
func not used in the entire series really shouldnt be added. In this case, we are
NOT consuming regname at all anywhere in the series. And then we remove it in the
subsequent patch-6.
alan:snip
> +static int
> +guc_capture_getlistsize(struct xe_guc *guc, u32 owner, u32 type, u32 classid,
> + size_t *size, bool is_purpose_est)
> +{
> + struct xe_guc_state_capture *gc = guc->capture;
> + struct __guc_capture_ads_cache *cache = &gc->ads_cache[owner][type][classid];
> + int num_regs;
> +
> + if (!gc->reglists) {
> + xe_gt_warn(guc_to_gt(guc), "No capture reglist for this device\n");
> + return -ENODEV;
> + }
> +
> + if (cache->is_valid) {
> + *size = cache->size;
> + return cache->status;
> + }
> +
> + if (!is_purpose_est && owner == GUC_CAPTURE_LIST_INDEX_PF &&
alan: so this one took me a while to figure out why it has become orphaned.
(after i went through the entire series hunting for it). I'm referign to "is_purpose_est".
So this is another i915 inherittance which did have valid reasoning. In i915 when we
were doing "check_guc_capture_size" (which is a one-time event), we wanted to flag if
any of the register tables were non existent. take note that at this early log-buffer-
calculation time, we assumed all engine classes are valid on the platform at
pre-hw-config time. The reason is because we did actually discover cases when new
platforms added new engines and the developer completely missed adding the register
table for error capture as well. With this message, it would get flagged as a warning.
However, this same lower level function was also used for the regular register
list population which can get called everytime we reset guc including coming
out of suspend or if we performed a GT reset and so we didn't want that warning
to happen every single time - only once per module load. so thats why we
created a wraper with that "is_purpose_est" to differentate if the caller
was for the log-buffer-size-estmation or for actual population where we only
print the warn in the former.
so my review comment would be: we already learnt the lessons from i915 so lets
use this properly. i suggest we fix patch #3 so that check_guc_capture_size
will call above guc_capture_getlistsize with is_purpose_est set true as opposed to
calling xe_calculate_guc_regs_capture_worst_size. I'll make the same comment
in Patch3
alan:snip
> +
> +#endif /* _XE_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/xe/xe_guc_capture_fwif.h b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
> new file mode 100644
> index 000000000000..79bc277afaa8
> --- /dev/null
alan:snip
> +struct guc_state_capture_group_header_t {
> + u32 owner;
alan: If i read the the spec correctly, capture_group_header's owner
has VFID field at GENMASK(7,0).
> + u32 info;
> +#define CAP_GRP_HDR_NUM_CAPTURES GENMASK(7, 0)
> +#define CAP_GRP_HDR_CAPTURE_TYPE GENMASK(15, 8) /* guc_capture_group_types */
> +} __packed;
alan:snip
More information about the Intel-xe
mailing list