[PATCH v8 1/6] drm/xe/guc: Prepare GuC register list and update ADS size for error capture
Dong, Zhanjun
zhanjun.dong at intel.com
Tue May 14 22:44:58 UTC 2024
See my comments inline below.
Regards,
Zhanjun
On 2024-05-10 2:43 p.m., Teres Alexis, Alan Previn wrote:
> On Mon, 2024-05-06 at 18:47 -0700, Zhanjun Dong wrote:
>> Add referenced registers defines and list of registers.
>> Update GuC ADS size allocation to include space for
>> the lists of error state capture register descriptors.
>>
>>
> alan:snip
>> +int xe_calculate_guc_regs_capture_worst_size(struct xe_guc *guc)
>> +{
>> + size_t total_size, class_size, instance_size, global_size;
>> + int i, j;
>> +
>> + /* This function calculates the worst case register lists size by
>> + * including all possible engines classes. It is called during the
>> + * first of a two-phase GuC (and ADS-population) initialization
>> + * sequence, that is, during the pre-hwconfig phase before we have
>> + * the exact engine fusing info.
>> + */
>> + total_size = PAGE_SIZE; /* Pad a page in front for empty lists */
>> + for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
>> + for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
> alan: so in rev6, i mentioned in one of the other hunks that in guc interface,
> the 'max engine class' definition used by guc-error-capture register-list-arrays
> does not follow the definition of engine-class enum of above (which is used
> for submission and other interfaces with GuC). I notice you fixed this on
> another hunk but missed it above. I should have made it clear this change is
> required for ALL cases where we are looping through the engine classes' register
> lists used in guc- error-capture.
Thanks, will changed to GUC_CAPTURE_LIST_CLASS_MAX
>> + if (xe_guc_capture_getlistsize(guc, i,
>> + GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
>> + j, &class_size) < 0)
>> + class_size = 0;
>> + if (xe_guc_capture_getlistsize(guc, i,
>> + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
>> + j, &instance_size) < 0)
>>
> alan:snip
>
>> #define MAX_GOLDEN_LRC_SIZE (SZ_4K * 64)
>>
>> int xe_guc_ads_init(struct xe_guc_ads *ads)
>> @@ -398,6 +443,7 @@ int xe_guc_ads_init(struct xe_guc_ads *ads)
>> struct xe_bo *bo;
>>
>> ads->golden_lrc_size = calculate_golden_lrc_size(ads);
>> + ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
>> ads->regset_size = calculate_regset_size(gt);
>> ads->ads_waklv_size = calculate_waklv_size(ads);
>>
>> @@ -431,9 +477,10 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
>> xe_gt_assert(gt, ads->bo);
>>
>> ads->golden_lrc_size = calculate_golden_lrc_size(ads);
>> + ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
> alan: nit: we had an offline conversation that despite post_hw_config knowing
> exact engine class list, we simpliify and improve execution time by using
> worst-case-size in both pre and post hwconfig when deciding on size. Would be nice to
> add a comment on that? (so folks dont assume its a bug)
Sure, will do.
>
> alan:snip
>
>
>> ads->regset_size = calculate_regset_size(gt);
>>
>> - xe_gt_assert(gt, ads->golden_lrc_size +
>> + xe_gt_assert(gt, ads->golden_lrc_size + ads->capture_size +
>> (ads->regset_size - prev_regset_size) <=
>> MAX_GOLDEN_LRC_SIZE);
> alan: i missed this in rev6 - why are we using ads->capture_size to verify MAX_GOLDEN_LRC_SIZE is
> sufficient? i dont believe this macro is related to guc-error-capture reg list.
>
Yes, will remove it.
> alan:snip
>
>> +static int guc_capture_prep_lists(struct xe_guc_ads *ads)
>> +{
>> + struct xe_guc *guc = ads_to_guc(ads);
>> + struct xe_gt *gt = ads_to_gt(ads);
>> + u32 ads_ggtt, capture_offset, null_ggtt, total_size = 0;
>> + struct iosys_map info_map;
>> + size_t size = 0;
>> + void *ptr;
>> int i, j;
> alan:snip
>> + /********************************************************/
>> + /*** engine exists: start with engine-class registers ***/
>> + /********************************************************/
>> + write_empty_list = true; /* starting assumption is an empty list */
>> + size = 0;
>> + if (!xe_guc_capture_getlistsize(guc, i,
>> + GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
>> + j, &size)) {
>> + if (total_size + size > ads->capture_size)
>> + xe_gt_dbg(gt, "Capture size overflow :%lu vs %d\n",
>> + total_size + size, ads->capture_size);
> alan: i think premerge hooks are failing due to format specifiers - please do fix that.
> alan:snip
>> + if (ads->capture_size != PAGE_ALIGN(total_size))
>> + xe_gt_info(gt, "ADS capture alloc size changed from %d to %d\n",
>> + ads->capture_size, PAGE_ALIGN(total_size));
>> + return PAGE_ALIGN(total_size);
>> }
> alan:snip
>
>> +#define COMMON_XELP_BASE_GLOBAL \
>> + { FORCEWAKE_GT, 0, 0, "FORCEWAKE" }
>> +
>> +#define COMMON_BASE_ENGINE_INSTANCE \
>> + { RING_ESR(0), 0, 0, "ESR" }, \
>> + { RING_EMR(0), 0, 0, "EMR" }, \
>> + { RING_EIR(0), 0, 0, "EIR" }, \
>> + { RING_EXECLIST_STATUS_HI(0), 0, 0, "RING_EXECLIST_STATUS_HI" }, \
>> + { RING_EXECLIST_STATUS_LO(0), 0, 0, "RING_EXECLIST_STATUS_LO" }, \
>> + { RING_DMA_FADD(0), 0, 0, "RING_DMA_FADD_LDW" }, \
>> + { RING_DMA_FADD_UDW(0), 0, 0, "RING_DMA_FADD_UDW" }, \
>> + { RING_IPEHR(0), 0, 0, "IPEHR" }, \
>> + { RING_BBADDR(0), 0, 0, "RING_BBADDR_LOW32" }, \
>> + { RING_BBADDR_UDW(0), 0, 0, "RING_BBADDR_UP32" }, \
>> + { RING_ACTHD(0), 0, 0, "ACTHD_LDW" }, \
>> + { RING_ACTHD_UDW(0), 0, 0, "ACTHD_UDW" }, \
>> + { RING_START(0), 0, 0, "START" }, \
>> + { RING_HEAD(0), 0, 0, "HEAD" }, \
>> + { RING_TAIL(0), 0, 0, "TAIL" }, \
>> + { RING_CTL(0), 0, 0, "CTL" }, \
>> + { RING_MI_MODE(0), 0, 0, "MODE" }, \
>> + { RING_HWS_PGA(0), 0, 0, "HWS" }, \
>> + { RING_MODE(0), 0, 0, "GFX_MODE" }
> alan: I notice that "struct __guc_mmio_reg_descr.regname" is introduced here in
> Patch-#1 and used in the population of the static reglist table above but then
> never ever referenced in the entire series. In fact in Patch #6, its removed.
>
> NOTE: this is different from i915 where this variable was used in the reporting
> patch of i915's series but it looks like from this xe series version, we are
> creating a shared "register->offset->name" lookup table list in xe_hw_engine.c
> for both guc and execlist in Patch 6.
>
> I believe this flirts dangerously close to breaking linux patch rules. A variable/
> func not used in the entire series really shouldnt be added. In this case, we are
> NOT consuming regname at all anywhere in the series. And then we remove it in the
> subsequent patch-6.
Yes, will get it removed in next patch.
Meanwhile, as there are 2 register list exist in capture and
hw_engine.c, so I'm also consider to merge them. It could be done in
separate patch.
>
> alan:snip
>
>
>
>> +static int
>> +guc_capture_getlistsize(struct xe_guc *guc, u32 owner, u32 type, u32 classid,
>> + size_t *size, bool is_purpose_est)
>> +{
>> + struct xe_guc_state_capture *gc = guc->capture;
>> + struct __guc_capture_ads_cache *cache = &gc->ads_cache[owner][type][classid];
>> + int num_regs;
>> +
>> + if (!gc->reglists) {
>> + xe_gt_warn(guc_to_gt(guc), "No capture reglist for this device\n");
>> + return -ENODEV;
>> + }
>> +
>> + if (cache->is_valid) {
>> + *size = cache->size;
>> + return cache->status;
>> + }
>> +
>> + if (!is_purpose_est && owner == GUC_CAPTURE_LIST_INDEX_PF &&
> alan: so this one took me a while to figure out why it has become orphaned.
> (after i went through the entire series hunting for it). I'm referign to "is_purpose_est".
>
> So this is another i915 inherittance which did have valid reasoning. In i915 when we
> were doing "check_guc_capture_size" (which is a one-time event), we wanted to flag if
> any of the register tables were non existent. take note that at this early log-buffer-
> calculation time, we assumed all engine classes are valid on the platform at
> pre-hw-config time. The reason is because we did actually discover cases when new
> platforms added new engines and the developer completely missed adding the register
> table for error capture as well. With this message, it would get flagged as a warning.
> However, this same lower level function was also used for the regular register
> list population which can get called everytime we reset guc including coming
> out of suspend or if we performed a GT reset and so we didn't want that warning
> to happen every single time - only once per module load. so thats why we
> created a wraper with that "is_purpose_est" to differentate if the caller
> was for the log-buffer-size-estmation or for actual population where we only
> print the warn in the former.
>
> so my review comment would be: we already learnt the lessons from i915 so lets
> use this properly. i suggest we fix patch #3 so that check_guc_capture_size
> will call above guc_capture_getlistsize with is_purpose_est set true as opposed to
> calling xe_calculate_guc_regs_capture_worst_size. I'll make the same comment
> in Patch3
Thanks for point out. Will be fixed in patch 3
>
> alan:snip
>
>> +
>> +#endif /* _XE_GUC_CAPTURE_H */
>> diff --git a/drivers/gpu/drm/xe/xe_guc_capture_fwif.h b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
>> new file mode 100644
>> index 000000000000..79bc277afaa8
>> --- /dev/null
> alan:snip
>
>> +struct guc_state_capture_group_header_t {
>> + u32 owner;
> alan: If i read the the spec correctly, capture_group_header's owner
> has VFID field at GENMASK(7,0).
Will add.
>> + u32 info;
>> +#define CAP_GRP_HDR_NUM_CAPTURES GENMASK(7, 0)
>> +#define CAP_GRP_HDR_CAPTURE_TYPE GENMASK(15, 8) /* guc_capture_group_types */
>> +} __packed;
> alan:snip
>
More information about the Intel-xe
mailing list