[PATCH v7 5/7] drm/xe/guc: Extract GuC error capture lists
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Fri Apr 19 18:50:27 UTC 2024
On Wed, 2024-03-27 at 13:40 -0700, Zhanjun Dong wrote:
> Upon the G2H Notify-Err-Capture event, parse through the
> GuC Log Buffer (error-capture-subregion) and generate one or
> more capture-nodes. A single node represents a single "engine-
> instance-capture-dump" and contains at least 3 register lists:
> global, engine-class and engine-instance. An internal link
> list is maintained to store one or more nodes.
alan:snip
> + * GUC --> notify context reset:
> + * -----------------------------
> + * --> G2H CONTEXT RESET
> + * L--> guc_handle_context_reset --> i915_capture_error_state
alan: i assume "i915_..." was a copy+paste typo - this should be an xe-kmd specific function name
> + * L--> devcoredump_snapshot(..IS_GUC_CAPTURE)
> + * --> xe_hw_engine_snapshot_capture(..IS_GUC_CAPTURE)
> + * L--> xe_hw_engine_snapshot_from_capture is where
> + * detach C from internal linked list and add it into
> + * xe_hw_engine_snapshot struct (if the context and
> + * engine of the event notification matches a node
> + * in the link list).
> + *
> + * User Sysfs / Debugfs
> + * --------------------
> + * --> xe_devcoredump_read->
> + * L--> xxx_snapshot_print
alan: maybe provide a proper name instead of the xxx.
> + * L--> xe_hw_engine_snapshot_print
> + * register lists values of the xe_hw_engine_snapshot
> + * saved from the error-engine-dump.
> + *
> + */
> +
alan:snip
> +static void
> +guc_capture_add_node_to_cachelist(struct xe_guc_state_capture *gc,
> + struct __guc_capture_parsed_output *node)
> +{
> + guc_capture_add_node_to_list(node, &gc->cachelist);
> +}
alan:cachelist functionality should only be introduced in patch #6
alan:snip
> +static struct __guc_capture_parsed_output *
> +guc_capture_get_prealloc_node(struct xe_guc *guc)
> +{
> + struct __guc_capture_parsed_output *found = NULL;
alan: again, this function is for using pre-allocated node in cacheslist
which would need to get introduced as part of patch #6. That said, this
patch should be starting with dynamic allocation and freeing of single
nodes per use so this patch would need to include "guc_capture_alloc_one_node"
and "guc_capture_delete_one_node" where the former is called instead of
guc_capture_get_prealloc_node (which is added later in patch 6 along with
cachelist usage). Also, without pre-allocated nodes, this patch needs
to also allocate the node->reginfo[datatype].regs.
NOTE: With preallocated nodes+cachelist coming in later patch and
no actual alocation in this patch, guc_capture_get_prealloc_node
will always fail in this patch. You could follow how the patches
were organized in i915, or, alternatively you can maybe pull in the
preallocation earlier as a patch 5 and this extraction patch
becomes patch 6.
alan:snip
> +static int
> +guc_capture_extract_reglists(struct xe_guc *guc, struct __guc_capture_bufstate *buf)
> +{
> + struct guc_state_capture_group_header_t ghdr = {0};
> + struct guc_state_capture_header_t hdr = {0};
> + struct __guc_capture_parsed_output *node = NULL;
alan:snip
> + } else if (node) {
> + /*
> + * Based on the current capture type and what we have so far,
> + * decide if we should add the current node into the internal
> + * linked list for match-up when i915_gpu_coredump calls later
alan: replace i915_.. with xe_..
alan:snip
> + numregs = FIELD_GET(CAP_HDR_NUM_MMIOS, hdr.num_mmios);
> + if (numregs > guc->capture->max_mmio_per_node) {
> + xe_gt_dbg(guc_to_gt(guc), "Register capture list extraction clipped by prealloc!\n");
> + numregs = guc->capture->max_mmio_per_node;
> + }
> + node->reginfo[datatype].num_regs = numregs;
> + regs = node->reginfo[datatype].regs;
alan: as explained above, if we introduce the preallocation in next patch, then
this "regs" also need to be allocated.
alan:snip
> +bailout:
> + if (node) {
> + /* If we have data, add to linked list for match-up when i915_gpu_coredump calls */
> + for (i = GUC_CAPTURE_LIST_TYPE_GLOBAL; i < GUC_CAPTURE_LIST_TYPE_MAX; ++i) {
> + if (node->reginfo[i].regs) {
> + guc_capture_add_node_to_outlist(guc->capture, node);
> + node = NULL;
> + break;
> + }
> + }
> + if (node) /* else return it back to cache list */
> + guc_capture_add_node_to_cachelist(guc->capture, node);
alan: repeat: cachelist hunks should be in patch 6.
alan: snip
> +static void __guc_capture_process_output(struct xe_guc *guc)
> +{
> + unsigned int buffer_size, read_offset, write_offset, full_count;
> + struct xe_uc *uc = container_of(guc, typeof(*uc), guc);
> + struct guc_log_buffer_state log_buf_state_local;
> + struct guc_log_buffer_state *log_buf_state;
> + struct __guc_capture_bufstate buf;
> + void *src_data = NULL;
> + bool new_overflow;
> + int ret;
> +
> + log_buf_state = (struct guc_log_buffer_state *)((ulong)guc->log.bo->vmap.vaddr +
> + (sizeof(struct guc_log_buffer_state) * GUC_CAPTURE_LOG_BUFFER));
> + src_data = (struct guc_log_buffer_state *)((ulong)guc->log.bo->vmap.vaddr +
> + xe_guc_get_log_buffer_offset(&guc->log, GUC_CAPTURE_LOG_BUFFER));
alan: I believe if we are not allowed to directly use the vmap.vaddr in xe-kmd. Aren't
we only supposed to use the iosys_map functions? or is there an exception to this rule if
we do additional pinning/mapping elsewhere? Actually, i realize that on this topic,
this patch would also need the same iosys_map fix for "guc_capture_data_extracted"
and "guc_capture_log_remove_dw" which shouldnt be using the virt address directly.
This is probably the only part of the lower level FW-interface-specific extraction
code that will differ from i915 only because of the bo virt-address access
method rules of xe-kmd.
alan:snip
> +
> + /*
> + * Make a copy of the state structure, inside GuC log buffer
> + * (which is uncached mapped), on the stack to avoid reading
> + * from it multiple times.
> + */
> + memcpy(&log_buf_state_local, log_buf_state, sizeof(struct guc_log_buffer_state));
alan: same thing for the state as well - are we only supposed to be using iosys_map helpers?
(lets' connect offline - this will depend on how we are allocating and mapping the log buffer
bo which contains the guc-err-state-capture region).
> +
> + buffer_size = xe_guc_get_log_buffer_size(&guc->log, GUC_CAPTURE_LOG_BUFFER);
> + read_offset = log_buf_state_local.read_ptr;
> + write_offset = log_buf_state_local.sampled_write_ptr;
> + full_count = log_buf_state_local.buffer_full_cnt;
alan:snip
> -static int guc_read_stopped(struct xe_guc *guc)
> +int guc_read_stopped(struct xe_guc *guc)
alan: when promoting from static to something that is exported / used across subsystems,
then u might need to change the name to include the "xe_" prefix
> {
> return atomic_read(&guc->submission_state.stopped);
alan:snip
More information about the Intel-xe
mailing list