[PATCH v7 5/7] drm/xe/guc: Extract GuC error capture lists

Fri Apr 19 18:50:27 UTC 2024

On Wed, 2024-03-27 at 13:40 -0700, Zhanjun Dong wrote:
> Upon the G2H Notify-Err-Capture event, parse through the
> GuC Log Buffer (error-capture-subregion) and generate one or
> more capture-nodes. A single node represents a single "engine-
> instance-capture-dump" and contains at least 3 register lists:
> global, engine-class and engine-instance. An internal link
> list is maintained to store one or more nodes.
alan:snip

> + * GUC --> notify context reset:
> + * -----------------------------
> + *     --> G2H CONTEXT RESET
> + *                   L--> guc_handle_context_reset --> i915_capture_error_state
alan: i assume "i915_..." was a copy+paste typo - this should be an xe-kmd specific function name

> + *                          L--> devcoredump_snapshot(..IS_GUC_CAPTURE)
> + *                               --> xe_hw_engine_snapshot_capture(..IS_GUC_CAPTURE)
> + *                               L--> xe_hw_engine_snapshot_from_capture is where
> + *                                    detach C from internal linked list and add it into
> + *                                    xe_hw_engine_snapshot struct (if the context and
> + *                                    engine of the event notification matches a node
> + *                                    in the link list).
> + *
> + * User Sysfs / Debugfs
> + * --------------------
> + *      --> xe_devcoredump_read->
> + *             L--> xxx_snapshot_print
alan: maybe provide a proper name instead of the xxx.
> + *                    L--> xe_hw_engine_snapshot_print
> + *                         register lists values of the xe_hw_engine_snapshot
> + *                         saved from the error-engine-dump.
> + *
> + */
> +
alan:snip


> +static void
> +guc_capture_add_node_to_cachelist(struct xe_guc_state_capture *gc,
> +                                 struct __guc_capture_parsed_output *node)
> +{
> +       guc_capture_add_node_to_list(node, &gc->cachelist);
> +}
alan:cachelist functionality should only be introduced in patch #6
alan:snip

> +static struct __guc_capture_parsed_output *
> +guc_capture_get_prealloc_node(struct xe_guc *guc)
> +{
> +       struct __guc_capture_parsed_output *found = NULL;
alan: again, this function is for using pre-allocated node in cacheslist
which would need to get introduced as part of patch #6. That said, this
patch should be starting with dynamic allocation and freeing of single
nodes per use so this patch would need to include "guc_capture_alloc_one_node"
and "guc_capture_delete_one_node" where the former is called instead of
guc_capture_get_prealloc_node (which is added later in patch 6 along with
cachelist usage). Also, without pre-allocated nodes, this patch needs
to also allocate the node->reginfo[datatype].regs.
NOTE: With preallocated nodes+cachelist coming in later patch and
no actual alocation in this patch, guc_capture_get_prealloc_node
will always fail in this patch. You could follow how the patches
were organized in i915, or, alternatively you can maybe pull in the
preallocation earlier as a patch 5 and this extraction patch
becomes patch 6.
alan:snip

> +static int
> +guc_capture_extract_reglists(struct xe_guc *guc, struct __guc_capture_bufstate *buf)
> +{
> +       struct guc_state_capture_group_header_t ghdr = {0};
> +       struct guc_state_capture_header_t hdr = {0};
> +       struct __guc_capture_parsed_output *node = NULL;
alan:snip

> +               } else if (node) {
> +                       /*
> +                        * Based on the current capture type and what we have so far,
> +                        * decide if we should add the current node into the internal
> +                        * linked list for match-up when i915_gpu_coredump calls later
alan: replace i915_.. with xe_..
alan:snip

> +               numregs = FIELD_GET(CAP_HDR_NUM_MMIOS, hdr.num_mmios);
> +               if (numregs > guc->capture->max_mmio_per_node) {
> +                       xe_gt_dbg(guc_to_gt(guc), "Register capture list extraction clipped by prealloc!\n");
> +                       numregs = guc->capture->max_mmio_per_node;
> +               }
> +               node->reginfo[datatype].num_regs = numregs;
> +               regs = node->reginfo[datatype].regs;
alan: as explained above, if we introduce the preallocation in next patch, then
this "regs" also need to be allocated.
alan:snip
> +bailout:
> +       if (node) {
> +               /* If we have data, add to linked list for match-up when i915_gpu_coredump calls */
> +               for (i = GUC_CAPTURE_LIST_TYPE_GLOBAL; i < GUC_CAPTURE_LIST_TYPE_MAX; ++i) {
> +                       if (node->reginfo[i].regs) {
> +                               guc_capture_add_node_to_outlist(guc->capture, node);
> +                               node = NULL;
> +                               break;
> +                       }
> +               }
> +               if (node) /* else return it back to cache list */
> +                       guc_capture_add_node_to_cachelist(guc->capture, node);
alan: repeat: cachelist hunks should be in patch 6.
alan: snip

> +static void __guc_capture_process_output(struct xe_guc *guc)
> +{
> +       unsigned int buffer_size, read_offset, write_offset, full_count;
> +       struct xe_uc *uc = container_of(guc, typeof(*uc), guc);
> +       struct guc_log_buffer_state log_buf_state_local;
> +       struct guc_log_buffer_state *log_buf_state;
> +       struct __guc_capture_bufstate buf;
> +       void *src_data = NULL;
> +       bool new_overflow;
> +       int ret;
> +
> +       log_buf_state = (struct guc_log_buffer_state *)((ulong)guc->log.bo->vmap.vaddr +
> +                       (sizeof(struct guc_log_buffer_state) * GUC_CAPTURE_LOG_BUFFER));
> +       src_data = (struct guc_log_buffer_state *)((ulong)guc->log.bo->vmap.vaddr +
> +                  xe_guc_get_log_buffer_offset(&guc->log, GUC_CAPTURE_LOG_BUFFER));
alan: I believe if we are not allowed to directly use the vmap.vaddr in xe-kmd. Aren't
we only supposed to use the iosys_map functions? or is there an exception to this rule if
we do additional pinning/mapping elsewhere? Actually, i realize that on this topic,
this patch would also need the same iosys_map fix for "guc_capture_data_extracted"
and "guc_capture_log_remove_dw" which shouldnt be using the virt address directly.
This is probably the only part of the lower level FW-interface-specific extraction
code that will differ from i915 only because of the bo virt-address access
method rules of xe-kmd.
alan:snip

> +
> +       /*
> +        * Make a copy of the state structure, inside GuC log buffer
> +        * (which is uncached mapped), on the stack to avoid reading
> +        * from it multiple times.
> +        */
> +       memcpy(&log_buf_state_local, log_buf_state, sizeof(struct guc_log_buffer_state));
alan: same thing for the state as well - are we only supposed to be using iosys_map helpers?
(lets' connect offline - this will depend on how we are allocating and mapping the log buffer
bo which contains the guc-err-state-capture region).
> +
> +       buffer_size = xe_guc_get_log_buffer_size(&guc->log, GUC_CAPTURE_LOG_BUFFER);
> +       read_offset = log_buf_state_local.read_ptr;
> +       write_offset = log_buf_state_local.sampled_write_ptr;
> +       full_count = log_buf_state_local.buffer_full_cnt;

alan:snip

> -static int guc_read_stopped(struct xe_guc *guc)
> +int guc_read_stopped(struct xe_guc *guc)
alan: when promoting from static to something that is exported / used across subsystems,
then u might need to change the name to include the "xe_" prefix
>  {
>         return atomic_read(&guc->submission_state.stopped);
alan:snip