[Intel-gfx] [PATCH v13 00/13] Add GuC Error Capture Support
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Mar 23 13:19:53 UTC 2022
Hi,
On 21/03/2022 16:45, Alan Previn wrote:
> This series:
> 1. Enables support of GuC to report error-state-capture
> using a list of MMIO registers the driver registers
> and GuC will dump, log and notify right before a GuC
> triggered engine-reset event.
> 2. Updates the ADS blob creation to register said lists
> of global, engine class and engine instance registers
> with GuC.
> 3. Defines tables of register lists that are global or
> engine class or engine instance in scope.
> 4. Updates usage and buffer-state data for the regions
> of the shared GuC log-buffer to accomdate both
> the existing relay logging of general debug logs
> along with the new error state capture usage.
> 5. Using a pool of preallocated memory, provide ability
> to extract and format the GuC reported register-capture
> data into chunks consistent with existing i915 error-
> state collection flows and structures.
> 6. Connects the i915_gpu_coredump reporting function
> to the GuC error capture module to print all GuC
> error state capture dumps that is reported.
Story is finished with this series and we have everything on feature
parity? Intel_error_decode handles it fine?
Would you have an example error capture at hand, I am curious how it looks?
Regards,
Tvrtko
>
> This is the 13th rev of this series where the first 3 revs
> are RFC
>
> Prior receipts of rvb's:
> - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
> R-v-b's from Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
> <matthew.brost at intel.com>. NOTE: some of these came in on the
> trybot series. https://patchwork.freedesktop.org/series/100831/
>
> Changes from prior revs:
> v13:- Fixing register list definition styling as per Jani's request.
> v12:- Re-sending it because previous revs only got to intel-gfx,
> and only cover letter was in dri-devel. Also rebased again.
> v11:- Rebase again on latest drm-tip to fix merge error.
> v10:- Rebase on latest drm-tip again. Fix a number of checkpatch
> warnings and an error Reported-by: kernel test robot <lkp at intel.com>.
> v9: - Rebase on latest drm-tip to solve CI merge-build error.
> v8: - Fix a bug found by CI in rev7: Create a cached ADS
> capture list for null-header like the other lists.
> - Fixed a bug on the ggtt offset calculation in the
> ADS population loop. Thanks to Matt Brost.
> - Change the storage uses for initial allocation and
> caching of the ADS register lists so we only store
> a regular pointer instead of file handle.
> - Multiple improvements on code styling, variable names,
> comments and code reduction from Umesh suggestions
> across multiple patches.
>
> v7: - Rebased on lastest drm_tip that has the ADS now using
> shmem based ads_blob_write utilities. Stress test
> was performed with this patch included to fix a
> legacy bug:
> https://patchwork.freedesktop.org/series/100768/
>
> v6: - In patch #1, ADS reg-list population, we now alloc
> regular memory to create the lists and cache them for
> simpler and faster use by GuC ADS module at init,
> suspend-resume and reset cycles. This was in response
> to review comments from Lucas De Marchi that also
> wanted to ensure the GuC ADS module owns the final
> copying into the ADS phyical memory.
> - Thanks to Jani Nikula for pointing out that patch #2
> and #3 should ensure static tables as constant and
> dynamic lists should be allocated and cached but
> attached to the GT level for the case of multiple
> cards with different fusings for steered registers.
> These are addressed now along with multiple code
> style fixups (thanks to review comment from Umesh)
> and splitting the steered register list generation
> as a seperate patch.
> - The extraction functionality, Patch #10 and #11 (was
> patch #7), has fixed all of Umesh's review comments
> related to the code styling. Additionally, it was
> discovered during stress tests that the extraction
> function could be called by the ct processing thread
> at the same time as the start of a GT reset event.
> Thus, a redesign was done whereby the linked list of
> processed capture-output-nodes are allocated up
> front and reused throughout the driver's life to
> ensure no memory locks are taken during extraction.
> - For patch #6 (now 7, 8 and 9), updates to
> intel_guc_log was split into smaller chunks and the
> log_state structure was returned back to inside of
> the intel_guc_log struct as opposed to the
> intel_guc struct in prior rev. This is in response
> to review comments by Matt Brost.
> - #Patch 13 (previously #10) is mostly identical but
> addresses all of the code styling comments reviews
> from Umesh.
>
> v5: - Added Gen9->Gen11 register list for CI coverage that
> included Gen9 with GuC submission.
> - Redesigned the extraction of the GuC error-capture
> dumps by grouping them into complete per-engine-reset
> nodes. Complete here means each node includes the
> global, engine-class and engine-instance register
> lists in a single structure.
> - Extraction is decoupled from the print-out. We now
> do the extraction immediately when receiving the
> G2H for error-capture notification. A link list of
> nodes is maintained with a FIFO based threshold
> while awaiting retrieval from i915_gpu_coredump's
> capture_engine function.
> - Added new plumbing through the i915_gpu_coredump
> allocation and capture functions to include a flag
> that is used indicate that GuC had triggered the
> reset. This new plumbing guarantees an exact match
> from i915_gpu_coredump's per-engine vma recording
> and node-retrieval from the guc-error-capture.
> - Broke the coredump gt_global capture and recording
> functions into smaller subsets so we can reuse as
> much of the existing legacy register reading + printing
> functions and only rely on GuC error-capture for
> the smaller subset of registers that are tied to
> engine workload execution.
> - Updated the register list to follow the legacy execlist
> format of printout.
> v4:
> - Rebased on latest drm-tip that has been merged with the
> support of GuC firmware version 69.0.3 that is required
> for GuC error-state-catpure to work.
> - Added register list for DG2 which is the same as XE_LP
> except an additional steering register set.
> - Fixed a bug in the end of capture parsing loop in
> intel_guc_capture_out_print_next_group that was not
> properly comparing the engine-instance and engine-
> class being parsed against the one that triggered
> the i915_gpu_coredump.
> v3:
> - Fixed all review comments from rev2 except the following:
> - Michal Wajdeczko proposed adding a seperate function
> to lookup register string nameslookup (based on offset)
> but decided against it because of offset conflicts
> and the current table layout is easier to maintain.
> - Last set of checkpatch errors pertaining to "COMPLEX
> MACROS" should be fixed on next rev.
> - Abstracted internal-to-guc-capture information into a new
> __guc_state_capture_priv structure that allows the exclusion
> of intel_guc.h and intel_guc_fwif.h from intel_guc_capture.h.
> Now, only the first 2 patches have a wider build time
> impact because of the changes to intel_guc_fwif.h but
> subsequent changes to guc-capture internal structures
> or firmware interfaces used solely by guc-capture module
> shoudn't impact the rest of the driver build.
> - Added missing Gen12LP registers and added slice+subslice
> indices when reporting extended steered registers.
> - Add additional checks to ensure that the GuC reported
> error capture information matches the i915_gpu_coredump
> that is being printed before we print out the corresponding
> VMA dumps such as the batch buffer.
> v2:
> - Ignore - failed CI retest.
>
> Alan Previn (13):
> drm/i915/guc: Update GuC ADS size for error capture lists
> drm/i915/guc: Add XE_LP static registers for GuC error capture.
> drm/i915/guc: Add XE_LP steered register lists support
> drm/i915/guc: Add DG2 registers for GuC error state capture.
> drm/i915/guc: Add Gen9 registers for GuC error state capture.
> drm/i915/guc: Add GuC's error state capture output structures.
> drm/i915/guc: Update GuC-log relay function names
> drm/i915/guc: Add capture region into intel_guc_log
> drm/i915/guc: Check sizing of guc_capture output
> drm/i915/guc: Extract GuC error capture lists on G2H notification.
> drm/i915/guc: Pre-allocate output nodes for extraction
> drm/i915/guc: Plumb GuC-capture into gpu_coredump
> drm/i915/guc: Print the GuC error capture output register list.
>
> drivers/gpu/drm/i915/Makefile | 1 +
> drivers/gpu/drm/i915/gt/intel_engine_cs.c | 4 +-
> .../drm/i915/gt/intel_execlists_submission.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_reset.c | 2 +-
> .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h | 7 +
> drivers/gpu/drm/i915/gt/uc/guc_capture_fwif.h | 218 +++
> drivers/gpu/drm/i915/gt/uc/intel_guc.c | 13 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc.h | 12 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 127 +-
> .../gpu/drm/i915/gt/uc/intel_guc_capture.c | 1655 +++++++++++++++++
> .../gpu/drm/i915/gt/uc/intel_guc_capture.h | 33 +
> drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h | 14 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc_log.c | 127 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc_log.h | 7 +-
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 +-
> drivers/gpu/drm/i915/i915_debugfs.c | 3 +-
> drivers/gpu/drm/i915/i915_gpu_error.c | 282 ++-
> drivers/gpu/drm/i915/i915_gpu_error.h | 35 +-
> 18 files changed, 2379 insertions(+), 183 deletions(-)
> create mode 100644 drivers/gpu/drm/i915/gt/uc/guc_capture_fwif.h
> create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
>
More information about the Intel-gfx
mailing list