[PATCH v5] drm/xe: Add driver load error injection

Rodrigo Vivi rodrigo.vivi at intel.com
Wed Sep 11 20:48:11 UTC 2024


On Wed, Sep 11, 2024 at 12:40:04PM +0200, Francois Dugast wrote:
> On Tue, Sep 10, 2024 at 04:33:21PM -0500, Lucas De Marchi wrote:
> > On Tue, Sep 10, 2024 at 05:11:34PM GMT, Rodrigo Vivi wrote:
> > > On Tue, Sep 10, 2024 at 05:22:41PM +0200, Francois Dugast wrote:
> > > > Those new macros inject errors by overriding return codes. They must
> > > > manually be called, preferably at the very beginning of the function
> > > > that will fault, otherwise if not possible by turning this pattern:
> > > > 
> > > >     err = foo();
> > > >     if (err)
> > > >         return err;
> > > > 
> > > > into:
> > > > 
> > > >     err = foo();
> > > >     err = xe_device_inject_driver_probe_error(xe, err);
> > > >     if (err)
> > > >         return err;
> > > > 
> > > > When CONFIG_DRM_XE_DEBUG is not set, this has no effect.
> > > > 
> > > > When CONFIG_DRM_XE_DEBUG is set, the error code at checkpoint X will
> > > > be overridden when the module argument inject_driver_load_error is
> > > > set to value X. By doing so, it is possible to test proper error
> > > > handling and improve robustness for current and future code. A few
> > > > injection points are added in this patch but more need to be added.
> > > > One way to use this error injection at driver probe is:
> > > > 
> > > >     for i in {1..200}; do
> > > >         echo "Run $i"
> > > >         modprobe xe inject_driver_probe_error=$i;
> > > >         rmmod xe;
> > > >     done
> > > 
> > > can we have an IGT test so we ensure that CI is tracking and we are working
> > > to close the existing issues?
> > 
> > yeah.. that would be great. I think it would make more sense to use
> > bind/unbind in igt.

Hmm... but that would require a deferred_probe and then the bind to force the reprobe...
kind of complicate things here...

> > 
> > > 
> > > > 
> > > > In the future this is expected to be replaced by the infrastructure
> > > > provided by fault-inject.h
> > > 
> > > I was taking a look at the fault-inject again. It could easily be a
> > > global fault_attr with a module sysfs entry, then during the test
> > > you load the module, then unbind the device, then change the fault-inject
> > > probability and time and then bind it back what will reprobe, but now
> > > with the fault-injected.
> > > 
> > > The only problem with the fault-inject idea is that it would require
> > > a very granular thing with multiple fault_attr, one per failure.
> > 
> > when going with a real fault-injection, I'd actually try to cover it per
> > function as described here:
> > 
> > https://docs.kernel.org/fault-injection/fault-injection.html
> > 	/sys/kernel/debug/fail_function/inject:
> > 
> > 	Format: { ‘function-name’ | ‘!function-name’ | ‘’ }
> > 
> > 	specifies the target function of error injection by name. If the
> > 	function name leads ‘!’ prefix, given function is removed from injection
> > 	list. If nothing specified (‘’) injection list is cleared.
> > 
> > Integration via ALLOW_ERROR_INJECTION() is similar to the
> > KUNIT_STATIC_STUB_REDIRECT() we already use.
> > 
> > In my review I didn't bother to go with fault-inject directly because we
> > will probably need to refactor the code so the failure points are in
> > their own functions. Something we don't have today. Short term it's
> > important to fix the current/unknown problems. Mid term we can convert
> > things piece meal.
> > 
> > Are we on the same page?
> 
> It is also my intention with this patch, get something in with minimal risk
> and changes so we can soon focus on solving potential issues it highlights.
> 
> In parallel I am preparing a RFC based on fault-inject with a proposal how
> we can use fail_function with a few real examples from our code that we can
> take more time to discuss thoroughly.

I'm also on the same page. Let's do it.

But we need to at least:
1. fix the documentation return statement
2. fix checkpatch on module_param_named_unsafe huge line
3. IGT ?!

> 
> Francois
> 
> > 
> > > But at least this really ensures that we are really testing all the cases
> > > with more reliability.
> > > 
> > > I just realized that this i915-style probe injection might have an issue
> > > on platforms with discrete platforms. Well, the pci subsystem won't
> > 
> > one more reason to go with the bind/unbind. Then you control where it's
> > happening and where.
> > 
> > Lucas De Marchi
> > 
> > > probe in parallel, and likely it will be the same order of probe on
> > > every module load, but if it doesn't the Nth point of the failure
> > > won't be the same everytime, so in every load you might stop in a
> > > different device and end up with not covering every single entry.
> > > Unlikely I know... And I don't believe this should be a blocker
> > > to move forward with something...
> > > 
> > > (more below)
> > > 
> > > > 
> > > > v2: Fix style and build errors, modparam to 0 after probe, rename to
> > > >     xe_device_inject_driver_probe_error, check type when compiled out,
> > > >     add _return macro, move some uses to the beginning of the function
> > > > v3: Rebase
> > > > v4: Improve commit message and comments, keep if/return rather than
> > > >     change the flow inside the macro (Lucas De Marchi)
> > > > v5: Rebase, add comments, keep existing return points (Lucas De Marchi)
> > > >     Add finish wrapper, move to function beginning for all xe functions
> > > >     (Michal Wajdeczko) Bolt into i915 error injection (Jani Nikula)
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > Signed-off-by: Francois Dugast <francois.dugast at intel.com>
> > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/display/ext/i915_utils.c |  4 +-
> > > >  drivers/gpu/drm/xe/xe_device.c              | 48 +++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/xe_device.h              | 30 +++++++++++++
> > > >  drivers/gpu/drm/xe/xe_device_types.h        |  5 +++
> > > >  drivers/gpu/drm/xe/xe_gt_sriov_pf_service.c |  5 +++
> > > >  drivers/gpu/drm/xe/xe_guc.c                 |  1 +
> > > >  drivers/gpu/drm/xe/xe_guc_ct.c              |  1 +
> > > >  drivers/gpu/drm/xe/xe_guc_pc.c              |  4 ++
> > > >  drivers/gpu/drm/xe/xe_mmio.c                |  5 +++
> > > >  drivers/gpu/drm/xe/xe_module.c              | 17 ++++++++
> > > >  drivers/gpu/drm/xe/xe_module.h              |  3 ++
> > > >  drivers/gpu/drm/xe/xe_pci.c                 |  5 +++
> > > >  drivers/gpu/drm/xe/xe_pm.c                  |  5 +++
> > > >  drivers/gpu/drm/xe/xe_sriov.c               |  7 ++-
> > > >  drivers/gpu/drm/xe/xe_sriov_pf.c            |  6 +++
> > > >  drivers/gpu/drm/xe/xe_tile.c                | 13 ++++++
> > > >  drivers/gpu/drm/xe/xe_uc.c                  |  4 ++
> > > >  drivers/gpu/drm/xe/xe_wa.c                  |  8 +++-
> > > >  drivers/gpu/drm/xe/xe_wopcm.c               |  7 ++-
> > > >  19 files changed, 172 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/display/ext/i915_utils.c b/drivers/gpu/drm/xe/display/ext/i915_utils.c
> > > > index 43b10a2cc508..11d8377a125f 100644
> > > > --- a/drivers/gpu/drm/xe/display/ext/i915_utils.c
> > > > +++ b/drivers/gpu/drm/xe/display/ext/i915_utils.c
> > > > @@ -4,6 +4,7 @@
> > > >   */
> > > > 
> > > >  #include "i915_drv.h"
> > > > +#include "xe_device.h"
> > > > 
> > > >  bool i915_vtd_active(struct drm_i915_private *i915)
> > > >  {
> > > > @@ -16,11 +17,10 @@ bool i915_vtd_active(struct drm_i915_private *i915)
> > > > 
> > > >  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
> > > > 
> > > > -/* i915 specific, just put here for shutting it up */
> > > >  int __i915_inject_probe_error(struct drm_i915_private *i915, int err,
> > > >  			      const char *func, int line)
> > > >  {
> > > > -	return 0;
> > > > +	return __xe_device_inject_driver_probe_error(i915, err, 0, func, line);
> > > >  }
> > > > 
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > index 449b85035d3a..f22d94ff302e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > @@ -319,6 +319,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
> > > >  	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > > >  			      xe->drm.anon_inode->i_mapping,
> > > >  			      xe->drm.vma_offset_manager, false, false);
> > > > +	err = xe_device_inject_driver_probe_error_override(xe, err);
> > > >  	if (WARN_ON(err))
> > > >  		goto err;
> > > > 
> > > > @@ -477,6 +478,7 @@ static int xe_set_dma_info(struct xe_device *xe)
> > > >  		goto mask_err;
> > > > 
> > > >  	err = dma_set_coherent_mask(xe->drm.dev, DMA_BIT_MASK(mask_size));
> > > > +	err = xe_device_inject_driver_probe_error_override(xe, err);
> > > >  	if (err)
> > > >  		goto mask_err;
> > > > 
> > > > @@ -498,6 +500,11 @@ static int wait_for_lmem_ready(struct xe_device *xe)
> > > >  {
> > > >  	struct xe_gt *gt = xe_root_mmio_gt(xe);
> > > >  	unsigned long timeout, start;
> > > > +	int err;
> > > > +
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > 
> > > >  	if (!IS_DGFX(xe))
> > > >  		return 0;
> > > > @@ -750,6 +757,8 @@ int xe_device_probe(struct xe_device *xe)
> > > >  	for_each_gt(gt, xe, id)
> > > >  		xe_gt_sanitize_freq(gt);
> > > > 
> > > > +	xe_device_inject_driver_probe_error_finish();
> > > > +
> > > >  	return devm_add_action_or_reset(xe->drm.dev, xe_device_sanitize, xe);
> > > > 
> > > >  err_fini_display:
> > > > @@ -1000,3 +1009,42 @@ void xe_device_declare_wedged(struct xe_device *xe)
> > > >  	for_each_gt(gt, xe, id)
> > > >  		xe_gt_declare_wedged(gt);
> > > >  }
> > > > +
> > > > +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > > > +/**
> > > > + * __xe_device_inject_driver_probe_error - Inject an error during device probe
> > > > + * @xe: xe device instance
> > > > + * @err_injected: the error to inject
> > > > + * @err_real: the error returned by the actual function
> > > > + * @func: the name of the function where this is called from
> > > > + * @line: the line where this is called from
> > > > + *
> > > > + * This is not meant to be called directly, only through xe_device_inject_driver_probe_error.
> > > > + *
> > > > + * Return: err_real if != 0, err_injected otherwise
> > > 
> > > Not just otherwise....
> > > 
> > > Return 0 if this is not the Nth iteration of the requested iterations from
> > > modparam.inject_driver_probe_error
> > > 
> > > Return err_injected if in the Nth iteration...
> > > 
> > > > + */
> > > > +int __xe_device_inject_driver_probe_error(struct xe_device *xe, int err_injected, int err_real,
> > > > +					  const char *func, int line)
> > > > +{
> > > > +	if (err_real != 0)
> > > > +		return err_real;
> > > > +
> > > > +	if (xe->inject_driver_probe_error >= xe_modparam.inject_driver_probe_error)
> > > > +		return 0;
> > > > +
> > > > +	if (++xe->inject_driver_probe_error < xe_modparam.inject_driver_probe_error)
> > > > +		return 0;
> > > > +
> > > > +	drm_info(&xe->drm, "Injecting failure %d at checkpoint %u [%s:%d]\n",
> > > > +		 err_injected, xe->inject_driver_probe_error, func, line);
> > > > +
> > > > +	xe_modparam.inject_driver_probe_error = 0;
> > > > +	return err_injected;
> > > > +}
> > > > +
> > > > +void __xe_device_inject_driver_probe_error_finish(void)
> > > > +{
> > > > +	/* After probe finishes, stop checking for error injection */
> > > > +	xe_modparam.inject_driver_probe_error = 0;
> > > > +}
> > > > +#endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > index 894f04770454..c410e55b6b09 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > @@ -178,4 +178,34 @@ void xe_device_declare_wedged(struct xe_device *xe);
> > > >  struct xe_file *xe_file_get(struct xe_file *xef);
> > > >  void xe_file_put(struct xe_file *xef);
> > > > 
> > > > +#define XE_DEVICE_INJECTED_ERR -ENODEV
> > > > +#define xe_device_inject_driver_probe_error(__xe) \
> > > > +	__xe_device_inject_driver_probe_error(__xe, XE_DEVICE_INJECTED_ERR, 0, __func__, __LINE__)
> > > > +#define xe_device_inject_driver_probe_error_override(__xe, __err_real) \
> > > > +	__xe_device_inject_driver_probe_error(__xe, XE_DEVICE_INJECTED_ERR, __err_real, __func__, \
> > > > +					      __LINE__)
> > > > +#define xe_device_inject_driver_probe_error_finish()	\
> > > > +	__xe_device_inject_driver_probe_error_finish()
> > > > +
> > > > +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > > > +
> > > > +int __xe_device_inject_driver_probe_error(struct xe_device *xe,
> > > > +					  int err_injected, int err_real,
> > > > +					  const char *func, int line);
> > > > +
> > > > +void __xe_device_inject_driver_probe_error_finish(void);
> > > > +
> > > > +#else
> > > > +
> > > > +static inline int __xe_device_inject_driver_probe_error(struct xe_device *xe,
> > > > +							int err_injected, int err_real,
> > > > +							const char *func, int line)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static inline void __xe_device_inject_driver_probe_error_finish(void) {};
> > > > +
> > > > +#endif
> > > > +
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > > index ec7eb7811126..582b8b7cdee4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > @@ -487,6 +487,11 @@ struct xe_device {
> > > >  		int mode;
> > > >  	} wedged;
> > > > 
> > > > +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > > > +	/** @inject_driver_probe_error: Counter used for error injection during probe */
> > > > +	int inject_driver_probe_error;
> > > > +#endif
> > > > +
> > > >  #ifdef TEST_VM_OPS_ERROR
> > > >  	/**
> > > >  	 * @vm_inject_error_position: inject errors at different places in VM
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_pf_service.c b/drivers/gpu/drm/xe/xe_gt_sriov_pf_service.c
> > > > index 0e23b7ea4f3e..b5da321bbbea 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_sriov_pf_service.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_pf_service.c
> > > > @@ -12,6 +12,7 @@
> > > >  #include "regs/xe_guc_regs.h"
> > > >  #include "regs/xe_regs.h"
> > > > 
> > > > +#include "xe_device.h"
> > > >  #include "xe_mmio.h"
> > > >  #include "xe_gt_sriov_printk.h"
> > > >  #include "xe_gt_sriov_pf_helpers.h"
> > > > @@ -275,6 +276,10 @@ int xe_gt_sriov_pf_service_init(struct xe_gt *gt)
> > > >  {
> > > >  	int err;
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(gt_to_xe(gt));
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	pf_init_versions(gt);
> > > > 
> > > >  	err = pf_alloc_runtime_info(gt);
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > > > index 5599464013bd..eb764b44ced7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > > > @@ -353,6 +353,7 @@ int xe_guc_init(struct xe_guc *guc)
> > > >  	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_LOADABLE);
> > > > 
> > > >  	ret = devm_add_action_or_reset(xe->drm.dev, guc_fini_hw, guc);
> > > > +	ret = xe_device_inject_driver_probe_error_override(guc_to_xe(guc), ret);
> > > >  	if (ret)
> > > >  		goto out;
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > index 4b95f75b1546..51ffb05605bb 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > @@ -202,6 +202,7 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
> > > >  	ct->bo = bo;
> > > > 
> > > >  	err = drmm_add_action_or_reset(&xe->drm, guc_ct_fini, ct);
> > > > +	err = xe_device_inject_driver_probe_error_override(xe, err);
> > > >  	if (err)
> > > >  		return err;
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > index 034b29984d5e..d27d843057e7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > @@ -1064,6 +1064,10 @@ int xe_guc_pc_init(struct xe_guc_pc *pc)
> > > >  	u32 size = PAGE_ALIGN(sizeof(struct slpc_shared_data));
> > > >  	int err;
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	if (xe->info.skip_guc_pc)
> > > >  		return 0;
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> > > > index 3fd462fda625..a4cf082d3261 100644
> > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > @@ -136,6 +136,11 @@ int xe_mmio_probe_tiles(struct xe_device *xe)
> > > >  {
> > > >  	size_t tile_mmio_size = SZ_16M;
> > > >  	size_t tile_mmio_ext_size = xe->info.tile_mmio_ext_size;
> > > > +	int err;
> > > > +
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > 
> > > >  	mmio_multi_tile_setup(xe, tile_mmio_size);
> > > >  	mmio_extension_setup(xe, tile_mmio_size, tile_mmio_ext_size);
> > > > diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
> > > > index 77ce9f9ca7a5..3de603e0438f 100644
> > > > --- a/drivers/gpu/drm/xe/xe_module.c
> > > > +++ b/drivers/gpu/drm/xe/xe_module.c
> > > > @@ -56,6 +56,23 @@ module_param_named_unsafe(force_probe, xe_modparam.force_probe, charp, 0400);
> > > >  MODULE_PARM_DESC(force_probe,
> > > >  		 "Force probe options for specified devices. See CONFIG_DRM_XE_FORCE_PROBE for details.");
> > > > 
> > > > +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > > > +/*
> > > > + * The error code at checkpoint X will be overridden when the module argument
> > > > + * inject_driver_load_error is set to value X. By doing so, it is possible to
> > > > + * test proper error handling and improve robustness for current and future
> > > > + * code. One way to test multiple error injection points:
> > > > + *
> > > > + * for i in {1..200}; do
> > > > + *     echo "Run $i"
> > > > + *     modprobe xe inject_driver_probe_error=$i;
> > > > + *     rmmod xe;
> > > > + * done
> > > > + */
> > > > +module_param_named_unsafe(inject_driver_probe_error, xe_modparam.inject_driver_probe_error, int, 0600);
> > > 
> > > we need to break this line... or perhaps get a smaller word for the param name?
> > > 
> > > > +MODULE_PARM_DESC(inject_driver_probe_error, "Inject driver probe error");
> > > > +#endif
> > > > +
> > > >  #ifdef CONFIG_PCI_IOV
> > > >  module_param_named(max_vfs, xe_modparam.max_vfs, uint, 0400);
> > > >  MODULE_PARM_DESC(max_vfs,
> > > > diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
> > > > index 161a5e6f717f..47cefaf8d79b 100644
> > > > --- a/drivers/gpu/drm/xe/xe_module.h
> > > > +++ b/drivers/gpu/drm/xe/xe_module.h
> > > > @@ -20,6 +20,9 @@ struct xe_modparam {
> > > >  	char *force_probe;
> > > >  #ifdef CONFIG_PCI_IOV
> > > >  	unsigned int max_vfs;
> > > > +#endif
> > > > +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > > > +	int inject_driver_probe_error;
> > > >  #endif
> > > >  	int wedged_mode;
> > > >  };
> > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > index 3bce0e550a63..9bb60b300727 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > @@ -644,8 +644,13 @@ static int xe_info_init(struct xe_device *xe,
> > > >  	u32 graphics_gmdid_revid = 0, media_gmdid_revid = 0;
> > > >  	struct xe_tile *tile;
> > > >  	struct xe_gt *gt;
> > > > +	int err;
> > > >  	u8 id;
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	/*
> > > >  	 * If this platform supports GMD_ID, we'll detect the proper IP
> > > >  	 * descriptor to use from hardware registers. desc->graphics will only
> > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > index 9c59a30d7646..a059be07a11d 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > @@ -258,6 +258,7 @@ int xe_pm_init_early(struct xe_device *xe)
> > > >  		return err;
> > > > 
> > > >  	err = drmm_mutex_init(&xe->drm, &xe->d3cold.lock);
> > > > +	err = xe_device_inject_driver_probe_error_override(xe, err);
> > > >  	if (err)
> > > >  		return err;
> > > > 
> > > > @@ -276,6 +277,10 @@ int xe_pm_init(struct xe_device *xe)
> > > >  {
> > > >  	int err;
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	/* For now suspend/resume is only allowed with GuC */
> > > >  	if (!xe_device_uc_enabled(xe))
> > > >  		return 0;
> > > > diff --git a/drivers/gpu/drm/xe/xe_sriov.c b/drivers/gpu/drm/xe/xe_sriov.c
> > > > index 5a1d65e4f19f..c7512d8acc28 100644
> > > > --- a/drivers/gpu/drm/xe/xe_sriov.c
> > > > +++ b/drivers/gpu/drm/xe/xe_sriov.c
> > > > @@ -102,11 +102,13 @@ static void fini_sriov(struct drm_device *drm, void *arg)
> > > >   */
> > > >  int xe_sriov_init(struct xe_device *xe)
> > > >  {
> > > > +	int err;
> > > > +
> > > >  	if (!IS_SRIOV(xe))
> > > >  		return 0;
> > > > 
> > > >  	if (IS_SRIOV_PF(xe)) {
> > > > -		int err = xe_sriov_pf_init_early(xe);
> > > > +		err = xe_sriov_pf_init_early(xe);
> > > > 
> > > >  		if (err)
> > > >  			return err;
> > > > @@ -114,7 +116,8 @@ int xe_sriov_init(struct xe_device *xe)
> > > > 
> > > >  	xe_assert(xe, !xe->sriov.wq);
> > > >  	xe->sriov.wq = alloc_workqueue("xe-sriov-wq", 0, 0);
> > > > -	if (!xe->sriov.wq)
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (!xe->sriov.wq || err)
> > > >  		return -ENOMEM;
> > > > 
> > > >  	return drmm_add_action_or_reset(&xe->drm, fini_sriov, xe);
> > > > diff --git a/drivers/gpu/drm/xe/xe_sriov_pf.c b/drivers/gpu/drm/xe/xe_sriov_pf.c
> > > > index 0f721ae17b26..8d75bb6570f0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_sriov_pf.c
> > > > +++ b/drivers/gpu/drm/xe/xe_sriov_pf.c
> > > > @@ -80,8 +80,14 @@ bool xe_sriov_pf_readiness(struct xe_device *xe)
> > > >   */
> > > >  int xe_sriov_pf_init_early(struct xe_device *xe)
> > > >  {
> > > > +	int err;
> > > > +
> > > >  	xe_assert(xe, IS_SRIOV_PF(xe));
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	return drmm_mutex_init(&xe->drm, &xe->sriov.pf.master_lock);
> > > >  }
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > > > index dda5268507d8..774668ac67b4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_tile.c
> > > > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > > > @@ -114,6 +114,10 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> > > >  {
> > > >  	int err;
> > > > 
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	tile->xe = xe;
> > > >  	tile->id = id;
> > > > 
> > > > @@ -127,6 +131,15 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> > > > 
> > > >  	xe_pcode_init(tile);
> > > > 
> > > > +	/*
> > > > +	 * xe_tile_alloc() and xe_gt_alloc() only fail with -ENOMEM.
> > > > +	 * drmm_zalloc() is used so resources will be freed even if
> > > > +	 * an error is injected.
> > > > +	 */
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > >  	return 0;
> > > >  }
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> > > > index 0d073a9987c2..6eaef7a3c58e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_uc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_uc.c
> > > > @@ -135,6 +135,10 @@ int xe_uc_init_hwconfig(struct xe_uc *uc)
> > > >  {
> > > >  	int ret;
> > > > 
> > > > +	ret = xe_device_inject_driver_probe_error(uc_to_xe(uc));
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > >  	/* GuC submission not enabled, nothing to do */
> > > >  	if (!xe_device_uc_enabled(uc_to_xe(uc)))
> > > >  		return 0;
> > > > diff --git a/drivers/gpu/drm/xe/xe_wa.c b/drivers/gpu/drm/xe/xe_wa.c
> > > > index 28b7f95b6c2f..8baad6106968 100644
> > > > --- a/drivers/gpu/drm/xe/xe_wa.c
> > > > +++ b/drivers/gpu/drm/xe/xe_wa.c
> > > > @@ -825,6 +825,11 @@ int xe_wa_init(struct xe_gt *gt)
> > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > >  	size_t n_oob, n_lrc, n_engine, n_gt, total;
> > > >  	unsigned long *p;
> > > > +	int err;
> > > > +
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (err)
> > > > +		return err;
> > > > 
> > > >  	n_gt = BITS_TO_LONGS(ARRAY_SIZE(gt_was));
> > > >  	n_engine = BITS_TO_LONGS(ARRAY_SIZE(engine_was));
> > > > @@ -833,7 +838,8 @@ int xe_wa_init(struct xe_gt *gt)
> > > >  	total = n_gt + n_engine + n_lrc + n_oob;
> > > > 
> > > >  	p = drmm_kzalloc(&xe->drm, sizeof(*p) * total, GFP_KERNEL);
> > > > -	if (!p)
> > > > +	err = xe_device_inject_driver_probe_error(xe);
> > > > +	if (!p || err)
> > > >  		return -ENOMEM;
> > > > 
> > > >  	gt->wa_active.gt = p;
> > > > diff --git a/drivers/gpu/drm/xe/xe_wopcm.c b/drivers/gpu/drm/xe/xe_wopcm.c
> > > > index d3a99157e523..70674b30c4c6 100644
> > > > --- a/drivers/gpu/drm/xe/xe_wopcm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_wopcm.c
> > > > @@ -206,6 +206,10 @@ int xe_wopcm_init(struct xe_wopcm *wopcm)
> > > >  	bool locked;
> > > >  	int ret = 0;
> > > > 
> > > > +	ret = xe_device_inject_driver_probe_error(xe);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > >  	if (!guc_fw_size)
> > > >  		return -EINVAL;
> > > > 
> > > > @@ -252,8 +256,9 @@ int xe_wopcm_init(struct xe_wopcm *wopcm)
> > > >  		guc_wopcm_base / SZ_1K, guc_wopcm_size / SZ_1K);
> > > > 
> > > >  check:
> > > > +	ret = xe_device_inject_driver_probe_error_override(xe, ret);
> > > >  	if (__check_layout(xe, wopcm->size, guc_wopcm_base, guc_wopcm_size,
> > > > -			   guc_fw_size, huc_fw_size)) {
> > > > +			   guc_fw_size, huc_fw_size) && !ret) {
> > > >  		wopcm->guc.base = guc_wopcm_base;
> > > >  		wopcm->guc.size = guc_wopcm_size;
> > > >  		XE_WARN_ON(!wopcm->guc.base);
> > > > --
> > > > 2.43.0
> > > > 


More information about the Intel-xe mailing list