[RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time

Mon Sep 16 08:45:37 UTC 2024

On Fri, Sep 13, 2024 at 04:15:49PM -0400, Rodrigo Vivi wrote:
> On Fri, Sep 13, 2024 at 04:33:05PM +0200, Francois Dugast wrote:
> > This RFC is a rework of https://patchwork.freedesktop.org/series/137314/
> > but instead of importing custom macros from i915, it makes use of the
> > standard kernel fault injection infrastructure, see fault-injection.rst.
> > 
> > In particular, it leverages error injectable functions with the macro
> > ALLOW_ERROR_INJECTION(). This macro requires minimal code changes to the
> > faulting function, if it meets the injectable functions requirements:
> > fault-injection.html#requirements-for-the-error-injectable-functions
> > 
> > Unfortunately this is not the case in most of the injection points of the
> > original series, so a wrapper is added if needed. Only a few examples are
> > shown in this RFC to discuss the proper pattern to apply.
> > 
> > The return code of the functions using ALLOW_ERROR_INJECTION() can be
> > conditionnally modified at runtime by tuning some debugfs entries. This
> > requires CONFIG_FUNCTION_ERROR_INJECTION (among others).
> > 
> > One way to use fault injection at probe time by making each of those
> > functions fail one at a time is:
> > 
> >     FAILTYPE=fail_function
> >     DEVICE="0000:00:08.0" # depends on the system
> >     ERRNO=-12 # -ENOMEM, other value might be needed depending on error handling
> > 
> >     echo N > /sys/kernel/debug/$FAILTYPE/task-filter
> >     echo 100 > /sys/kernel/debug/$FAILTYPE/probability
> >     echo 0 > /sys/kernel/debug/$FAILTYPE/interval
> >     echo -1 > /sys/kernel/debug/$FAILTYPE/times
> >     echo 0 > /sys/kernel/debug/$FAILTYPE/space
> >     echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
> > 
> >     modprobe xe
> >     echo $DEVICE > /sys/bus/pci/drivers/xe/unbind
> > 
> >     grep -oP "^.* \[xe\]" /sys/kernel/debug/$FAILTYPE/injectable | cut -d ' ' -f 1 | while read -r FUNCTION ; do
> >         echo "Injecting fault in $FUNCTION"
> >         echo "" > /sys/kernel/debug/$FAILTYPE/inject
> >         echo $FUNCTION > /sys/kernel/debug/$FAILTYPE/inject
> >         printf %#x $ERRNO > /sys/kernel/debug/$FAILTYPE/$FUNCTION/retval
> >         echo $DEVICE > /sys/bus/pci/drivers/xe/bind
> >     done
> > 
> >     rmmod xe
> > 
> > Signed-off-by: Francois Dugast <francois.dugast at intel.com>
> > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c | 37 +++++++++++++++++++++++++++++++---
> >  drivers/gpu/drm/xe/xe_tile.c   | 18 +++++++++++++++++
> >  drivers/gpu/drm/xe/xe_wopcm.c  |  3 +++
> >  3 files changed, 55 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 4d3c794f134c..2db62aa45b39 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -6,6 +6,7 @@
> >  #include "xe_device.h"
> >  
> >  #include <linux/delay.h>
> > +#include <linux/fault-inject.h>
> >  #include <linux/units.h>
> >  
> >  #include <drm/drm_aperture.h>
> > @@ -300,6 +301,37 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
> >  	ttm_device_fini(&xe->ttm);
> >  }
> >  
> > +/* Wrapper for fault injection */
> > +static noinline int device_create_ttm_device_init(struct xe_device *xe)
> > +{
> > +	/*
> > +	 * In case of error injection, the flow is modified because ttm_device_init()
> > +	 * executes (likely without error) but the return code is overridden in any
> > +	 * case to indicate an error in ttm_device_init(), which likely did not
> > +	 * occur.
> > +	 */
> > +	return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > +			       xe->drm.anon_inode->i_mapping,
> > +			       xe->drm.vma_offset_manager, false, false);
> > +
> > +	/*
> > +	 * The alternative below would also change the flow because in case of error
> > +	 * injection, ttm_device_init() would not be executed at all.
> > +	 *
> > +	 * int err = device_create_ttm_device_init_inject_fault();
> > +	 * if (err)
> > +	 *	return err;
> > +	 * return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > +	 *		       xe->drm.anon_inode->i_mapping,
> > +	 *		       xe->drm.vma_offset_manager, false, false);
> > +	 *
> > +	 * ...
> > +	 * static noinline int device_create_ttm_device_init_inject_fault() { return 0; }
> > +	 * ALLOW_ERROR_INJECTION(device_create_ttm_device_init_inject_fault, ERRNO);
> > +	 */
> > +}
> > +ALLOW_ERROR_INJECTION(device_create_ttm_device_init, ERRNO);
> > +
> >  struct xe_device *xe_device_create(struct pci_dev *pdev,
> >  				   const struct pci_device_id *ent)
> >  {
> > @@ -316,9 +348,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
> >  	if (IS_ERR(xe))
> >  		return xe;
> >  
> > -	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > -			      xe->drm.anon_inode->i_mapping,
> > -			      xe->drm.vma_offset_manager, false, false);
> > +	err = device_create_ttm_device_init(xe);
> 
> I believe the function name could be simplified to xe_device_init_ttm...

Yes indeed.

> 
> >  	if (WARN_ON(err))
> >  		goto err;
> >  
> > @@ -550,6 +580,7 @@ static int wait_for_lmem_ready(struct xe_device *xe)
> >  
> >  	return 0;
> >  }
> > +ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO);
> 
> I liked how simple that is....

It is quite straightforward when the function is error injectable, which is
often not the case. Especially this requirement is often not met: "The
function does not execute any code which can change any state before the first
error return."

This is why other examples in this patch require an extra wrapper, see below.

> 
> >  
> >  static void update_device_info(struct xe_device *xe)
> >  {
> > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > index dda5268507d8..22c0a42af9c2 100644
> > --- a/drivers/gpu/drm/xe/xe_tile.c
> > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > @@ -3,6 +3,8 @@
> >   * Copyright © 2023 Intel Corporation
> >   */
> >  
> > +#include <linux/fault-inject.h>
> > +
> >  #include <drm/drm_managed.h>
> >  
> >  #include "xe_device.h"
> > @@ -99,6 +101,12 @@ static int xe_tile_alloc(struct xe_tile *tile)
> >  	return 0;
> >  }
> >  
> > +static noinline int tile_init_early_inject_fault(void)
> > +{
> > +	return 0;
> > +}
> > +ALLOW_ERROR_INJECTION(tile_init_early_inject_fault, ERRNO);
> > +
> >  /**
> >   * xe_tile_init_early - Initialize the tile and primary GT
> >   * @tile: Tile to initialize
> > @@ -117,6 +125,16 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> >  	tile->xe = xe;
> >  	tile->id = id;
> >  
> > +	/*
> > +	 * The flow is modified because xe_tile_init_early() fails before the
> > +	 * first possible error, from xe_tile_alloc(). It is does not match the
> > +	 * 2nd requirement of
> > +	 * fault-injection.html#requirements-for-the-error-injectable-functions
> > +	 */
> 
> I'm afraid I didn't understand this.

In this example we end up returning an error from an artificial code location:

    err = tile_init_early_inject_fault();
    if (err)
        return err;

... before even starting to execute the first statement which could return a
real error, xe_tile_alloc(). Not focusing on this particular example but in
general this can have side effects and create false positives if the state
is not changed as expected, either because the first statement that can fail
was not executed at all, or because all was executed successfully but still
reported as error.

Not sure if this is overthinking, after all this issue is also present with
the i915 macro for error injection and maybe the pros outweigh the cons.

Francois

> 
> and the name could perhaps start with fault_inject?!
> 
> or maybe:
> 
> xe_fault_inject_tile_init_early();

Sure, whatever helps easily identify wrappers only added for fault
injection. Maybe we can even remove the xe_ prefix as the function can be
static?

Francois

> 
> > +	err = tile_init_early_inject_fault();
> > +	if (err)
> > +		return err;
> > +
> >  	err = xe_tile_alloc(tile);
> >  	if (err)
> >  		return err;
> > diff --git a/drivers/gpu/drm/xe/xe_wopcm.c b/drivers/gpu/drm/xe/xe_wopcm.c
> > index 93c82825d896..88a201122a22 100644
> > --- a/drivers/gpu/drm/xe/xe_wopcm.c
> > +++ b/drivers/gpu/drm/xe/xe_wopcm.c
> > @@ -5,6 +5,8 @@
> >  
> >  #include "xe_wopcm.h"
> >  
> > +#include <linux/fault-inject.h>
> > +
> >  #include "regs/xe_guc_regs.h"
> >  #include "xe_device.h"
> >  #include "xe_force_wake.h"
> > @@ -268,3 +270,4 @@ int xe_wopcm_init(struct xe_wopcm *wopcm)
> >  
> >  	return ret;
> >  }
> > +ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO);
> > -- 
> > 2.43.0
> >