[PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths
Matthew Brost
matthew.brost at intel.com
Fri Aug 9 16:40:22 UTC 2024
On Fri, Aug 09, 2024 at 03:53:20PM +0200, Thomas Hellström wrote:
> On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote:
> > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> > > Use fault-injection to test partial TTM swapout and interrupted
> > > swapin.
> > > Return -EINTR for swapin to test the callers ability to handle and
> > > restart the swapin, and on swapout perform a partial swapout to
> > > test that
> > > the swapin and release_shrunken functionality.
> > >
> > > Cc: Christian König <christian.koenig at amd.com>
> > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: <dri-devel at lists.freedesktop.org>
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > ---
> > > drivers/gpu/drm/Kconfig | 10 ++++++++++
> > > drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
> > > 2 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > > index fd0749c0c630..9f27271bfab8 100644
> > > --- a/drivers/gpu/drm/Kconfig
> > > +++ b/drivers/gpu/drm/Kconfig
> > > @@ -272,6 +272,16 @@ config DRM_GPUVM
> > > GPU-VM representation providing helpers to manage a GPUs
> > > virtual
> > > address space
> > >
> > > +config DRM_TTM_BACKUP_FAULT_INJECT
> > > + bool "Enable fault injection during TTM backup"
> > > + depends on DRM_TTM
> > > + default n
> > > + help
> > > + Inject recoverable failures during TTM backup and
> > > recovery of
> > > + backed-up objects. For DRM driver developers only.
> > > +
> > > + If in doubt, choose N.
> > > +
> > > config DRM_BUDDY
> > > tristate
> > > depends on DRM
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > > b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index 38e50cf81b0a..d32a1f2e5e50 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > > struct ttm_backup *backup,
> > > struct ttm_operation_ctx *ctx)
> > > {
> > > + static unsigned long __maybe_unused swappedin;
> > > unsigned int i, nr = 1 << restore->order;
> > > int ret = 0;
> > >
> > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > > if (handle == 0)
> > > continue;
> > >
> > > + if
> > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> > > + ctx->interruptible &&
> > > + ++swappedin % 100 == 0) {
> > > + ret = -EINTR;
> > > + break;
> > > + }
> >
> > So here this -EINTR would be kicked to the user IOCTL which triggered
> > the BO validate and retry? The restore then should be able to
> > successfully pick up where it left off?
>
> Yes, that's the point. For the direct swap-cache backend I initially
> used (before concluding that the shmem one actually seemed to work
> fine), we had an interruptible wait here.
>
> Supporting interrupts is generally a good thing but for the pool code,
> this makes the code considerably more complicated. However, this is a
> good way to ensure drivers actually support -EINTR for the call chain.
> If not, adding interrupt capability "later" will most likely be a PITA.
>
> >
> > > +
> > > ret = backup->ops->copy_backed_up_page
> > > (backup, restore->first_page[i],
> > > handle, ctx->interruptible);
> > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool,
> > > struct ttm_tt *ttm, bool purge,
> > >
> > > alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > > __GFP_RETRY_MAYFAIL;
> > >
> > > - for (i = 0; i < ttm->num_pages; ++i) {
> > > + num_pages = ttm->num_pages;
> > > +
> > > + /* Pretend doing fault injection by shrinking only half of
> > > the pages. */
> > > +
> > > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> > > + num_pages = DIV_ROUND_UP(num_pages, 2);
> >
> > So what happens here? Half the pages swapped out, then upon restore
> > half
> > swapped back in? The shrinker continues to walk until enough pages
> > swapped out?
>
> Yes, exactly. Ideally we'd want some intermediate state here so that a
> partially swapped out bo is still eligible for further shrinking.
>
Cool, glad my understanding is correct. Having error injection is always
a good idea to ensure error paths / corner cases work rather than
finding they blow up much later when these cases somehow get triggered.
With that:
Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> /Thomas
>
>
> >
> > Matt
> >
> > > +
> > > + for (i = 0; i < num_pages; ++i) {
> > > page = ttm->pages[i];
> > > if (unlikely(!page))
> > > continue;
> > > --
> > > 2.44.0
> > >
>
More information about the Intel-xe
mailing list