[PATCH] drm/xe/vm: don't ignore error when in_kthread

Thu Feb 8 09:22:02 UTC 2024

On Mon, 2024-02-05 at 18:41 +0000, Matthew Brost wrote:
> On Fri, Feb 02, 2024 at 05:14:36PM +0000, Matthew Auld wrote:
> > If GUP fails and we are in_kthread, we can have pinned = 0 and ret = 0.
> > If that happens we call sg_alloc_append_table_from_pages() with
> > n_pages = 0, which is not well behaved and can trigger:
> > 
> > kernel BUG at include/linux/scatterlist.h:115!
> > 
> > depending on whether the pages array happens to be zeroed or not. Even
> > if we don't hit that, it crashes later when trying to dma_map the
> > returned table.
> > 
> > Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> 
> Someone from Habana pointed this out a while back and forgot to follow
> up on fixing this. Thanks for fixing this; it looks correct.
> 
> Should we include a Fixes tag here? I am thinking so.
> 
> With a fixes tag:
> Reviewed-by: Matthew Brost <matthew.brost at intel.com>

Hi, 
Matt + Matt

I think this requires yet another fix. The reason for this odd
construct was that on process exit (CTRL-C), the userptr mappings are
torn down, leading to an -EFAULT here. This is then propagated to the
rebind worker and we get a printout like

[  188.922692] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922913] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922943] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922948] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922952] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922956] xe 0000:03:00.0: [drm] VM worker error: -14
[  188.922960] xe 0000:03:00.0: [drm] VM worker error: -14

(xe-exec-threads --r threads-cm-userptr-invalidate-race + CTRL-C)

And the idea was that the rebind worker would just re-enable without
setting up these bindings. If any job were then still accessing this
address (it shouldn't at this point, right?), we'd catch this with an
IOMMU page fault or similar.
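
To make the two behaviours concrete, here is a minimal userspace sketch of
the pin loop, not the driver code itself. mock_gup(), pin_pages_sketch() and
the swallow_in_kthread / mapping_gone flags are hypothetical names invented
for illustration; the real loop calls GUP and lives in
xe_vma_userptr_pin_pages().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for GUP: pins at most two pages per call, or
 * fails with -EFAULT once the (simulated) mapping has been torn down.
 */
static int mock_gup(int want, bool mapping_gone)
{
	if (mapping_gone)
		return -EFAULT;
	return want > 2 ? 2 : want;
}

/*
 * Simplified pin loop. With swallow_in_kthread set it mimics the old
 * construct (error turned into 0 when in_kthread); with it cleared it
 * mimics the patched code, which always propagates the error.
 */
static int pin_pages_sketch(int num_pages, bool in_kthread,
			    bool swallow_in_kthread, bool mapping_gone,
			    int *pinned_out)
{
	int pinned = 0, ret = 0;

	assert(num_pages > 0);
	while (pinned < num_pages) {
		ret = mock_gup(num_pages - pinned, mapping_gone);
		if (ret < 0) {
			if (in_kthread && swallow_in_kthread)
				ret = 0; /* old behaviour: hide the error */
			break;
		}
		pinned += ret;
		ret = 0;
	}
	*pinned_out = pinned;
	return ret;
}
```

With the old construct, a torn-down mapping in a kthread returns 0 with
pinned == 0, which is the state that then reaches
sg_alloc_append_table_from_pages(). With the patch, the same case returns
-EFAULT, which is what ends up in the rebind worker.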

But in any case, we need to filter out the above log spamming.
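
One possible shape for such a filter, as a rough userspace sketch only:
worker_error_is_noise(), report_worker_error() and the process_exiting flag
are all hypothetical, and how the worker would actually learn that the
owning process is going away is left open here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: an -EFAULT seen while the owning process is
 * exiting is expected (userptr mappings are being torn down), so it is
 * treated as noise rather than as a real worker failure.
 */
static bool worker_error_is_noise(int err, bool process_exiting)
{
	return err == -EFAULT && process_exiting;
}

/*
 * Hypothetical reporting helper. Returns true if a message was printed,
 * false if there was no error or the error was filtered out.
 */
static bool report_worker_error(int err, bool process_exiting)
{
	if (err == 0 || worker_error_is_noise(err, process_exiting))
		return false;

	fprintf(stderr, "VM worker error: %d\n", err);
	return true;
}
```

Only the expected -EFAULT-on-exit case is silenced in this sketch; any
other error, or an -EFAULT while the process is still alive, is still
printed.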

/Thomas

> 
> > ---
> >  drivers/gpu/drm/xe/xe_vm.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 9c1c68a2fff7..63aeb3aead04 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -114,11 +114,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> >  					  num_pages - pinned,
> >  					  read_only ? 0 : FOLL_WRITE,
> >  					  &pages[pinned]);
> > -		if (ret < 0) {
> > -			if (in_kthread)
> > -				ret = 0;
> > +		if (ret < 0)
> >  			break;
> > -		}
> >  
> >  		pinned += ret;
> >  		ret = 0;
> > -- 
> > 2.43.0
> >