[RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
Matthew Brost
matthew.brost at intel.com
Mon Aug 18 16:25:20 UTC 2025
On Mon, Aug 18, 2025 at 01:07:26PM -0300, Jason Gunthorpe wrote:
> On Sat, Aug 09, 2025 at 03:51:32PM +0200, Thomas Hellström wrote:
> > GPU use-cases for mmu_interval_notifiers with hmm often involve
> > starting a GPU operation and then waiting for it to complete.
> > These operations are typically context preemption or TLB flushing.
> >
> > With single-pass notifiers per GPU this doesn't scale in
> > multi-GPU scenarios. In those scenarios we'd want to first start
> > preemption or TLB flushing on all GPUs and as a second pass wait
> > for them to complete on all GPUs.
>
> The idea seems reasonable but I'm not sure I like the naming of
> 'multipass' or necessarily the complexity.
>
> This is sort of a co-operative multithreading thing.
>
> Do you really need a linked list here? At least justify the design
> choices in the commit message..
>
I think this choice makes sense: it allows embedding the wait state from
the initial notifier call into the pass structure. Patch [6] shows this
by attaching the issued TLB invalidation fences to the pass. Since a
single notifier may be invoked multiple times with different ranges but
the same seqno, I think this is the correct design choice; otherwise
there's no unambiguous way to track per-invocation wait state. I agree
this should be documented in both the commit message and kernel-doc.
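
To make that concrete, roughly what I have in mind on the driver side
(sketch only: all the my_* names are invented for illustration, the real
version is in [6], and blockable/error handling is omitted):

struct my_inval_pass {
	struct mmu_interval_notifier_pass base;
	struct dma_fence *tlb_fence;	/* flush issued in the first pass */
};

/* Second pass: every device has issued its flush by now, just wait. */
static struct mmu_interval_notifier_pass *
my_wait_pass(struct mmu_interval_notifier_pass *p,
	     const struct mmu_notifier_range *range,
	     unsigned long cur_seq)
{
	struct my_inval_pass *pass = container_of(p, struct my_inval_pass, base);

	dma_fence_wait(pass->tlb_fence, false);
	dma_fence_put(pass->tlb_fence);
	kfree(pass);
	return NULL;	/* done, no further passes */
}

/* First pass: kick off the flush and hand the wait state back to the core. */
static bool my_invalidate_multipass(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq,
				    struct mmu_interval_notifier_pass **pass)
{
	struct my_inval_pass *p = kzalloc(sizeof(*p), GFP_NOWAIT);

	mmu_interval_set_seq(mni, cur_seq);

	if (!p)
		return my_invalidate_sync(mni, range);	/* made-up fallback */

	p->tlb_fence = my_issue_tlb_inval(mni, range);	/* made-up helper */
	p->base.pass = my_wait_pass;
	*pass = &p->base;
	return true;
}

Each invocation gets its own my_inval_pass, so two calls with the same
seqno but different ranges can't stomp on each other's fence, which is
exactly the state I don't see a clean home for without the list.
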
Matt
[6] https://patchwork.freedesktop.org/patch/667844/?series=152725&rev=1
> > +struct mmu_interval_notifier_pass {
> > +	struct list_head link;
> > +	/**
> > +	 * @pass: Driver callback for an additional pass.
> > +	 * @additional_pass: Pointer to the mmu_interval_notifier_pass structure.
> > +	 * @range: The mmu_notifier_range.
> > +	 * @cur_seq: The current sequence set by the first pass.
> > +	 *
> > +	 * Return: Either a pointer to a valid mmu_interval_notifier_pass for
> > +	 * another pass to be called, or %NULL if processing is complete for this
> > +	 * notifier. There is no error reporting mechanism for additional passes.
> > +	 */
> > +	struct mmu_interval_notifier_pass *
> > +	(*pass) (struct mmu_interval_notifier_pass *additional_pass,
> > +		 const struct mmu_notifier_range *range,
> > +		 unsigned long cur_seq);
> > +};
> > +
> >  /**
> >   * struct mmu_interval_notifier_ops
> >   * @invalidate: Upon return the caller must stop using any SPTEs within this
> > @@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
> >  	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
> >  			   const struct mmu_notifier_range *range,
> >  			   unsigned long cur_seq);
> > +	bool (*invalidate_multipass)(struct mmu_interval_notifier *interval_sub,
> > +				     const struct mmu_notifier_range *range,
> > +				     unsigned long cur_seq,
> > +				     struct mmu_interval_notifier_pass **pass);
>
> Couldn't this just have a pass number counter and some return code to
> indicate this notifier is done?
>
> Or do you really need more than 2 passes? Start/finish make sense
> too. Otherwise you may have issues overlapping the backgroundable
> operations between different driver types?
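
For reference, I read the counter-based alternative as something roughly
like this (sketch only, name and semantics invented here):

	bool (*invalidate_pass)(struct mmu_interval_notifier *interval_sub,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq,
				unsigned int pass_nr);
	/* core keeps re-calling with pass_nr + 1 until this returns false */

That drops the list, but the per-invocation wait state (the fences above)
then has to live in the driver keyed by range and seqno, which runs into
the same per-invocation ambiguity mentioned above.
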
>
> > +static void mn_itree_additional_passes(struct list_head *additional_passes,
> > +				       const struct mmu_notifier_range *range,
> > +				       unsigned long cur_seq)
> > +{
> > +	struct mmu_interval_notifier_pass *p, *next;
> > +
> > +	while (!list_empty(additional_passes)) {
> > +		list_for_each_entry_safe(p, next, additional_passes, link) {
> > +			list_del_init(&p->link);
> > +			p = p->pass(p, range, cur_seq);
> > +			if (p)
> > +				list_add_tail(&p->link, additional_passes);
> > +		}
> > +	}
>
> Like this it is very naive: if one driver has only 'prepare' and 'wait
> for device ack' passes, then it will immediately stop being concurrent
> while another device may still be working on its 3rd pass.
>
> So either this should be more complicated to properly support
> different numbers of passes per registration or we should just support
> two passes and be done with it?
>
> Jason
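
To spell out the ordering concern with the loop above on a made-up
example: with two subscriptions on the same range, the pass list after
the first round might be [A.wait_ack, B.pass2]. The loop then calls
A.wait_ack, which blocks until device A acks while B's second step has
not even been issued yet, then runs B.pass2, and only then waits in
B.pass3. So A's wait is serialized in front of B's remaining work
instead of overlapping with it.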