[PATCH] drm/xe: Thread prefetch of SVM ranges

Matthew Brost matthew.brost at intel.com
Tue Jun 17 14:30:30 UTC 2025


On Mon, Jun 16, 2025 at 06:06:51AM -0600, Mrozek, Michal wrote:
> > >>> > 2) Do we actually *want* to use 5 CPU cores for this?
> > >>>
> > >>> Yes, I profiled this with a test issuing 64MB prefetches; 5 threads was
> > >>> ideal. I have a comment in the code about this. Once [1] lands, we’ll
> > >>> likely only need 2 threads on BMG. That would probably get us to a bus
> > >>> 8× faster than BMG; for 16×, we might need more threads. But I think
> > >>> we’ll always want at least 2, as there will always be some CPU
> > >>> overhead that limits copy bandwidth due to serialization.
> > >>
> > >>What I meant was IIRC NEO has previously been picky about starting
> > >>threads. Perhaps Michal can enlighten us here?
> 
> Multiple threads only give a benefit if we are able to overlap work that would otherwise leave the hardware idle.
> I.e., if a single CPU thread is able to saturate system -> VRAM bandwidth, then there is no point in having multiple threads doing the same thing: due to link sharing they would all finish later, and we would actually increase latencies instead of reducing them.
> 
> A simple example: if a single thread saturates the link and the whole copy operation takes 1ms, then if you:
> - run 5 copies concurrently, all of them finish at the 5ms mark due to link sharing
> - run 5 copies sequentially, one at a time, the first finishes at 1ms, the second at 2ms, the third at 3ms, and so on, which unblocks consumers much faster
> 
> Hence I would be very careful about using 5 threads to do CPU copies concurrently.
> Also, you may explore vector intrinsics to do the transfers; sample -> https://github.com/pmodels/mpich/blob/27229e089554fee8ac0ac9da28e56fa7dc648a45/src/mpl/src/gpu/mpl_gpu_ze.c#L3345
> 
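
On the intrinsics suggestion: the linked MPICH helper boils down to a
non-temporal (streaming) copy that bypasses the CPU cache. A minimal
userspace-style sketch of the pattern below, just to illustrate the
technique; it is not what this series does, and in the kernel the SIMD
use would also need kernel_fpu_begin()/kernel_fpu_end():

#include <immintrin.h>
#include <stddef.h>

/* Streaming copy with non-temporal stores. Assumes AVX, dst 32-byte
 * aligned, and bytes a multiple of 32.
 */
static void stream_copy(void *dst, const void *src, size_t bytes)
{
	__m256i *d = dst;
	const __m256i *s = src;
	size_t i;

	for (i = 0; i < bytes / sizeof(__m256i); i++)
		_mm256_stream_si256(&d[i], _mm256_loadu_si256(&s[i]));

	_mm_sfence();	/* make the non-temporal stores globally visible */
}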

The bottleneck lies in the migrate_vma_* functions, which take longer
than the copy job. A single 2MB copy can reach 16 GB/s, but it must be
placed between migrate_vma_setup and migrate_vma_finalize. These steps
currently take approximately 310 µs, compared to around 130 µs for the
copy itself, which severely impacts prefetch performance, effectively
reducing it to 4 GB/s.
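
For context, each range migration looks roughly like the sketch below
(error paths trimmed; xe_copy_to_vram() is a made-up stand-in for
allocating the device pages, filling migrate.dst, and issuing the copy
job):

#include <linux/migrate.h>
#include <linux/slab.h>

/* The ~310us above is migrate_vma_setup() + migrate_vma_finalize();
 * the ~130us is the copy job for a 2MB range.
 */
static int prefetch_one_range(struct vm_area_struct *vma,
			      unsigned long start, unsigned long end,
			      void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	struct migrate_vma migrate = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int err = -ENOMEM;

	migrate.src = kvcalloc(npages, sizeof(*migrate.src), GFP_KERNEL);
	migrate.dst = kvcalloc(npages, sizeof(*migrate.dst), GFP_KERNEL);
	if (!migrate.src || !migrate.dst)
		goto out;

	err = migrate_vma_setup(&migrate);	/* bulk of the ~310us */
	if (err || !migrate.cpages)
		goto out;

	err = xe_copy_to_vram(&migrate);	/* ~130us for 2MB */

	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);		/* rest of the ~310us */
out:
	kvfree(migrate.src);
	kvfree(migrate.dst);
	return err;
}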

This was tested with the prefetch benchmark in the following IGT series [1].

[1] https://patchwork.freedesktop.org/patch/658835/?series=150306&rev=1

> In general I would advise doing at most 2 copies concurrently, to overlap the ramp up / ramp down between copies where the machine can potentially go idle.
> Too much copy parallelism may give diminishing returns, especially for larger 2MB pages.
>

Once the migrate_vma_* functions are faster (e.g., with 2M device pages), we
should only need 2 threads to hit full copy bandwidth. I think this should
scale to a bus 8x faster than BMG's if 2M device pages give us the speedup I
am expecting.
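
Back-of-envelope with the numbers above (hand-wavy, ignoring ramp up /
ramp down between copies):

  1 thread: 2MB / (310us + 130us) ~= 4.8 GB/s, i.e. the ~4 GB/s observed
  the link is only busy for 130us out of every 440us per thread, so it
  takes ~440 / 130 ~= 3.4, i.e. 4-5 threads, to keep a 16 GB/s link busy
  if the migrate_vma_* overhead shrinks to roughly the copy time, then
  ~260 / 130 = 2 threads are enough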

I can change this series to use only 2 threads, which I'd expect to give
~8 GB/s of prefetch bandwidth for now, if using fewer threads is
preferred.
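
If 2 threads is the preference, only the concurrency cap really changes;
the fan-out would look roughly like the sketch below (the names here are
made up for illustration, not the exact code in this series):

#include <linux/workqueue.h>

#define XE_PREFETCH_THREADS	2	/* 5 in this revision */

struct xe_prefetch_work {
	struct work_struct work;
	struct xe_svm_range *range;	/* per-range state */
};

/* One work item per SVM range: setup -> copy -> finalize, as in the
 * earlier sketch.
 */
static void xe_prefetch_work_fn(struct work_struct *w)
{
	struct xe_prefetch_work *pw =
		container_of(w, struct xe_prefetch_work, work);

	xe_prefetch_one_range(pw->range);	/* placeholder helper */
}

/* Unbound workqueue with max_active capping concurrent workers */
static struct workqueue_struct *xe_prefetch_wq;

static int xe_prefetch_init(void)
{
	xe_prefetch_wq = alloc_workqueue("xe-prefetch", WQ_UNBOUND,
					 XE_PREFETCH_THREADS);
	return xe_prefetch_wq ? 0 : -ENOMEM;
}

A prefetch would then queue_work() one item per range on xe_prefetch_wq
and wait with flush_workqueue().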

Matt

> For 4KB transfers I agree we may be bottlenecked more by copy engine inefficiency, and running multiple (2) small copies may give some nice results.
> 
> Also, we should be pretty conservative in using CPU threads, especially in higher numbers: at scale, if we take too many threads we may introduce imbalance in the system, which would create bubbles and compromise performance due to a butterfly effect.
> 
> 

