[PATCH] drm/xe: Thread prefetch of SVM ranges

Mrozek, Michal michal.mrozek at intel.com
Mon Jun 16 12:06:51 UTC 2025


> >>> > 2) Do we actually *want* to use 5 CPU cores for this?
> >>>
> >>> Yes, I profiled this with a test issuing 64MB prefetches—5 threads was
> >>> ideal. I have a comment in the code about this. Once [1] lands, we’ll
> >>> likely only need 2 threads on BMG. That would probably get us to a bus
> >>> 8× faster than BMG; for 16×, we might need more threads. But I think
> >>> we’ll always want at least 2, as there will always be some CPU
> >>> overhead that limits copy bandwidth due to serialization.
> >>
> >>What I meant was IIRC NEO has previously been picky about starting
> >>threads. Perhaps Michal can enlighten us here?

Multiple threads only help if we are able to overlap work that would otherwise leave the hardware idle.
I.e. if a single CPU thread is able to saturate system -> VRAM bandwidth, then there is no point in having multiple threads doing the same thing:
because they share the link, all of them would finish later, and we would actually increase latencies instead of reducing them.

Simple example: if a single thread saturates the link and a whole copy operation takes 1ms, then if you:
- run 5 copies concurrently, all of them finish at the 5ms mark due to link sharing
- run 5 copies sequentially, one at a time, the first finishes at 1ms, the second at 2ms, the third at 3ms and so on, which unblocks consumers much sooner
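
To make the arithmetic in the example above concrete, here is a minimal Python sketch. It assumes an idealized link that is split equally between concurrent copies and a single copy that fully saturates it; the function names are made up for illustration and are not from the patch.

```python
def concurrent_finish_times(n_copies, copy_ms):
    """All copies start at t=0 and share the link equally (assumption),
    so each gets 1/n of the bandwidth and they all finish together."""
    return [n_copies * copy_ms] * n_copies


def sequential_finish_times(n_copies, copy_ms):
    """Copies run one at a time, each with the full link bandwidth."""
    return [copy_ms * (i + 1) for i in range(n_copies)]


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    conc = concurrent_finish_times(5, 1)
    seq = sequential_finish_times(5, 1)
    print("concurrent:", conc, "mean", mean(conc))
    print("sequential:", seq, "mean", mean(seq))
```

The last copy finishes at 5ms either way, but the average completion time drops from 5ms to 3ms when the copies run one at a time, which is the point about unblocking consumers sooner.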

Hence I would be very careful about using 5 threads to do CPU copies concurrently.
You may also explore vector intrinsics to do the transfers; sample -> https://github.com/pmodels/mpich/blob/27229e089554fee8ac0ac9da28e56fa7dc648a45/src/mpl/src/gpu/mpl_gpu_ze.c#L3345

In general I would advise doing at most 2 copies concurrently, to overlap the ramp-up / ramp-down between copies where the machine can potentially go idle.
Too much copy parallelism may give diminishing returns, especially for larger 2MB pages.
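
A toy model of why 2 copies in flight can be enough. This is only a sketch under assumptions that are mine, not measurements: each copy has a ramp phase during which the link is idle (200us here), followed by a transfer phase that uses the whole link (1000us here), and the link services one transfer at a time. All names and numbers are illustrative.

```python
def pipeline_finish_times(n_copies, ramp_us, xfer_us, max_in_flight):
    """Finish time of each copy when at most max_in_flight copies are active.

    Assumed model: the ramp phase does not use the link; the transfer phase
    uses the full link and transfers are serviced one after another.
    """
    link_free = 0
    finish = []
    for i in range(n_copies):
        # A copy can start once the copy max_in_flight positions earlier is done.
        start = finish[i - max_in_flight] if i >= max_in_flight else 0
        ready = start + ramp_us          # ramp done, waiting for the link
        begin = max(ready, link_free)    # transfer starts when the link is free
        end = begin + xfer_us
        link_free = end
        finish.append(end)
    return finish


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, pipeline_finish_times(4, 200, 1000, depth))
```

In this model, going from 1 to 2 copies in flight hides the ramp time of every copy after the first, while going from 2 to 4 changes nothing because the link is already busy back to back.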

For 4KB transfers I agree we may be bottlenecked more by copy engine inefficiency, and running multiple (2) small copies may give some nice results.

We should also be pretty conservative in using CPU threads, especially in higher numbers: at scale, if we take too many threads we may introduce imbalance in the system, which would create bubbles and compromise performance due to the butterfly effect.



