[Mesa-dev] [PATCH 1/2] gallium: Add PIPE_CAP_USER_MEMORY_PAGE_SIZE for page size of user pointers

Thu Aug 17 18:17:11 UTC 2017

Hi,

Thanks a lot for these answers.

On Thu, 2017-08-17 at 15:01 +0200, Christian König wrote:
> Am 17.08.2017 um 13:54 schrieb Jan Vesely:
> > On Thu, 2017-08-17 at 11:54 +0200, Christian König wrote:
> > > [SNIP]
> > > In general ATS works completely different to GPUVM and is rather bound
> > > to the CPU page tables.
> > > 
> > > But GPUVM on everything before Vega10 has a so called fragmentation size
> > > in their page table entries which tell the TLB that a certain bunch of
> > > them are consecutive and so only one of them needs to be fetched and cached.
> > 
> > did pre-Vega GPUVM have the x86 style multilevel (4-5) structure of
> > page tables?
> 
> No, not even remotely. GPUVM page tables on pre-Vega can only deal with 
> two levels, some blocks like display can even only handle one or start 
> to run into problems.

Can you share the motivation for this change? Was it done to accommodate
sparse mappings, or to ease synchronization with CPU page tables (HMM-like)?
Given the problems you mentioned below, the change seems counterintuitive.

> 
> >   could fragmentation size go above the limit of one level?
> 
> I think so, but I never confirmed with the hardware guys. The maximum 
> fragment size is 1 or 2GB IIRC and that's normally way larger than a 
> single page table.
> 
> > > After Vega10 we more or less have the same as on x86_64 CPUs where you
> > > set a bit in the page directory entry to stop the fetcher and use that
> > > address instead. This way you not only make the TLB much faster, but
> > > also save the last layer in the page table tree.
> > 
> > I assumed most of the benefits of large pages came from increased TLB
> > coverage. Do shorter page table walks bring significant performance
> > impact?
> 
> Never measured it, but I would strongly assume so. See when you can skip 
> the last level of a page table walk in a four level tree you basically 
> make each full walk 25% more efficient.

That makes sense, but a high impact of PTW time on overall performance
would imply a poor TLB/MMU$ hit rate. Was the change targeted at graphics
or compute workloads?
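
To make the arithmetic concrete, here is a rough sketch of the reasoning. It
assumes one memory fetch per page table level, no page-walk caches, and a
fixed fetch latency; the function names and all numbers are made up for
illustration and are not measurements of any hardware.

```python
def walk_fetches(levels, skipped_levels=0):
    """Memory fetches needed for one full page table walk.

    Assumption: each level costs exactly one fetch and nothing is cached.
    """
    return levels - skipped_levels


def avg_walk_cost(tlb_hit_rate, levels, fetch_cost, skipped_levels=0):
    """Average walk cost per translation, in the units of fetch_cost.

    Only TLB misses trigger a walk, so the walk cost is weighted by the
    miss rate. This is why a large PTW impact implies a poor hit rate.
    """
    miss_rate = 1.0 - tlb_hit_rate
    return miss_rate * walk_fetches(levels, skipped_levels) * fetch_cost


# Four-level tree, last level skipped: 3 fetches instead of 4,
# i.e. 25% fewer fetches per full walk.
full = walk_fetches(4)
short = walk_fetches(4, skipped_levels=1)
saving = (full - short) / full

# Hypothetical hit rates: the absolute gain shrinks as the hit rate rises.
gain_poor = avg_walk_cost(0.50, 4, 100) - avg_walk_cost(0.50, 4, 100, 1)
gain_good = avg_walk_cost(0.99, 4, 100) - avg_walk_cost(0.99, 4, 100, 1)
```

With these made-up inputs the per-walk saving is a quarter of the fetches,
but the average gain per translation is far smaller at a 99% hit rate than
at 50%, which is the point behind the question above.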

> 
> > Does it mean that GPUVM does not have PTW prefix caches?
> 
> Vega10 does have a cache for page directory entries, but we have seen 
> significant improvement when we stopped to use that and instead used the 
> L2 with 2MB pages.

Does this perform better than using old style fragment based 2MB pages?
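
For reference, a small sketch of why either approach increases TLB coverage.
It assumes a 4KB base page and that the fragment field encodes the log2 of
the number of contiguous base pages; the TLB entry count is hypothetical, and
real hardware also requires the range to be suitably aligned, which is
ignored here.

```python
PAGE_SIZE = 4096  # assumed base page size in bytes


def fragment_exponent(contiguous_bytes):
    """Largest log2(number of base pages) that fits in a contiguous range.

    Assumption: the fragment field stores this exponent. Alignment
    requirements are ignored in this sketch.
    """
    pages = contiguous_bytes // PAGE_SIZE
    if pages < 1:
        raise ValueError("range is smaller than one base page")
    return pages.bit_length() - 1


def tlb_reach(entries, fragment_exp):
    """Bytes of address space covered by a TLB with the given entry count,
    when every entry maps one fragment of 2**fragment_exp base pages."""
    return entries * (PAGE_SIZE << fragment_exp)


frag_2mb = fragment_exponent(2 * 1024 * 1024)   # 512 pages
frag_2gb = fragment_exponent(2 * 1024 ** 3)     # 2**19 pages

# Hypothetical 64-entry TLB: coverage with 4KB pages vs. 2MB fragments.
reach_4k = tlb_reach(64, 0)
reach_2m = tlb_reach(64, frag_2mb)
```

Both a 2MB fragment and a 2MB page directory entry give the same coverage in
this model; the difference is only whether the last level of the walk is
still fetched.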

> > sorry for the flurry of questions. I never looked into how GPUVM worked
> >   and assumed it used design choices tailored to benefit graphics
> > workloads. The "cpu-ization" of the address translation hierarchy is
> > rather interesting.
> 
> Well it's certainly a hot topic, cause it can affect memory throughput 
> significantly.

Your answers are greatly appreciated. I always assumed that graphics
workload requirements would be at odds with SVM CPU/GPU integration, so
it's interesting to find out that this might not be the case. (It'd also be
interesting to know what impact the cache hierarchy changes in GCN
had on graphics vs. compute performance.)

thanks,
Jan

> 
> Regards,
> Christian.
> 

-- 
Jan Vesely <jan.vesely at rutgers.edu>