[Mesa-dev] [PATCH 0/8] Gallium & RadeonSI optimization for Ryzen CPUs

Thu Sep 6 20:56:31 UTC 2018

Yeah, by pinning to cores, I meant pinning to a group of cores.

I think a reasonable policy would be for the kernel to put all threads
of a given process on the same L3 as long as the number of threads is
lower than the L3 group size. When there are more threads, I guess it
would need heuristics to pick which threads to put together.

I fear that if we begin to do the work manually, there won't be interest
in doing it in the kernel, and thus all applications will need to
include such core pinning code to get good performance when
multithreaded.
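
To illustrate the kind of code each application would end up carrying,
here is a rough, Linux-only sketch. It is not taken from the series. It
assumes that sysfs cache index3 is the L3 (common on x86, but a real
implementation should check the "level" file), it reads "lower than the
L3 group size" as "fits within one L3 group", and the function names
parse_cpu_list, l3_siblings, fits_in_one_l3 and pin_self_to_current_l3
are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a sysfs-style CPU list such as "0-3,8-11" into a cpu_set_t.
 * Returns the number of CPUs in the set, or -1 on malformed input. */
int parse_cpu_list(const char *list, cpu_set_t *set)
{
    assert(list != NULL && set != NULL);
    CPU_ZERO(set);
    const char *p = list;
    while (*p && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;
        long hi = lo;
        if (*end == '-') {              /* a range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        if (hi >= CPU_SETSIZE)
            return -1;
        for (long c = lo; c <= hi; c++)
            CPU_SET((int)c, set);
        p = end;
        if (*p == ',')
            p++;
        else if (*p && *p != '\n')
            return -1;
    }
    return CPU_COUNT(set);
}

/* Read the CPUs that share a last-level cache with `cpu`.
 * Assumption: cache/index3 is the L3 on this machine. */
int l3_siblings(int cpu, cpu_set_t *set)
{
    char path[128], buf[256];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
             cpu);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    return ok ? parse_cpu_list(buf, set) : -1;
}

/* Policy from the text: only group the threads when they all fit
 * within one L3 group (assumed reading of "lower than"). */
int fits_in_one_l3(int nthreads, int l3_group_size)
{
    return nthreads > 0 && nthreads <= l3_group_size;
}

/* Restrict the calling thread to the L3 group of the CPU it is
 * currently running on. Returns 0 on success, -1 on failure. */
int pin_self_to_current_l3(void)
{
    cpu_set_t set;
    int cpu = sched_getcpu();
    if (cpu < 0 || l3_siblings(cpu, &set) <= 0)
        return -1;
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Each thread that should stay together would call
pin_self_to_current_l3() once, or inherit the mask from the thread that
created it. Nothing here knows about other processes, which is exactly
the problem when several applications pick the same group.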

Axel

On 9/6/18 9:21 PM, Marek Olšák wrote:
> Actually, you make a good point about the kernel, but the kernel has
> no visibility into which threads need to be coupled together. So the
> kernel can't do anything.
>
> Marek
>
> On Thu, Sep 6, 2018 at 2:24 PM, Marek Olšák <maraeo at gmail.com> wrote:
>> I think you are missing the point. This series doesn't pin threads to
>> cores. It pins threads to one L3, which can have 4 or 8 cores.
>>
>> Marek
>>
>> On Thu, Sep 6, 2018 at 5:22 AM, Axel Davy <davyaxel0 at gmail.com> wrote:
>>> Hi Marek,
>>>
>>> Shouldn't this core pinning be handled by the kernel?
>>>
>>> Otherwise all multithreaded games (or applications) need an update.
>>>
>>> I also see a risk in applications handling the core pinning: several
>>> intensive applications may pin to the same cores. The kernel would be
>>> able to automatically switch the pinned cores if the load would be
>>> better shared among cores that way.
>>>
>>> Yours,
>>>
>>> Axel Davy
>>>
>>>
>>> On 9/6/18 6:02 AM, Marek Olšák wrote:
>>>> Hi,
>>>>
>>>> When the Ryzen CPUs were launched, they didn't perform very well in
>>>> games, and it took a while before games were patched. Guess what,
>>>> Mesa drivers have suffered from the same inefficiencies until now.
>>>>
>>>> The AMD Zen architecture has multiple core complexes (CCX) where each
>>>> CCX has e.g. 4C/8T and always one L3 cache. If application and driver
>>>> threads don't run on the same CCX, communication between threads is
>>>> slow, because multiple L3 caches must maintain coherency between them.
>>>> Atomic operations seem to suffer the most, almost as if they were
>>>> uncached. (are they?)
>>>>
>>>> This series pins the application thread and all driver execution
>>>> threads to 1 L3 cache (1 CCX). If the application thread is already
>>>> pinned to a hw thread or core(s), all driver threads are pinned to
>>>> the same L3 cache (CCX) as the application thread.
>>>>
>>>> Shader compiler threads are unpinned, as they are not critical.
>>>>
>>>> The piglit/drawoverhead microbenchmark shows that this increases
>>>> performance by 32% for DrawElements and 25% for DrawArrays on Ryzen
>>>> 1st-Gen CPUs. It will probably be much less with real apps.
>>>>
>>>> Please review.
>>>>
>>>> Thanks,
>>>> Marek