[v2] drm/mgag200: Add a workaround for low-latency

Tue Mar 12 14:25:08 UTC 2024

On 12/03/2024 13:56, Sui Jingfeng wrote:
> Hi,
> 
> 
> Interesting patch! I know this patch is already merged.
> While studying it, I have a few questions.
> 
> 
> On 2024/2/8 17:51, Jocelyn Falempe wrote:
>> We found a regression in v5.10 on a real-time server, using the
>> rt-kernel and the mgag200 driver. It's a really specialized
>> workload, with a <10us latency expectation on an isolated core.
>> After v5.10, the real-time tasks missed their <10us latency target
>> whenever something was printed on the screen (fbcon or printk).
>>
>> The regression has been bisected to 2 commits:
>> commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>> commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>
>> The first one changed the system memory framebuffer from Write-Combine
>> to the default caching.
>> Before the second commit, the mgag200 driver used to unmap the
>> framebuffer after each frame, which implicitly does a cache flush.
> 
> 
> I don't understand why it needs to do a cache flush. Where is the code?
> I'm asking because I want to study this technique.
> 
> Generally speaking, on x86-64 the default page caching mode is cached.
> And I think a cached mapping is the fastest for software rendering. And
> the platform guarantees coherency for us, right?
> 
> Is that because the x86-64 platform's (or CPU's) write buffer is
> implemented on top of the cache? I mean that for ARM (or other) CPUs,
> when using Write-Combine, the data has nothing to do with the cache.
> 
>> Both regressions are fixed by this commit, which restores the WC mapping
>> for the framebuffer in system memory, and adds a cache flush.
> 
> So switching back to WC will probably decrease overall performance, I think.
> And the cache flush operation should not have an impact, unless x86-64's
> Write-Combine is different from other platforms' Write-Combine?

Yes, this patch is a bit weird. Usually you want your VRAM mapping to be
Write-Combine. Here it also sets the system memory framebuffer as
Write-Combine. On most x86-64 CPUs, Write-Combine uses its own hardware
buffer that is not in L1/L2/L3. So when it copies the framebuffer from
WC system memory to VRAM, it doesn't involve the cache, and has less
impact on latency for other tasks running on other CPUs.
Also, I think the cache flush is important to flush those WC buffers, so
when the next frame comes, it won't have to wait for the buffers to be
copied to the slow VRAM.
When running the latency tests, it's obvious that both are needed.
This is how I understand it, but I may be wrong.
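
To make the "copy, then flush" idea above concrete, here is a minimal
userspace sketch in C. It is not the driver code: the names flush_range()
and copy_damage_and_flush() are hypothetical, the 64-byte cache line is an
assumption, and ordinary malloc/stack memory is cached rather than
Write-Combine, so it only illustrates the order of operations (copy the
damaged rows from the system memory buffer, then flush the lines that were
read). On non-SSE2 targets the flush degrades to a no-op that still counts
lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>          /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0          /* no flush instruction: only count lines */
#endif

#define CACHE_LINE 64u          /* assumed cache-line size in bytes */

/* Flush every cache line overlapping [addr, addr + len).
 * Returns the number of cache lines visited. */
static size_t flush_range(const void *addr, size_t len)
{
    if (len == 0)
        return 0;

    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

#if HAVE_CLFLUSH
    _mm_mfence();               /* order earlier stores before the flushes */
#endif
    for (; p < end; p += CACHE_LINE) {
#if HAVE_CLFLUSH
        _mm_clflush((const void *)p);
#endif
        lines++;
    }
#if HAVE_CLFLUSH
    _mm_mfence();               /* wait for the flushes to complete */
#endif
    return lines;
}

/* Copy rows [y0, y1) from the system memory buffer ("shadow") to the
 * destination ("vram"), then flush the source rows that were just read.
 * Both buffers are assumed to share the same pitch (bytes per row). */
static size_t copy_damage_and_flush(uint8_t *vram, const uint8_t *shadow,
                                    size_t pitch, size_t y0, size_t y1)
{
    assert(vram != NULL && shadow != NULL && y0 <= y1);

    size_t off = y0 * pitch;
    size_t len = (y1 - y0) * pitch;

    memcpy(vram + off, shadow + off, len);
    return flush_range(shadow + off, len);
}
```

Calling copy_damage_and_flush(dst, src, pitch, y0, y1) once per damaged
rectangle mirrors the per-frame behaviour discussed above; whether the
flush really helps latency can only be judged with the real WC mapping and
the latency tests.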

-- 

Jocelyn

> 
> 
>> This is only needed on x86_64, for low-latency workload,
>> so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
>> PREEMPT_RT and X86.
>>
>> For more context, the whole thread can be found here [1]
>>
>> Signed-off-by: Jocelyn Falempe <jfalempe at redhat.com>
>> Link: https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1
>> Acked-by: Thomas Zimmermann <tzimmermann at suse.de>
>