[PATCH 0/5] radeon: Write-combined CPU mappings of BOs in GTT

Tue Jul 22 23:42:41 PDT 2014

Am 23.07.2014 05:54, schrieb Michel Dänzer:
> On 21.07.2014 17:07, Christian König wrote:
>> Am 19.07.2014 03:15, schrieb Michel Dänzer:
>>> On 19.07.2014 00:47, Christian König wrote:
>>>> Am 18.07.2014 05:07, schrieb Michel Dänzer:
>>>>>>> [PATCH 5/5] drm/radeon: Use VRAM for indirect buffers on >= SI
>>>>>> I'm still not very keen with this change since I still don't
>>>>>> understand
>>>>>> the reason why it's faster than with GTT. Definitely needs more
>>>>>> testing
>>>>>> on a wider range of systems.
>>>>> Sure. If anyone wants to give this patch a spin and see if they can
>>>>> measure any performance difference, good or bad, that would be
>>>>> interesting.
>>>>>
>>>>>> Maybe limit it to APUs for now?
>>>>> But IIRC, CPU writes to VRAM vs. write-combined GTT are actually an
>>>>> even
>>>>> bigger win with dedicated GPUs than with the Kaveri built-in GPU on my
>>>>> system. I suspect it may depend on the bandwidth available for PCIe vs.
>>>>> system memory though.
>>>> I've made a few tests today with the kernel part of the patches running
>>>> Xonotic on Ultra in 1920 x 1080.
>>>>
>>>> Without any patches I get around ~47.0fps on average with my dedicated
>>>> HD7870.
>>>>
>>>> Adding only "drm/radeon: Use write-combined CPU mappings of rings and
>>>> IBs on >= SI" and that goes down to ~45.3fps.
>>>>
>>>> Adding on to off that "drm/radeon: Use VRAM for indirect buffers on >=
>>>> SI" and the frame rate goes down to ~27.74fps.
>>> Hmm, looks like I'll need to do more benchmarking of 3D workloads as
>>> well.
> I haven't been able to consistently[0] measure any significant
> difference between all placements of the rings and IBs with Xonotic or
> Reaction Quake with my Bonaire. I'd expect Xonotic to be shader / GPU
> memory bandwidth bound rather than CS bound anyway, so a ~40% hit from
> that kernel patch alone is very surprising. Are you sure it wasn't just
> the same kind of variation as described below?

Yes, I've measured that multiple times and the results where quite 
consistent.

But I didn't measured it on a Bonaire, where the bottleneck probably 
isn't the CPU load. I measured it on a fast Pitcairn and there Xonotic 
was clearly affected by the patches.

>
> [0] There were slightly different results sometimes, but next time I
> tried the same setup again, it was back to the same as always. So it
> seemed to depend more on the particular system boot / test run / moon
> phase / ... than the kernel patches themselves.
>
>
>>> Alex, given those numbers, it's probably best if you remove the "Use
>>> write-combined CPU mappings of rings and IBs on >= SI" change from your
>>> tree as well for now.
>> I wouldn't go as far as reverting the patch. It just needs a bit more
>> fine tuning and that can happen in the 3.17rc cycle.
> There's no need to revert it, just drop it from the tree. I'd still
> prefer that for now.
>
>
>> My tests clearly show that we still can use USWC for the ring buffer on
>> SI and probably earlier chips as well.
> Yeah, that might be the safest approach for now.
How about using USWC for the rings on all chips since R600 and for the 
IB only on CIK? As far as I can see that should do the trick quite well.

Christian.