drm/radeon/kms: improve performance of blit-copy

Ilija Hadzic ihadzic at research.bell-labs.com
Wed Oct 12 20:29:33 PDT 2011


The following set of patches will improve the performance
of blit-copy functions for Radeon GPUs based on 
R600, R700, Evergreen and NI ASICs.

The foundation for improvement is the use of tiled mode access
(which, for copying buffer objects, can be used regardless of whether the
content is tiled or not), and segmenting the memory block
being copied into rectangles whose edge ratio is between 1:1
and 1:2. This maximizes the number of PCIe transactions that
use maximum payload size (typically 128 bytes) and also 
creates a memory access pattern that is more favorable for
both VRAM and host DRAM than what's currently in the kernel.
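
To illustrate the segmentation idea only (this is a minimal sketch, not
the code from the patches): the names pick_rect, count_rects and max_dim
are hypothetical, "units" stands for whatever granularity the copy uses,
and restricting the edges to powers of two is an assumption made to keep
the example short. Each step carves off the largest rectangle whose
width is either equal to or twice its height and that still fits in the
remaining block.

```c
#include <stdint.h>

/*
 * Pick the largest w x h rectangle with power-of-two edges, w == h or
 * w == 2*h, w and h <= max_dim, and w*h <= units. Returns the area.
 * Hypothetical helper, for illustration only.
 */
uint64_t pick_rect(uint64_t units, unsigned max_dim, unsigned *w, unsigned *h)
{
	unsigned cw = 1, ch = 1;

	if (units == 0 || max_dim == 0) {
		*w = 0;
		*h = 0;
		return 0;
	}

	for (;;) {
		if (cw == ch) {
			/* square: try to double the width (1:1 -> 1:2) */
			if ((uint64_t)cw * 2 > max_dim ||
			    (uint64_t)cw * 2 * ch > units)
				break;
			cw *= 2;
		} else {
			/* 1:2: try to double the height (back to 1:1) */
			if ((uint64_t)ch * 2 > max_dim ||
			    (uint64_t)cw * ch * 2 > units)
				break;
			ch *= 2;
		}
	}

	*w = cw;
	*h = ch;
	return (uint64_t)cw * ch;
}

/* Count how many rectangles are needed to cover the whole block. */
unsigned count_rects(uint64_t units, unsigned max_dim)
{
	unsigned n = 0;

	while (units > 0) {
		unsigned w, h;
		uint64_t area = pick_rect(units, max_dim, &w, &h);

		if (area == 0)
			break;	/* max_dim == 0: nothing can be copied */
		units -= area;
		n++;
	}
	return n;
}
```

With these assumptions, a block of 1000 units and max_dim of 8192 is
covered by six rectangles: 32x16, 16x16, 16x8, 8x8, 8x4 and 4x2.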

To come up with the new blit-copy code, I did a lot of
PCIe traffic analysis with a bus analyzer and also
had many discussions with Alex, trying to explain what's
going on (thanks to Alex for his time).

Below (at the end of this note) are the results of some benchmarks
that I did with various GPUs (all in the same host: Intel i7 CPU,
X58 chipset, three DRAM channels). To run the tests on your machine,
load the radeon module with the 'benchmark=1 pcie_gen2=1' parameters.
The most significant improvement is in the upstream (VRAM to GART)
direction, because that's where the PCIe transactions were fragmented
and also where the memory access pattern created a lot of
backpressure from the host.

It is also interesting that high-end devices (e.g. Cayman) exhibit
the least improvement and were the worst to begin with. This is
because high-end devices copy more tiles in parallel which 
in turn can create bank conflicts on host memory and cause the
host to do lots of bank-close/precharge/bank-open cycles. 

As an added "bonus", I also did some code cleanup and consolidated
the repeated code into common functions, so r600 and evergreen/NI
parts now share the blit-copy code. I also expanded the
benchmark coverage, so the module now takes a benchmark parameter
value between 1 and 8, each of which runs a different
benchmark.

For details, see the commit log messages and the code.
I have been running with these patches for a few months
(and I kept rebasing them to drm-core-next as the public
git progressed) and I used them in a system setup that does
*many* copies of this kind (and does them frequently); I
have not seen instabilities introduced by these patches. I also
verified the correctness of the copy using the test=1 parameter
for each GPU that I had, and the test passed.

I would welcome some feedback, and if you run the benchmarks
with the new blit code, I would very much like to hear
what kind of improvement you are seeing.


BENCHMARK RESULTS:
==================

1) VRAM to GTT 
==============

Card (ASIC)	VRAM		Before	After
---------------------------------------------
5570 (Redwood)	DDR3 1600MHz	 454	3912
6450 (Caicos)	DDR5 3200MHz	3718	5090
6570 (Turks)	DDR3 1800MHz	 484	4144
5450 (Cedar)	DDR3 1600MHz	3679	5090
5450 (Cedar)	DDR2  800MHz	2695	4639
E4690 (RV730)	DDR3 1400MHz	 485	4969
E6760 (Turks)	DDR5 3200MHz	 474	4177
V5700 (RV730)	DDR3 ????MHz	 488	4297
2260 (RV620)	DDR2 ????MHz	 494	3093
6870 (Barts)	DDR5 4200MHz	 475	1113
6970 (Cayman)	DDR5 4200MHz	 473	 710

2) GTT to VRAM
==============

Card (ASIC)	VRAM		Before	After
---------------------------------------------
5570 (Redwood)	DDR3 1600MHz	3158	3360
6450 (Caicos)	DDR5 3200MHz	2995	3393
6570 (Turks)	DDR3 1800MHz	3039	3339
5450 (Cedar)	DDR3 1600MHz	3246	3404
5450 (Cedar)	DDR2  800MHz	2614	3371
E4690 (RV730)	DDR3 1400MHz	3084	3426
E6760 (Turks)	DDR5 3200MHz	2443	2570
V5700 (RV730)	DDR3 ????MHz	3187	3506
2260 (RV620)	DDR2 ????MHz	 584	3246
6870 (Barts)	DDR5 4200MHz	2472	2601
6970 (Cayman)	DDR5 4200MHz	2460	2737

