drm/radeon/kms: improve performance of blit-copy
Ilija Hadzic
ihadzic at research.bell-labs.com
Wed Oct 12 20:29:33 PDT 2011
The following set of patches will improve the performance
of blit-copy functions for Radeon GPUs based on
R600, R700, Evergreen and NI ASICs.
The foundation for improvement is the use of tiled mode access
(which for copying bo's can be used regardless of whether the
content is tiled or not), and segmenting the memory block
being copied into rectangles whose edge ratio is between 1:1
and 1:2. This maximizes the number of PCIe transactions that
use maximum payload size (typically 128 bytes) and also
creates a memory access pattern that is more favorable for
both VRAM and host DRAM than what's currently in the kernel.
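To illustrate the segmentation idea, here is a minimal sketch in C. It is not the code from the patches: the names (blit_rect, next_rect, split_into_rects) are hypothetical, sizes are counted in abstract "units" (one unit could stand for one GPU page), and max_dim is assumed to be a small power of two. It shows one simple way to carve a linear block into rectangles whose width-to-height ratio is always 1:1 or 2:1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rectangle descriptor; dimensions are in abstract units. */
struct blit_rect {
	unsigned w;
	unsigned h;
};

/*
 * Return the largest rectangle with power-of-two sides, with
 * w == h or w == 2 * h, whose area does not exceed `remaining`
 * and whose sides do not exceed `max_dim`.
 * Assumption: max_dim is a power of two small enough that
 * 2 * max_dim * max_dim fits in an unsigned int.
 */
static struct blit_rect next_rect(unsigned remaining, unsigned max_dim)
{
	struct blit_rect r = { 1, 1 };

	assert(remaining > 0 && max_dim > 0);

	for (;;) {
		if (r.w == r.h) {
			/* Square: try doubling the width (ratio becomes 2:1). */
			if (r.w * 2 > max_dim || r.w * 2 * r.h > remaining)
				break;
			r.w *= 2;
		} else {
			/* 2:1: try doubling the height (ratio returns to 1:1). */
			if (r.h * 2 * r.w > remaining)
				break;
			r.h *= 2;
		}
	}
	return r;
}

/*
 * Split `total` units into rectangles, largest first.
 * Writes at most `out_len` rectangles and returns how many were written.
 */
static size_t split_into_rects(unsigned total, unsigned max_dim,
			       struct blit_rect *out, size_t out_len)
{
	size_t n = 0;

	while (total > 0 && n < out_len) {
		out[n] = next_rect(total, max_dim);
		total -= out[n].w * out[n].h;
		n++;
	}
	return n;
}
```

With this sketch, a block of 13 units and a large max_dim is split into three rectangles: 4x2, 2x2 and 1x1. Each rectangle would then be copied with one blit, and the leftover tail shrinks quickly because every step takes more than half of what remains (as long as max_dim is not the limiting factor).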
To come up with the new blit-copy code, I did a lot of
PCIe traffic analysis with a bus analyzer and also
had many discussions with Alex, trying to explain what's
going on (thanks to Alex for his time).
Below (at the end of this note) are the results of some benchmarks
that I did with various GPUs (all in the same host: Intel i7 CPU,
X58 chipset, three DRAM channels). To run the tests on your machine,
load the radeon module with the 'benchmark=1 pcie_gen2=1' parameters.
The most significant improvement is in the upstream (VRAM to GART)
direction because that's where the PCIe transactions were fragmented
and also where the memory access pattern created a lot of
backpressure from the host.
It is also interesting that high-end devices (e.g. Cayman) exhibit
the least improvement and were the worst to begin with. This is
because high-end devices copy more tiles in parallel which
in turn can create bank conflicts on host memory and cause the
host to do lots of bank-close/precharge/bank-open cycles.
As an added "bonus", I also did some code cleanup and consolidated
the repeated code into a common function, so r600 and evergreen/NI
parts now share the blit-copy code. I also expanded the
benchmark coverage, so the module now takes a benchmark parameter
value between 1 and 8, and each value runs a different
benchmark.
For details, see the commit log messages and the code.
I have been running with these patches for a few months
(and I kept rebasing them to drm-core-next as the public
git progressed) and I used them in a system setup that does
*many* copies of this kind (and does them frequently); I
have not seen any instabilities introduced by these patches. I also
verified the correctness of the copy using the test=1 parameter
for each GPU that I had, and the test passed.
I would welcome some feedback and if you run the benchmarks
with the new blit code, I would very much like to hear
what kind of improvement you are seeing.
BENCHMARK RESULTS:
==================
1) VRAM to GTT
==============
Card (ASIC)     VRAM          Before   After
--------------------------------------------
5570 (Redwood)  DDR3 1600MHz     454    3912
6450 (Caicos)   DDR5 3200MHz    3718    5090
6570 (Turks)    DDR3 1800MHz     484    4144
5450 (Cedar)    DDR3 1600MHz    3679    5090
5450 (Cedar)    DDR2 800MHz     2695    4639
E4690 (RV730)   DDR3 1400MHz     485    4969
E6760 (Turks)   DDR5 3200MHz     474    4177
V5700 (RV730)   DDR3 ????MHz     488    4297
2260 (RV620)    DDR2 ????MHz     494    3093
6870 (Barts)    DDR5 4200MHz     475    1113
6970 (Cayman)   DDR5 4200MHz     473     710
2) GTT to VRAM
==============
Card (ASIC)     VRAM          Before   After
--------------------------------------------
5570 (Redwood)  DDR3 1600MHz    3158    3360
6450 (Caicos)   DDR5 3200MHz    2995    3393
6570 (Turks)    DDR3 1800MHz    3039    3339
5450 (Cedar)    DDR3 1600MHz    3246    3404
5450 (Cedar)    DDR2 800MHz     2614    3371
E4690 (RV730)   DDR3 1400MHz    3084    3426
E6760 (Turks)   DDR5 3200MHz    2443    2570
V5700 (RV730)   DDR3 ????MHz    3187    3506
2260 (RV620)    DDR2 ????MHz     584    3246
6870 (Barts)    DDR5 4200MHz    2472    2601
6970 (Cayman)   DDR5 4200MHz    2460    2737