drm/radeon/kms: improve performance of blit-copy

Thu Oct 13 12:00:02 PDT 2011

Am 13.10.2011 05:29, schrieb Ilija Hadzic:
> 
> The following set of patches will improve the performance of
> blit-copy functions for Radeon GPUs based on R600, R700, Evergreen
> and NI ASICs.
> 
> The foundation for improvement is the use of tiled mode access (which
> for copying bo's can be used regardless of whether the content is
> tiled or not), and segmenting the memory block being copied into
> rectangles whose edge ratio is between 1:1 and 1:2. This maximizes
> the number of PCIe transactions that use maximum payload size
> (typically 128 bytes) and also creates a memory access pattern that
> is more favorable for both VRAM and host DRAM than what's currently
> in the kernel.
> 
> To come up with the new blit-copy code, I did a lot of PCIe traffic
> analysis with the bus analyzer and also had many discussions with
> Alex, trying to explain what's going on (thanks to Alex for his
> time).
> 
> Below (at the end of this note) are the results of some benchmarks 
> that I did with various GPUs (all in the same host: Intel i7 CPU, X58
> chipset, three DRAM channels). To run the tests on your machine load
> the radeon module with 'benchmark=1 pcie_gen2=1' parameters. Most
> significant improvement is in the upstream (VRAM to GART) direction
> because that's where the PCIe transactions were fragmented and also
> where memory access pattern was such that it created a lot of 
> backpressure from the host.
> 
> It is also interesting that high-end devices (e.g. Cayman) exhibit 
> the least improvement and were the worst to begin with. This is 
> because high-end devices copy more tiles in parallel which in turn
> can create bank conflicts on host memory and cause the host to do
> lots of bank-close/precharge/bank-open cycles.
Interesting stuff! Nice results showing the low-end devices completely
blowing away the high-end ones for VRAM->GTT blits :-).
I guess it isn't possible to temporarily disable some RBEs or otherwise
reconfigure the chip that you could get the same performance for the
high-end chips? Granted the high-end chips are only much slower for
VRAM->GTT according to these results but even the other way it's still
~20% or so.
Anyway, can't comment much on the patches, though the idea certainly
seems to make sense.

Roland


> As an added "bonus", I also did some code cleanup and consolidated 
> the repeated code into common function, so r600 and evergreen/NI 
> parts now share the blit-copy code. I also expanded on the benchmark
> coverage, so the module now takes benckmark parameter value between 1
> and 8 and each results in running a different benchmark.
> 
> For details, see the commit log messages and the code. I have been
> running with these patches for a few months (and I kept rebasing them
> to drm-core-next as the public git progressed) and I used them in a
> system setup that does *many* copying of this kind (and does them
> frequently); I have not seen instabilities introduced by these
> patches. I also verified the correctness of the copy using test=1
> parameter for each GPU that I had and the test passed.
> 
> I would welcome some feedback and if you run the benchmarks with the
> new blit code, I would very much like to hear what kind of
> improvement you are seeing.
>