Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?

Michel Dänzer michel at daenzer.net
Tue Jul 23 08:11:33 UTC 2019


On 2019-07-22 11:39 a.m., Timur Kristóf wrote:
>>>
>>> 1. Why is the GTT->VRAM copy so much slower than the VRAM->GTT
>>> copy?
>>>
>>> 2. Why is the bus limited to 24 Gbit/sec? I would expect the
>>> Thunderbolt port to give me at least 32 Gbit/sec for PCIe traffic.
>>
>> That's unrealistic I'm afraid. As I said on IRC, from the GPU POV
>> there's an 8 GT/s x4 PCIe link, so ~29.8 Gbit/s (= 32 billion bit/s;
>> I missed this nuance on IRC) is the theoretical raw bandwidth.
>> However, in practice that's not achievable due to various
>> overhead[0], and I'm only seeing up to ~90% utilization of the
>> theoretical bandwidth with a "normal" x16 link as well. I wouldn't
>> expect higher utilization without seeing some evidence to suggest
>> it's possible.
>>
>>
>> [0] According to
>> https://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
>> , PCIe 3.0 uses 1.54% of the raw bandwidth for its internal encoding.
>> Also keep in mind all CPU<->GPU communication has to go through the
>> PCIe link, e.g. for programming the transfers, in-band signalling
>> from the GPU to the PCIe port where the data is being transferred
>> to/from, ...
> 
> Good point, I used 1024 and not 1000. My mistake.
> 
> There is something else:
> In the same benchmark there is a "fill->GTT  ,SDMA" row which has a
> 4035 MB/s number. If that traffic goes through the TB3 interface then
> we just found our 32 Gbit/sec.

The GPU is only connected to the host via PCIe; there's nowhere else
that traffic could go.
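
(To make the arithmetic explicit, here's a small Python sketch of the
conversions being discussed. This is just a back-of-the-envelope check,
and it assumes the benchmark's "MB/s" means decimal megabytes per
second.)

# Back-of-the-envelope check of the link-rate figures in this thread.
# Assumption: the benchmark's "MB/s" means decimal megabytes per second.

lanes = 4                   # x4 PCIe 3.0 link behind the TB3 port
line_rate = 8e9             # 8 GT/s per lane

raw = lanes * line_rate     # 32e9 bit/s on the wire
print(f"raw link rate: {raw / 1e9:.1f} Gbit/s (= {raw / 2**30:.1f} Gibit/s)")
print(f"~90% of raw:   {0.9 * raw / 1e9:.1f} Gbit/s")

# The "fill->GTT, SDMA" figure from the benchmark, converted the same way:
fill_gtt_bits = 4035e6 * 8  # 4035 MB/s -> bit/s
print(f"fill->GTT:     {fill_gtt_bits / 1e9:.2f} Gbit/s")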


> Now the question is, if I understand this correctly and the SDMA can
> indeed do 32 Gbit/sec for "fill->GTT", then why can't it do the same
> with other kinds of transfers? Not sure if there is a good answer to
> that question though.
> 
> Also I still don't fully understand why GTT->VRAM is slower than
> VRAM->GTT, when the bandwidth is clearly available.

While those are interesting questions at some level, I don't think
answering them will get us closer to solving your problem. That comes
down to identifying inefficient transfers across PCIe and optimizing
them.


> Side note: with regards to that 1.5% figure, the TB3 tech brief[0]
> explicitly mentions this and says that it isn't carried over: "the
> underlying protocol uses some data to provide encoding overhead which
> is not carried over the Thunderbolt 3 link reducing the consumed
> bandwidth by roughly 20 percent (DisplayPort) or 1.5 percent (PCI
> Express Gen 3)"

That just means the internal TB3 link only carries the payload data from
the PCIe link, not the 1.5% of bits used for the PCIe encoding. TB3
cannot magically make the PCIe link itself work without the encoding.
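
(For completeness, the 1.5% figure is just the PCIe 3.0 128b/130b
framing overhead; below is a quick sketch of that arithmetic, with
nothing Thunderbolt-specific assumed.)

# Where the ~1.5% PCIe 3.0 encoding overhead comes from: every 130 bits
# on the wire carry 128 bits of payload (128b/130b framing).

line_rate = 8e9                          # bit/s per PCIe 3.0 lane
overhead = 1 - 128 / 130                 # ~1.54%
payload_per_lane = line_rate * 128 / 130

print(f"encoding overhead: {overhead * 100:.2f} %")
print(f"payload, x4 link:  {4 * payload_per_lane / 1e9:.2f} Gbit/s")

# Thunderbolt 3 re-packing only the 128-bit payloads saves bandwidth
# inside the TB3 tunnel; the GPU's own PCIe link still runs 128b/130b,
# so its usable rate is unchanged.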


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer

