<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="text-align:left; direction:ltr;">
<div>On Thu, 2022-03-24 at 15:28 -0400, Alex Deucher wrote:</div>
<blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex">
<pre>On Thu, Mar 24, 2022 at 9:43 AM Hoosier, Matt <</pre>
<a href="mailto:Matt.Hoosier@garmin.com">
<pre>Matt.Hoosier@garmin.com</pre>
</a>
<pre>> wrote:</pre>
<blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex">
<pre><br></pre>
<pre>On Thu, 2022-03-24 at 11:56 +0200, Pekka Paalanen wrote:</pre>
<pre><br></pre>
<pre>On Wed, 23 Mar 2022 14:19:08 +0000</pre>
<pre>"Hoosier, Matt" <</pre>
<a href="mailto:Matt.Hoosier@garmin.com">
<pre>Matt.Hoosier@garmin.com</pre>
</a>
<pre>> wrote:</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Hi,</pre>
<pre><br></pre>
<pre><br></pre>
<pre>I recently had a reason to wade through Mutter's code to support</pre>
<pre>systems with more than one GPU. I was a bit surprised to see that</pre>
<pre>it supports several different strategies for dealing with</pre>
<pre>scanning out a buffer on a KMS output not associated with the GPU</pre>
<pre>where the buffer was originally rendered.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Hi,</pre>
<pre><br></pre>
<pre><br></pre>
<pre>indeed. The reason for multiple paths is that different systems</pre>
<pre>don't support all ways, or do support some of the ways but the</pre>
<pre>performance might be abysmal. Knowing which path to take is a</pre>
<pre>difficult problem, and other than benchmarking (which I didn't</pre>
<pre>implement in Mutter) you can't really know if what you picked is</pre>
<pre>going to be fine.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>In particular, the approach of using the secondary GPU's OpenGL</pre>
<pre>implementation to blit into a dumb buffer was really unexpected.</pre>
<pre>Typically, dumb buffers get described as a really slow, uncached</pre>
<pre>mapping of GPU memory into the CPU.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>The support got added here (by Pekka):</pre>
<pre><br></pre>
<pre><br></pre>
<a href="https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/615">
<pre>https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/615</pre>
</a>
<pre><br></pre>
<pre><br></pre>
<pre><br></pre>
<pre><br></pre>
<pre>That MR is using the primary GPU to blit, not the secondary GPU.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>If a secondary GPU can have a hardware accelerated OpenGL context,</pre>
<pre>I don't know why anyone would deliberately use dumb buffers on that</pre>
<pre>device with OpenGL. GBM offers better ways.</pre>
<pre><br></pre>
<pre><br></pre>
<pre><br></pre>
<pre>This MR cover letter has a better overview of all the methods:</pre>
<pre><br></pre>
<a href="https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/810">
<pre>https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/810</pre>
</a>
<pre><br></pre>
<pre><br></pre>
<pre><br></pre>
<pre>Ah, even nicer. Thanks!</pre>
<pre><br></pre>
<pre>In the ranked-order list of strategies there, the zero-copy technique is less preferred than the secondary GPU copy technique. Seems like you'd rarely ever fall through to the zero-copy strategy even if the GPU drivers do both support it. Anything subtle going on there that's good to be aware of? Like maybe a given driver typically supports secondary-GPU-copy XOR zero-copy, so you're fairly likely to reach the second strategy on systems that can handle it.</pre>
<pre><br></pre>
<pre><br></pre>
<pre><br></pre>
<pre>If I follow this right, the blit occurs directly between video</pre>
<pre>memory owned by the primary GPU into dumb-buffer memory owned by</pre>
<pre>the secondary GPU, without laboriously using the CPU to do PIO.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Correct.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Does this imply that the two GPUs' drivers have to be at least</pre>
<pre>minimally aware of each other to negotiate some kind of DMA path</pre>
<pre>directly between the two?</pre>
<pre><br></pre>
<pre><br></pre>
<pre>I don't know the details. It depends on whether you can map</pre>
<pre>secondary GPU memory to be written by the primary GPU. The specific</pre>
<pre>use case here is iGPU as primary and virtual as secondary, which</pre>
<pre>means that video memory for both is more or less "system RAM". No</pre>
<pre>discrete VRAM involved.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Oh interesting. I hadn't realized that on the hybrid GPU systems even the dGPU uses system RAM. But on thinking about it, that's probably the only efficient way for the hardware to be designed.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>It is accomplished through the kernel dmabuf framework where</pre>
<pre>drivers export and import dmabuf.</pre>
<pre><br></pre>
<pre><br></pre>
<pre>Right, makes sense.</pre>
<pre><br></pre>
<pre>So I wonder how I should reason about a system that's configured with two of the same discrete graphics card (AMD, if it matters). The compositor would arbitrarily pick whichever of those happened to enumerate first as the primary, and then it's down to the driver details as to which of the four migration paths gets chosen? For the moment, let's assume that none of the stock applications is bothering to use any sort of advanced dmabuf hinting to pick the right GPU node to correspond to the output on which it will eventually display.</pre>
<pre><br></pre>
</blockquote>
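<pre>A minimal sketch of the fall-through question above: with a ranked list, a lower-ranked strategy is reached only when every higher-ranked probe fails. The four labels and the caps dictionary are hypothetical placeholders inferred from this thread, not Mutter's identifiers, and treating the CPU copy as always available is an assumption.</pre>

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A strategy is a label plus a probe that reports whether it is usable.
Strategy = Tuple[str, Callable[[], bool]]


def pick_strategy(ranked: Sequence[Strategy]) -> Optional[str]:
    """Return the label of the first strategy whose probe succeeds."""
    for name, probe in ranked:
        if probe():
            return name
    return None


def ranked_for(caps: Dict[str, bool]) -> List[Strategy]:
    """Build a ranked list from a hypothetical capability dictionary.

    The order places the secondary GPU copy above zero-copy, as the
    thread describes; the remaining order is an assumption.
    """
    return [
        ("secondary-gpu-copy", lambda: caps.get("secondary_gpu_copy", False)),
        ("zero-copy", lambda: caps.get("zero_copy", False)),
        ("primary-gpu-copy", lambda: caps.get("primary_gpu_copy", False)),
        # Assumed to be the last resort that always works.
        ("primary-cpu-copy", lambda: True),
    ]
```

<pre>When both secondary_gpu_copy and zero_copy are reported, this sketch never reaches zero-copy, which is the behaviour the question describes.</pre>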
<pre><br></pre>
<pre>Since you asked about AMD GPUs, the iGPU render engines can support</pre>
<pre>reading or writing to carve out (system memory carved out for iGPU</pre>
<pre>use), system memory, or remote device memory (e.g., remote device PCI</pre>
<pre>BARs). The iGPU display hardware supports reading from carve out or</pre>
<pre>system memory. For dGPUs, the render engines can support reading and</pre>
<pre>writing to system memory (including iGPU carve out), local VRAM, or</pre>
<pre>remote device memory (e.g., remote device PCI BARs). That said, dGPUs</pre>
<pre>perform best when the buffers are in local VRAM. So ideally if you</pre>
<pre>are using the dGPU for rendering, that would be done in VRAM. Then if</pre>
<pre>the app is being rendered by the dGPU and the app is fullscreen and</pre>
<pre>the display is attached to the iGPU, you'd need to copy that buffer to</pre>
<pre>system memory or carve out so it could be displayed by the iGPU</pre>
<pre>display hardware. If the compositor is compositing the rendering with</pre>
<pre>other buffers on the iGPU, then it may make more sense to keep it in</pre>
<pre>VRAM and just read it from the render engine in the iGPU and write the</pre>
<pre>resulting frame to the display buffer on the iGPU. This also makes</pre>
<pre>sense if you have two dGPUs. Ideally you'd read directly from the</pre>
<pre>remote device memory rather than taking a trip through system RAM. As</pre>
<pre>Pekka said, it gets complicated quickly.</pre>
<pre><br></pre>
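<pre>The access rules in the paragraph above can be written down as a small lookup table. This is only a sketch of that description: the engine names and the Mem enum are hypothetical, memory is named from the accessing engine's point of view (another GPU's VRAM is "remote"), and dGPU display hardware is left out because it is not described here.</pre>

```python
from enum import Enum


class Mem(Enum):
    CARVE_OUT = "carve-out"
    SYSTEM = "system memory"
    LOCAL_VRAM = "local VRAM"
    REMOTE = "remote device memory"


# Which memory each engine can access directly, per the description above.
ACCESS = {
    "igpu_render": {Mem.CARVE_OUT, Mem.SYSTEM, Mem.REMOTE},
    "igpu_display": {Mem.CARVE_OUT, Mem.SYSTEM},
    "dgpu_render": {Mem.SYSTEM, Mem.CARVE_OUT, Mem.LOCAL_VRAM, Mem.REMOTE},
}


def needs_copy(engine: str, buffer_location: Mem) -> bool:
    """True if the engine cannot use the buffer where it currently lives."""
    return buffer_location not in ACCESS[engine]
```

<pre>For a fullscreen app rendered into dGPU VRAM, needs_copy("igpu_display", Mem.REMOTE) is true, so the buffer has to be copied to system memory or carve out first. For the compositing case, needs_copy("igpu_render", Mem.REMOTE) is false, so the iGPU render engine can read the VRAM in place.</pre>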
<pre>What you really want to avoid is reading device memory or carve out</pre>
<pre>with the CPU. Not only does it go over the PCI bus, but MMIO space is</pre>
<pre>usually mapped uncached on the CPU, so you'll be doing uncached reads</pre>
<pre>over a relatively slow bus. If you need to get data out of device</pre>
<pre>memory, it is much better to have a device DMA it to somewhere else:</pre>
<pre>either the device where the memory is attached (e.g., DMA from local</pre>
<pre>VRAM to system memory or local VRAM to remote VRAM on another device),</pre>
<pre>or the device that wants access to the memory (e.g., DMA from remote</pre>
<pre>VRAM to system memory or remote VRAM to local VRAM on the device).</pre>
<pre>DisplayLink devices are a bad case of this. Their display hardware</pre>
<pre>is fed from system memory, so you need to get the data from the render</pre>
<pre>device to system memory. If you try to do the copy with the CPU, the</pre>
<pre>performance will be unusable. This should largely work with dma-buf</pre>
<pre>since the dma-buf will be moved to system memory if the importer</pre>
<pre>doesn't support peer to peer DMA, but in a lot of cases, user mode</pre>
<pre>just mmaps the buffer in VRAM rather than importing it as dma-buf and</pre>
<pre>then copies it using the CPU. That really only works if the source</pre>
<pre>buffer is in cached system memory.</pre>
</blockquote>
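<div>As a rough sketch of the rule of thumb in the quoted text, assuming a simplified model with hypothetical names (Source, dmabuf_import_path, cpu_copy_is_reasonable) rather than any real kernel or libdrm API: a dma-buf import either uses peer-to-peer DMA or has the buffer moved to system memory, while a CPU copy is only sensible when the source is already in cached system memory.</div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    # True for VRAM or carve out, False for cached system memory.
    in_device_memory: bool


def dmabuf_import_path(src: Source, importer_supports_p2p: bool) -> str:
    """Describe how an importing device would reach the buffer."""
    if not src.in_device_memory:
        return "read in place from system memory"
    if importer_supports_p2p:
        return "peer-to-peer DMA from device memory"
    # Without peer-to-peer support the buffer is moved to system memory.
    return "moved to system memory, then read by the importer"


def cpu_copy_is_reasonable(src: Source) -> bool:
    """CPU reads of device memory are uncached and slow, so avoid them."""
    return not src.in_device_memory
```

<div>Under this model, a buffer in VRAM with an importer that lacks peer-to-peer support still works through dma-buf, whereas mmapping that same buffer and copying it with the CPU is the case to avoid.</div>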
<div><br>
</div>
<div>Thanks, Pekka and Alex, for these really interesting descriptions.</div>
<div><br>
</div>
<div>Just out of curiosity, I wonder how something like Windows decides on a general policy for coordinating usage of VRAM from multiple GPUs. I'm not so much interested in doing anything with Windows, but it would be an interesting reference to see how a widely
adopted commercial system attacks the problem of handling multiple discrete video cards.</div>
</body>
</html>