[RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

Tue Mar 21 18:55:20 UTC 2023

On 17.03.23 at 15:45, Alex Deucher wrote:
> On Thu, Mar 16, 2023 at 7:09 PM Stefano Stabellini
> <sstabellini at kernel.org> wrote:
>> On Thu, 16 Mar 2023, Juergen Gross wrote:
>>> On 16.03.23 14:53, Alex Deucher wrote:
>>>> On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <jgross at suse.com> wrote:
>>>>> On 16.03.23 14:45, Alex Deucher wrote:
>>>>>> On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <jbeulich at suse.com> wrote:
>>>>>>> On 16.03.2023 00:25, Stefano Stabellini wrote:
>>>>>>>> On Wed, 15 Mar 2023, Jan Beulich wrote:
>>>>>>>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>>>>>>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>>>>>>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>>>>>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>>>>>>>>>> virtualization support when possible. It will use the hardware IOMMU
>>>>>>>>>>>> support instead of xen-swiotlb, so disable swiotlb if the current
>>>>>>>>>>>> domain is Xen PVH.
>>>>>>>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>>>>>>>>>> away without resorting to swiotlb in certain cases (like I/O to an
>>>>>>>>>>> address-restricted device)?
>>>>>>>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>>>>>>>>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>>>>>>>>>> so we can use guest physical addresses instead of machine addresses for
>>>>>>>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>>>>>>>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>>>>>>>>>> case is XENFEAT_not_direct_mapped).
>>>>>>>>> But how does Xen using an IOMMU help with, as said, address-restricted
>>>>>>>>> devices? They may still need e.g. a 32-bit address to be programmed in,
>>>>>>>>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
>>>>>>>>> may fulfill this requirement.
>>>>>>>> In short, it is going to work as long as Linux has guest physical
>>>>>>>> addresses (not machine addresses, those could be anything) lower than
>>>>>>>> 4GB.
>>>>>>>>
>>>>>>>> If the address-restricted device does DMA via an IOMMU, then the device
>>>>>>>> gets programmed by Linux using its guest physical addresses (not machine
>>>>>>>> addresses).
>>>>>>>>
>>>>>>>> The 32-bit restriction would be applied by Linux to its choice of guest
>>>>>>>> physical address to use to program the device, the same way it does on
>>>>>>>> native. The device would be fine as it always uses Linux-provided <4GB
>>>>>>>> addresses. After the IOMMU translation (pagetable setup by Xen), we
>>>>>>>> could get any address, including >4GB addresses, and that is expected to
>>>>>>>> work.
>>>>>>> I understand that's the "normal" way of working. But whatever the swiotlb
>>>>>>> is used for in baremetal Linux, that would similarly require its use in
>>>>>>> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
>>>>>>> me like an incomplete attempt to disable its use altogether on x86. What
>>>>>>> difference of PVH vs baremetal am I missing here?
>>>>>> swiotlb is not usable for GPUs even on bare metal.  They often have
>>>>>> hundreds of megs or even gigs of memory mapped on the device at any
>>>>>> given time.  Also, AMD GPUs support 44-48 bit DMA masks (depending on
>>>>>> the chip family).
>>>>> But the swiotlb isn't per device, but system global.
>>>> Sure, but if the swiotlb is in use, then you can't really use the GPU.
>>>> So you get to pick one.
>>> The swiotlb is used only for buffers which are not within the DMA mask of a
>>> device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
>>> won't use the swiotlb unless you have a buffer above guest physical address of
>>> 16TB (so basically never).
>>>
>>> Disabling swiotlb in such a guest would OTOH mean that a device with only
>>> a 32 bit DMA mask passed through to this guest couldn't work with buffers
>>> above 4GB.
>>>
>>> I don't think this is acceptable.
>> From the Xen subsystem in Linux point of view, the only thing we need to
>> do is to make sure *not* to enable swiotlb_xen (yes "swiotlb_xen", not
>> the global swiotlb) on PVH because it is not needed anyway.
>>
>> I think we should leave the global "swiotlb" setting alone. The global
>> swiotlb is not relevant to Xen anyway, and surely baremetal Linux has to
>> have a way to deal with swiotlb/GPU incompatibilities.
>>
>> We just have to avoid making things worse on Xen, and for that we just
>> need to avoid unconditionally enabling swiotlb-xen. If the Xen subsystem
>> doesn't enable swiotlb_xen/swiotlb, and no other subsystem enables
>> swiotlb, then we have a good Linux configuration capable of handling the
>> GPU properly.
>>
>> Alex, please correct me if I am wrong. How is x86_swiotlb_enable set to
>> false on native (non-Xen) x86?
> In most cases we have an IOMMU enabled and IIRC, TTM has slightly
> different behavior for memory allocation depending on whether swiotlb
> would be needed or not.

Well, "slightly different" is an understatement. We need to disable quite
a bunch of features to make swiotlb work with GPUs.

In particular, userptr and inter-device sharing won't work any more.

Regards,
Christian.

>
> Alex
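
The DMA mask arithmetic quoted above (a 44 bit mask reaching 16TB, a 32 bit
mask reaching 4GB) can be illustrated with a small userspace sketch. This is
not the kernel's code: dma_mask_for_bits() and needs_bounce() are hypothetical
names made up for this example, and the check is only a simplified model of
the decision Juergen refers to in dma_direct_map_page(), assuming a buffer
must be bounced whenever its last byte lies above the device's mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: highest address reachable with an n-bit DMA mask. */
static uint64_t dma_mask_for_bits(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    return bits == 64 ? UINT64_MAX : (((uint64_t)1 << bits) - 1);
}

/*
 * Hypothetical helper: returns true if the buffer [addr, addr + size) is
 * not fully addressable under the given mask, i.e. a bounce buffer (which
 * is what the swiotlb provides) would be needed in this simplified model.
 */
static bool needs_bounce(uint64_t addr, uint64_t size, uint64_t mask)
{
    uint64_t last;

    if (size == 0)
        return false;

    last = addr + size - 1;
    if (last < addr)        /* address range wrapped around */
        return true;

    return last > mask;
}
```

Under these assumptions, a 4096 byte buffer at guest physical address 4GB
does not need bouncing for a device with a 44 bit mask, but does for a device
limited to 32 bits, which is the case that disabling swiotlb would break.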