[RFC PATCH 00/12] Private MMIO support for private assigned dev
Alexey Kardashevskiy
aik at amd.com
Wed May 21 10:41:03 UTC 2025
On 16/5/25 02:53, Zhi Wang wrote:
> On Thu, 15 May 2025 16:44:47 +0000
> Zhi Wang <zhiw at nvidia.com> wrote:
>
>> On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 13/5/25 20:03, Zhi Wang wrote:
>>>> On Mon, 12 May 2025 11:06:17 -0300
>>>> Jason Gunthorpe <jgg at nvidia.com> wrote:
>>>>
>>>>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
>>>>> wrote:
>>>>>
>>>>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
>>>>>>>> it is just about managing the translation control of the
>>>>>>>> device.
>>>>>>>
>>>>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
>>>>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put
>>>>>>> the device in TDISP LOCKED state, so that device behaves
>>>>>>> differently from before. Then why put it in IOMMUFD?
>>>>>>
>>>>>>
>>>>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece
>>>>>> of IOMMU on the host CPU. The device does not know about the VM,
>>>>>> it just enables/disables encryption by a request from the CPU
>>>>>> (those start/stop interface commands). And IOMMUFD won't be
>>>>>> doing DOE, the platform driver (such as AMD CCP) will. Nothing
>>>>>> to do for VFIO here.
>>>>>>
>>>>>> We probably should notify VFIO about the state transition but I
>>>>>> do not know what VFIO would want to do in response.
>>>>>
>>>>> We have an awkward fit for what CCA people are doing to the
>>>>> various Linux APIs. Looking somewhat maximally across all the
>>>>> arches a "bind" for a CC vPCI device creation operation does:
>>>>>
>>>>> - Setup the CPU page tables for the VM to have access to the
>>>>> MMIO
>>>>> - Revoke hypervisor access to the MMIO
>>>>> - Setup the vIOMMU to understand the vPCI device
>>>>> - Take over control of some of the IOVA translation, at least
>>>>> for T=1, and route to the vIOMMU
>>>>> - Register the vPCI with any attestation functions the VM might
>>>>> use
>>>>> - Do some DOE stuff to manage/validate TDISP/etc
>>>>>
>>>>> So we have interactions of things controlled by PCI, KVM, VFIO,
>>>>> and iommufd all mushed together.
>>>>>
>>>>> iommufd is the only area that already has a handle to all the
>>>>> required objects:
>>>>> - The physical PCI function
>>>>> - The CC vIOMMU object
>>>>> - The KVM FD
>>>>> - The CC vPCI object
>>>>>
>>>>> Which is why I have been thinking it is the right place to manage
>>>>> this.
>>>>>
>>>>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>>>>> stays in VFIO.
>>>>>
>>>>>>>> So your issue is you need to shoot down the dmabuf during vPCI
>>>>>>>> device destruction?
>>>>>>>
>>>>>>> I assume "vPCI device" refers to assigned device in both shared
>>>>>>> mode & private mode. So no, I need to shoot down the dmabuf
>>>>>>> during TSM unbind, a.k.a. when assigned device is converting
>>>>>>> from private to shared. Then recover the dmabuf after TSM
>>>>>>> unbind. The device could still work in VM in shared mode.
>>>>>
>>>>> What are you trying to protect with this? Is there some intelism
>>>>> where you can't have references to encrypted MMIO pages?
>>>>>
>>>>
>>>> I think it is a matter of design choice. The encrypted MMIO page is
>>>> related to the TDI context and secure second level translation
>>>> table (S-EPT), and the S-EPT is related to the confidential VM's
>>>> context.
>>>>
>>>> AMD and ARM have another level of HW control which, together
>>>> with a TSM-owned meta table, can simply mask out the access to
>>>> those encrypted MMIO pages. Thus, the life cycle of the encrypted
>>>> mappings in the second level translation table can be de-coupled
>>>> from the TDI unbind. They can be reaped harmlessly later by the
>>>> hypervisor in another path.
>>>>
>>>> While on Intel platform, it doesn't have that additional level of
>>>> HW control by design. Thus, the cleanup of encrypted MMIO page
>>>> mapping in the S-EPT has to be coupled tightly with TDI context
>>>> destruction in the TDI unbind process.
>>>>
>>>> If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
>>>> cross-module notification to KVM to do cleanup in the S-EPT.
>>>
>>> QEMU should know about this unbind and can tell KVM about it too.
>>> No cross module notification needed, it is not a hot path.
>>>
>>
>> Yes. QEMU knows almost everything important, it can do the required
>> flow and kernel can enforce the requirements. There shouldn't be
>> problem at runtime.
>>
>> But if QEMU crashes, all that is left are the fd closing paths
>> and the objects those fds represent in the kernel. The modules
>> those fds belong to need to resolve the dependencies of tearing
>> down objects without the help of QEMU.
>>
>> There will be private MMIO dmabuf fds, VFIO fds, IOMMU device fd, KVM
>> fds at that time. Who should trigger the TDI unbind at this time?
>>
>> I think it should be triggered in the vdevice teardown path in IOMMUfd
>> fd closing path, as it is where the bind is initiated.
This is how I do it now, yes.
>>
>> iommufd vdevice tear down (iommu fd closing path)
>> ----> tsm_tdi_unbind
>> ----> intel_tsm_tdi_unbind
>> ...
>> ----> private MMIO unmapping in KVM
>> ----> cleanup private MMIO mapping in S-EPT and others
>> ----> signal MMIO dmabuf can be safely removed.
>> ^TVM teardown path (dmabuf uninstall path)
>> checks this state and waits before it can decrease the
>> dmabuf fd refcount
This extra signaling is not needed on AMD SEV though:
1) VFIO will destroy this dmabuf on teardown (and it won't care about
   its RMP state) and
2) the CCP driver will clear RMPs for the device's resources.
The KVM mapping will die naturally when the KVM fd is closed.
>> ...
>> ----> KVM TVM fd put
>> ----> continue iommufd vdevice teardown.
>>
>> Also, I think we need:
>>
>> iommufd vdevice TSM bind
>> ---> tsm_tdi_bind
>> ----> intel_tsm_tdi_bind
>> ...
>> ----> KVM TVM fd get
>
> Indent problem, I mean KVM TVM fd is in tsm_tdi_bind(). I saw your code
> already has it there.
Yup, that's right.
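
For readers following the quoted flow, here is a minimal Python sketch of
the ordering it describes: bind takes a reference on the KVM TVM fd, and
unbind (also reached from the fd closing path) unmaps the private MMIO,
signals that the dmabuf may be removed, and only then drops the KVM
reference. This is a toy model, not kernel code; every class, method,
and field name below (KvmVm, Dmabuf, VDevice, safe_to_remove, ...) is
hypothetical and made up for illustration.

```python
class KvmVm:
    """Toy stand-in for the KVM TVM fd (hypothetical, not a kernel API)."""

    def __init__(self):
        self.refs = 1               # reference held by the VM fd itself
        self.private_mmio = set()   # ids of dmabufs mapped as private MMIO

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1


class Dmabuf:
    """Toy private MMIO dmabuf; the flag models the 'signal' step."""

    def __init__(self, ident):
        self.ident = ident
        self.safe_to_remove = True  # nothing maps it as private yet


class VDevice:
    """Toy iommufd vdevice that owns the bind/unbind ordering."""

    def __init__(self, kvm, dmabuf):
        self.kvm = kvm
        self.dmabuf = dmabuf
        self.bound = False
        self.log = []

    def tsm_bind(self):
        self.kvm.get()                                # KVM TVM fd get
        self.kvm.private_mmio.add(self.dmabuf.ident)
        self.dmabuf.safe_to_remove = False
        self.bound = True
        self.log.append("bind")

    def tsm_unbind(self):
        if not self.bound:
            return
        self.kvm.private_mmio.discard(self.dmabuf.ident)
        self.log.append("unmap private MMIO")
        self.dmabuf.safe_to_remove = True             # signal the dmabuf
        self.log.append("signal dmabuf")
        self.kvm.put()                                # KVM TVM fd put
        self.log.append("kvm put")
        self.bound = False

    def destroy(self):
        """Models the fd closing path: unbind first, then finish teardown."""
        self.tsm_unbind()
        self.log.append("vdevice destroyed")


kvm, buf = KvmVm(), Dmabuf(1)
vdev = VDevice(kvm, buf)
vdev.tsm_bind()
vdev.destroy()
print(vdev.log)
```

The point of the sketch is only that the KVM reference outlives the
private mapping, so the teardown does not depend on which fd userspace
happens to close first.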
>
>> ...
>>
>> Z.
>>
>>>
>>>> So shooting down the DMABUF object (encrypted MMIO page) means
>>>> shooting down the S-EPT mapping and recovering the DMABUF object
>>>> means re-construct the non-encrypted MMIO mapping in the EPT after
>>>> the TDI is unbound.
>>>
>>> This is definitely QEMU's job to re-mmap MMIO to the userspace (as
>>> it does for non-trusted devices today) so later on nested page
>>> fault could fill the nested PTE. Thanks,
>>>
>>>
>>>>
>>>> Z.
>>>>
>>>>>>> What I really want is, one SW component to manage MMIO dmabuf,
>>>>>>> secure iommu & TSM bind/unbind. That makes it easier to
>>>>>>> coordinate these 3 operations, because these ops are
>>>>>>> interconnected according to
>>>>>>> secure firmware's requirement.
>>>>>>
>>>>>> This SW component is QEMU. It knows about FLRs and other config
>>>>>> space things, it can destroy all these IOMMUFD objects and talk
>>>>>> to VFIO too, I've tried, so far it is looking easier to manage.
>>>>>> Thanks,
>>>>>
>>>>> Yes, qemu should be sequencing this. The kernel only needs to
>>>>> enforce any rules required to keep the system from crashing.
>>>>>
>>>>> Jason
>>>>>
>>>>
>>>
>>
>
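
The split Jason describes (QEMU sequences, the kernel only enforces the
rules) can be illustrated with a small Python sketch. It assumes a
hypothetical per-device state object and a single request() entry point;
neither exists in the kernel under these names. Out-of-order requests
from userspace get a negative errno back instead of corrupting state.

```python
import errno


class TdiState:
    """Hypothetical per-device state the kernel side would track."""

    def __init__(self):
        self.bound = False
        self.private_dmabuf_mapped = False


def request(state, op):
    """Apply one userspace request; return 0 or a negative errno."""
    if op == "bind":
        if state.bound:
            return -errno.EBUSY          # already bound
        state.bound = True
        state.private_dmabuf_mapped = True
        return 0
    if op == "unbind":
        if not state.bound:
            return -errno.EINVAL         # nothing to unbind
        state.private_dmabuf_mapped = False
        state.bound = False
        return 0
    if op == "remove_dmabuf":
        if state.private_dmabuf_mapped:
            return -errno.EBUSY          # must unbind first
        return 0
    return -errno.EINVAL                 # unknown operation


state = TdiState()
# A well-behaved VMM sequences bind -> unbind -> remove_dmabuf.
print([request(state, op) for op in ("bind", "unbind", "remove_dmabuf")])
```

A VMM that gets the order wrong only sees an error code; the sequencing
policy itself stays in userspace.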
--
Alexey