[Intel-gfx] Deal with stolen memory in GVT-d (passthrough)

Thu May 25 13:19:33 UTC 2017

-----Original Message-----
From: Alex Williamson [mailto:alex.williamson at redhat.com] 
Sent: Wednesday, May 24, 2017 11:57 PM
To: Wang, Zhi A <zhi.a.wang at intel.com>
Cc: Dong, Chuanxiao <chuanxiao.dong at intel.com>; Daniel Vetter <daniel.vetter at ffwll.ch>; Zhang, Xiong Y <xiong.y.zhang at intel.com>; Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Chris Wilson <chris at chris-wilson.co.uk>; Lv, Zhiyuan <zhiyuan.lv at intel.com>; Zhenyu Wang <zhenyuw at linux.intel.com>; Tian, Kevin <kevin.tian at intel.com>
Subject: Re: Deal with stolen memory in GVT-d (passthrough)

On Wed, 24 May 2017 13:24:55 +0800
Zhi Wang <zhi.a.wang at intel.com> wrote:

> On 05/24/17 10:33, Alex Williamson wrote:
> > On Wed, 24 May 2017 09:10:23 +0800
> > Zhi Wang <zhi.a.wang at intel.com> wrote:
> >  
> >> On 05/24/17 01:01, Alex Williamson wrote:  
> >>> On Tue, 23 May 2017 17:14:53 +0800 Zhi Wang <zhi.a.wang at intel.com> 
> >>> wrote:
> >>>     
> >>>> Hi All:
> >>>>        We investigated the possible directions. First,
> >>>> Alex, do you wish us to support exposing stolen memory through
> >>>> RMRR in QEMU IOMMU emulation? Suppose this is a nested
> >>>> virtualization case: would the QEMU IOMMU emulation expose the
> >>>> stolen memory region through RMRR to the L2 guest?
> >>> Yes, if the guest has a vIOMMU then the stolen memory mapping to 
> >>> the device needs to be protected, via an RMRR if the vIOMMU is 
> >>> VT-d.  I don't see how nesting adds any additional complication 
> >>> beyond regular IOMMU support in the guest.
> >>>     
> >>>> For exposing stolen memory in the L1 guest, Xiong and I found several open issues:
> >>>>
> >>>> As we are going to implement RMRR as a special VFIO region,
> >>>> we suppose QEMU would obtain the stolen memory region information via VFIO ioctls.
> >>>> One problem is that the memory layout is currently initialized
> >>>> earlier than vfio device realization. After the memory layout is
> >>>> fixed in machine initialization, if the host stolen memory
> >>>> base (H-GSM BASE) later falls in the guest RAM region, we get an
> >>>> overlap problem.
> >>>>
> >>>>    From our point of view, what we can do are:
> >>>>
> >>>> choice a) Adjust the memory layout in vfio_realize(). But it 
> >>>> would be complicated and buggy as no one has had this requirement before.
> >>> I don't think QEMU would be in favor of devices manipulating the 
> >>> machine layout.
> >>>     
> >>>> choice b) Query the vfio device stolen memory region in
> >>>> vfio_instance_post_init() and reserve the GSM BASE at this time.
> >>>> choice c) Add a new command line option to "vfio-pci" device, 
> >>>> user can specify the GSM BASE (from kernel VFIO driver) and QEMU 
> >>>> reserves it in vfio_instance_post_init().
> >>> c) seems like a constant source of user confusion.  Where is the 
> >>> user going to learn about the stolen memory base address in order 
> >>> to properly configure their VM?
> >>>
> >>> b) also doesn't seem particularly viable since we only understand 
> >>> the device we're working with after opening the device, which we 
> >>> cannot do without a lot of setup, which is not done by this point.
> >> For b) If we are going to walk this way, I suppose
> >>                * we will open the vfio device in 
> >> instance_post_init() then close it
> >>                * or we move some code from vfio_realize() into 
> >> instance_post_init().
> > I don't think this is going to work, it's abusing the entire QEMU 
> > device infrastructure to meddle with the machine layout for a device 
> > with ill conceived requirements.
> That's also my concern. :( I just put this option on the table. :P
> >>> Choice c) seems like the least bad option (this is why not being 
> >>> "just a PCI device" is so hard to deal with), but this should 
> >>> really be discussed on qemu-devel, maybe there are better ideas 
> >>> there.  Thanks,
> >> For c) my idea is that vfio can expose some region info in sysfs, so
> >> the user would know how to fill in the stolen memory information, and
> >> we can check that in vfio_realize().
> > vfio_realize()?  If we could wait until then we wouldn't need sysfs.
> > vfio devices have no representation in sysfs nor am I particularly 
> > fond of creating one.  Thanks,
> Sorry for the confusion. I mean the user can pass the GSM base and size
> through the QEMU command line, but we still need to check in
> vfio_realize() that the user's configuration matches the stolen memory
> region reported by VFIO.
> 
> As for how the user gets the GSM base and size, it's just a rough idea.
> There should be many possible ways; for example, the user could read it
> from the PCI device node in sysfs via a script. Let's see if we can find
> a graceful way to do that.
> 
> Uh. Literally, QEMU just needs to check that the user passes the right
> GSM base and size, right?  No matter how they get it. :)

[Alex] That's true, we can validate it against the device once we get to that point, though it becomes difficult to justify to users why they need to go figure it out themselves given that incongruity.  Are there any constraints on the host system for where stolen memory is placed?  IIRC it's a 1MB aligned address; is it a 32-bit mapping or can it be 64-bit?
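
The validation step mentioned above could look roughly like the sketch below. It is only an illustration: the function name, its parameters and the error strings are hypothetical, not QEMU code, and the 1MB alignment check is taken from the "IIRC" remark rather than from a confirmed requirement.

```python
MIB = 1 << 20  # 1MB alignment granule (assumed, per the remark above)


def validate_gsm(user_base, user_size, dev_base, dev_size):
    """Compare user-supplied GSM base/size with what the device reports.

    Hypothetical helper: returns a list of problems, empty when the
    user-supplied values are acceptable.
    """
    problems = []
    if user_base % MIB != 0:
        problems.append("base is not 1MB aligned")
    if user_base != dev_base:
        problems.append("base does not match device")
    if user_size != dev_size:
        problems.append("size does not match device")
    return problems


# Example with made-up values: the user passes the wrong base.
for problem in validate_gsm(0x7B000000, 32 * MIB, 0x7E000000, 32 * MIB):
    print("invalid GSM configuration:", problem)
```

Returning every mismatch at once, instead of failing on the first, would let the error message tell the user exactly which option to correct.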

[Zhi] According to this pdf:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/desktop-6th-gen-core-family-datasheet-vol-2.pdf

	4.24 Base Data of Stolen Memory (BDSM)-Offset 5Ch

	This register contains the base address of graphics data stolen DRAM memory. BIOS
	determines the base of graphics data stolen memory by subtracting the graphics data
	stolen memory size (PCI Device 0 offset 52 bits 7:4) from TOLUD (PCI Device 0 offset
	BC bits 31:20).

Stolen memory always stays below TOLUD (Top of Low Usable DRAM) [1], so its base must be a 32-bit PA.
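
The subtraction described in the datasheet excerpt can be written out as a small sketch. The function and constant names are made up for illustration, the stolen memory size is taken as a byte count rather than decoded from the register field, and the example register value is invented.

```python
MIB = 1 << 20
TOLUD_ADDR_MASK = 0xFFF00000  # address bits 31:20, as quoted above


def stolen_base(tolud_reg, stolen_size):
    """Return the stolen memory base: TOLUD minus the stolen memory size.

    tolud_reg   -- raw 32-bit TOLUD register value (low bits are not address)
    stolen_size -- graphics data stolen memory size in bytes
    """
    tolud = tolud_reg & TOLUD_ADDR_MASK
    if stolen_size % MIB != 0 or stolen_size > tolud:
        raise ValueError("stolen size must be a 1MB multiple no larger than TOLUD")
    return tolud - stolen_size


# Invented example: TOLUD at 2GB with a low flag bit set, 32MB stolen.
base = stolen_base(0x80000001, 32 * MIB)
print(hex(base))  # 0x7e000000
```

Because the result is always below TOLUD, and TOLUD itself is a 32-bit value, the base fits in 32 bits.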

[Alex] If there are any constraints that would make it relatively compatible with a QEMU VM already, we might have the option of simply marking the range reserved and ignoring that we're wasting that VM memory. Wasteful, but perhaps easier than changing the VM memory map.

[Zhi] Yes, that's a great option. :) I put some detailed information in approach B below.

[Alex] We also have the option of using the IOMMU to map VM memory to the stolen memory IOVA such that the VM has their own stolen memory space.

[Zhi] That's a good point. But some function blocks in GEN, like the GuC, do not honor the IOMMU. So we can do that, but there might be naughty HW breaking our magic.

Yeah, we can leave the host stolen memory to those naughty function blocks in GEN and use the IOMMU to map VM memory at IOVA = HGSM BASE. Does that look better?

Function blocks that honor the IOMMU can still use the VM-dedicated stolen memory, since IOVA = GSM BASE has been mapped to the VM stolen memory by the IOMMU. Then we don't need to care about changing the VM memory layout. Function blocks that don't honor the IOMMU just access host stolen memory directly. :(

The guest still has a chance to sniff the information in host stolen memory, but only if it knows how to manipulate the HW function blocks that don't honor the IOMMU.

So it looks like the isolation is still not perfect.

[Alex] Again, this is perhaps viewed as wasteful, but I can only presume that stolen memory is not cleared on IGD FLR, so there might be a security advantage to avoid granting the user access to the host stolen memory. Otherwise vfio would likely need to explicitly memset stolen memory when opening and releasing the device.

[Zhi] I totally agree that stolen memory should be memset to zero. :P For example, if one guest uses part of the stolen memory as a framebuffer and suddenly shuts down, another guest that boots afterwards could sniff the framebuffer, save it into a picture file, and learn the screen content of the previous guest. One more concern: I remember that some SW/HW relies on data (from the BIOS) in stolen memory for configuration. If that is the case, we could clear the stolen memory selectively. Would that be an option? :)
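
To make "clear selectively" concrete, here is a minimal sketch that zeroes a buffer except for a list of preserved ranges. The function name and the (offset, length) representation are assumptions for illustration; which ranges, if any, actually hold BIOS data that must survive is exactly the open question above.

```python
def scrub_stolen(buf, keep=()):
    """Zero `buf` (a bytearray) in place, except the (offset, length) ranges in `keep`."""
    size = len(buf)
    ranges = sorted(keep)
    # Validate everything first so a bad range leaves the buffer untouched.
    for off, length in ranges:
        if off < 0 or length < 0 or off + length > size:
            raise ValueError("keep range outside stolen memory")
    pos = 0
    for off, length in ranges:
        if off > pos:
            buf[pos:off] = bytes(off - pos)  # zero the gap before this range
        pos = max(pos, off + length)         # also handles overlapping ranges
    buf[pos:] = bytes(size - pos)            # zero the tail


# Made-up example: 16 bytes of "stolen memory", preserving bytes 4..7.
mem = bytearray(b"\xff" * 16)
scrub_stolen(mem, keep=[(4, 4)])
```

With an empty `keep` list this degenerates to a plain memset of the whole region.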

[Alex] Also be aware that reading more than the first 64 bytes of config space on a device requires privileges, so if QEMU starts poking around in sysfs to learn the stolen memory size and location, that means that libvirt would need to grant QEMU sufficient privileges to do that. Thanks,

[Zhi] Yes, I thought about that before. If an ordinary VFIO pass-through doesn't need higher privileges but IGD pass-through does, that is a deployment burden.

It looks like our steps are:

1) Allocate VM memory at GPA = HGSM BASE for guest stolen memory. This is mostly so that the guest can populate its GPU page table with the same IOVA = GPA mapping, since we have to keep host GSM BASE = guest GSM BASE. (But I think we can grant the smallest amount of stolen memory :P)

In hw/vfio/pci-quirks.c:

Approach a: 

Suppose the guest won't directly read or write much data in stolen memory, since it has been marked as E820_RESERVED. Then maybe we can allocate a new MemoryRegion with memory_region_init_io() and add it with memory_region_add_subregion_overlap() in the IGD quirk.

Approach b:

If we want to avoid the trap above:
	- Case A: [HGSM BASE, HGSM BASE + HGSM SIZE) falls fully into guest RAM.
		We reserve that portion in E820.
	- Case B: [HGSM BASE, HGSM BASE + HGSM SIZE) falls partially into guest RAM.
		We allocate the missing amount of guest RAM and link it after the end of RAM below 4G.
	- Case C: [HGSM BASE, HGSM BASE + HGSM SIZE) falls into a non-guest-RAM range.
		We allocate a new portion of guest RAM.

Or, no matter which case applies, we just allocate new RAM as a new MemoryRegion, add it into the system memory space, and give it a higher priority than system RAM.
Then we use e820_add_entry() to reserve that guest RAM as stolen memory.
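
The three cases of approach b can be sketched as a small classifier. This is an illustration under a simplifying assumption that guest RAM below 4G is one contiguous range [0, ram_below_4g); the function name and return format are hypothetical and not part of QEMU.

```python
MIB = 1 << 20


def classify_hgsm(hgsm_base, hgsm_size, ram_below_4g):
    """Return (case, extra_bytes) for the stolen range [base, base + size).

    case        -- "A" fully inside guest RAM, "B" partially, "C" outside
    extra_bytes -- how much additional guest RAM would have to be allocated
    """
    hgsm_end = hgsm_base + hgsm_size
    if hgsm_end <= ram_below_4g:
        return "A", 0                        # just reserve it in E820
    if hgsm_base < ram_below_4g:
        return "B", hgsm_end - ram_below_4g  # allocate only the missing tail
    return "C", hgsm_size                    # allocate the whole range


# Made-up layouts: 32MB of stolen memory at 0x7E000000.
for ram_top in (0x80000000, 0x7F000000, 0x40000000):
    print(hex(ram_top), classify_hgsm(0x7E000000, 32 * MIB, ram_top))
```

In every case the range would still end up reserved in E820; the cases differ only in how much new guest RAM has to back it.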

2) Map IOVA = HGSM BASE in the IOMMU as an identity mapping (IOVA = GPA, backed by the VM-dedicated stolen memory above) for those functions that honor the IOMMU. We might copy some configuration from host stolen memory if necessary.

3) For those HW functions that don't honor the IOMMU, we check whether there is any security vulnerability.

4) Memset host stolen memory when opening and releasing the VFIO device. (HW functions that don't honor the IOMMU might leak some information here. Granting the smallest amount of stolen memory also costs the least time here.)

Feel free to let me know your ideas and concerns.

Thanks,
Zhi.

[1] Refer to Section 3.37 for an introduction to TOLUD.