[Spice-devel] RFC: Virtual Media Controller (VMC) concept: enhancing IP telephony systems in Spice VDI

Tue Oct 15 12:09:37 CEST 2013

----- Original Message -----
> Hello Spice developers,
> 
> I want to introduce my idea of a Virtual Media Controller (VMC) that
> enhances support for IP telephony in Spice-based VDI. I hope for your
> feedback!
> This is a concept only: it is very high level, and no proof of
> concept has been implemented yet. The concept is divided into 3
> levels: VMC API, VMC Advanced and VMC Ultimate. The number of unknowns
> grows with each level. VMC API seems fairly straightforward and
> doable; the others are riskier and already have known open issues that
> call their feasibility into question (and probably many more that I'm
> not aware of yet).
> 
> *Why is VMC needed?*
> The main problem with IP telephony software running in VDI is media
> stream hairpinning at the VDI server. Suppose Alice and Bob work in
> office O2 and connect to their virtual desktops on a VDI server in the
> main office O1. They use softphones running in the VMs to make a p2p
> audio call (video behaves similarly and would only complicate this
> example). Let's look at the route of Alice's outgoing audio stream.
> Audio is captured from the microphone at her PC/thin client, encoded
> by the Spice client and sent over a Spice channel to the VM, where it
> is decoded to PCM and presented as the source for a virtual
> microphone. The softphone then encodes it once again and sends it to
> its peer, Bob's softphone running in another VM on the same VDI
> server. There the stream is decoded once again and played into a
> virtual speaker, whose output is in turn encoded and sent over a Spice
> channel to the Spice client at Bob's PC, where it is decoded and
> finally played out in a real headset.
> 
> We can see 2 major issues with this scheme:
> 1. The media stream travels via the VDI server, not p2p. So even when
> 2 people in office O2 make a call, the traffic goes through the VDI
> server in office O1. This adds delay to the conversation, potentially
> increases jitter and packet loss (depending on the network), and
> creates extra network load.
> 2. The media stream is transcoded (decoded and then encoded) at the
> VDI server twice (4 times if we count Bob's stream!). This means extra
> CPU usage on the VDI server, effectively reducing VM density. It also
> degrades quality if lossy codecs are used.
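
To make the count in issue 2 concrete, here is a minimal Python sketch that models the route of Alice's stream as described above and counts the decode-then-encode pairs on the server. The hop names and the helper function are made up for illustration; they are not part of Spice.

```python
# Each hop is (where it happens, what is done) for Alice's outgoing audio.
# Hop names are illustrative labels, not Spice identifiers.
HAIRPIN_PATH = [
    ("alice_client", "encode"),  # Spice client encodes the microphone capture
    ("vdi_server", "decode"),    # decoded to PCM for the virtual microphone
    ("vdi_server", "encode"),    # Alice's softphone encodes for the call
    ("vdi_server", "decode"),    # Bob's softphone decodes
    ("vdi_server", "encode"),    # virtual speaker output encoded for Spice
    ("bob_client", "decode"),    # Bob's Spice client decodes and plays out
]

# The route once the conversation is p2p between the two clients.
P2P_PATH = [
    ("alice_client", "encode"),
    ("bob_client", "decode"),
]


def server_transcodes(path):
    """Count decode-then-encode pairs performed on the VDI server."""
    ops = [op for where, op in path if where == "vdi_server"]
    return sum(1 for a, b in zip(ops, ops[1:]) if (a, b) == ("decode", "encode"))


one_way = server_transcodes(HAIRPIN_PATH)   # Alice -> Bob only
both_ways = 2 * one_way                     # Bob's stream takes the mirrored route
p2p = server_transcodes(P2P_PATH)
```

With the hairpinned route the server transcodes twice per direction; with the p2p route it does no codec work at all.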
> 
> *VMC Solution*
> The most adequate solution to both issues is to make the conversation
> p2p and remove the VDI server from the route entirely. So the only
> question is how to actually do it.
> The first part of the VMC idea is to introduce a media engine on the
> client side, and an API for softphone developers to control this
> engine. We may think about the following components:
> 1. VMC Agent - a component providing a media-handling API for
> applications running in the virtual machine. [It would probably need
> to work through the Spice agent, or by similar means such as adding a
> new virtual device to qemu]
> 2. VMC Engine - a media engine running on the user's client/PC. It
> does the actual media handling and is controlled by commands from the
> VMC Agent.
> 3. VMC Transport - a "component" implementing the connection between
> the agent and the engine. The actual design is TBD. This is some sort
> of RPC over the Spice connection.
> 4. VMC OverlayRenderer - this advanced component is needed for video
> support only. It integrates local video rendering into the virtual
> session window.
>
> Softphone developers would need to use the VMC Agent API as the media
> engine for their application - so changes in the softphone are
> required.

Although I don't know Telepathy in detail, it looks like what you describe, except that the audio/video stream is proxied and decoded directly in the client. Is that correct?

1. agent: Telepathy session & API
2. engine: GStreamer
3. transport: TBD (RTP?), dedicated Spice channel
4. overlay renderer: internal to the Spice client
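
Since the transport is described only as "some sort of RPC over Spice connection", here is a rough Python sketch of what the agent-to-engine commands might look like. Everything in it is an assumption made for illustration: the length-prefixed JSON framing, the method names `start_stream` / `stop_stream`, the parameter names, and the `EngineStub` class. None of this is an existing Spice or Telepathy API.

```python
import json
import struct


def encode_command(call_id, method, params):
    """Frame one agent -> engine command as a length-prefixed JSON message."""
    data = json.dumps({"id": call_id, "method": method, "params": params}).encode("utf-8")
    return struct.pack("!I", len(data)) + data


def decode_commands(buffer):
    """Split received bytes into complete commands; return (commands, leftover)."""
    commands = []
    while len(buffer) >= 4:
        (size,) = struct.unpack("!I", buffer[:4])
        if len(buffer) < 4 + size:
            break  # the rest of this frame has not arrived yet
        commands.append(json.loads(buffer[4:4 + size].decode("utf-8")))
        buffer = buffer[4 + size:]
    return commands, buffer


class EngineStub:
    """Stand-in for the VMC Engine on the client; it only records stream state."""

    def __init__(self):
        self.streams = {}

    def dispatch(self, command):
        method, params = command["method"], command["params"]
        if method == "start_stream":  # hypothetical method name
            self.streams[params["stream"]] = params
            return {"id": command["id"], "ok": True}
        if method == "stop_stream":   # hypothetical method name
            removed = self.streams.pop(params["stream"], None)
            return {"id": command["id"], "ok": removed is not None}
        return {"id": command["id"], "ok": False, "error": "unknown method"}
```

A real engine would drive GStreamer (or a similar media engine) from `dispatch` instead of updating a dictionary. The framing is only there so that commands survive being split across channel reads.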

> *VMC Advanced*
> A more general problem can be posed: make arbitrary softphones
> running in VMs work without VDI hairpinning. Arbitrary means without
> code changes in these softphones. Solving this problem adds much more
> value to Spice VDI, as any third-party application, including
> commercial ones like Skype, would be supported. Skype may be a bad
> example... But modern enterprise SIP or H.323-based softphones may be
> a good one (MS Lync to name one).
> 
> But first of all, let's look into an interesting yet mostly
> non-commercial case - a Linux VM. In this case there is a chance of
> implementing the Agent API so that it follows the APIs of widespread
> media engines - GStreamer, VLC (what else?). This way we'd be able to
> support arbitrary media apps based on these engines.

Yes, we discussed this for video passthrough. Having a GStreamer passthrough would be quite awesome, although unfortunately limited to very few use cases, since most of the time the decoded video is post-processed, and there are relatively few GStreamer apps among all the media apps. Also, you have issues on the client side, like codec support (which can be dismissed by saying that Spice doesn't ship the problematic codecs itself, but then the story is not fun for Windows and Mac users).

> [Notes:
> 1. If we add Google WebRTC media engine bindings, softphone developers
> who use this API should be able to add support for our system fairly
> straightforwardly.
> 2. GStreamer, VLC, WebRTC are cross-platform, so implementing their
> API may help with enabling support of some softphones at Windows VMs
> as well]
> 
> One major part that needs additional work is signaling. The issue is
> as follows: when a communication channel is established between 2
> parties, they exchange their IP addresses in the signaling messages.
> A softphone in a VM will advertise its virtual IP address in this
> situation - but we need the client to be the receiving end, so the
> client's IP address must appear in the softphone's message. And we
> want our solution to be as signaling-protocol agnostic as possible,
> i.e. parsing and rewriting IP addresses in signaling messages isn't an
> option (and signaling traffic is usually protected by TLS anyway).
> Dealing with this is a big open question (one for a networking guru!).
> I'd love any comments / possible solutions for this!
> 
> How I see this problem:
> 1. On the VDI server with real address IP1 there is a VM with some
> address IP2 (NAT or not - not specified)
> 2. In the VM, an application is running (the softphone)
> 3. The user connects via a Spice client, from a client/PC with address
> IP3
> 4. We need to trick the softphone into thinking it is running on the
> machine with IP3
> 5. The softphone signaling should otherwise continue to work normally
> 6. All other applications in the VM should continue to work normally
> using address IP2
> Variables to play with: NAT, virtual network driver, configuration of
> the softphone (we can expect it to use particular ports for media)
> 
> The only idea of mine that seems implementable works only for
> softphones that support ICE (NAT traversal), and only when the NAT is
> STUN-traversable (i.e. not symmetric). The workflow:
> 1. Once the Spice client connects, it also establishes a connection to
> a fake STUN server and registers the translation IP2->IP3 with it.
> 2. Once the softphone attempts to make or receive a call, it asks the
> STUN server for a candidate IP address - and receives IP3.
> The fake STUN server implementation is TBD.
> 
> Weak spots:
> a) What about other applications relying on this 'STUN' server? They
> probably won't work.
> b) What if there is no NAT at all? The softphone will detect P2P
> connectivity and probably won't use ICE...
> c) What about softphones that do not support ICE?
> I also thought about a solution involving custom virtual network
> drivers, but it seems impossible to separate the behavior of the
> softphone from that of the rest of the system at this low level...

It looks like you have thought about this a lot. You should start documenting this on the spice-space wiki: http://www.spice-space.org/page/PlannedFeatures

Also, since this feature is quite specific to the VoIP domain, I think it is best to ask VoIP people about the tricks you can do, on the Telepathy or Ekiga mailing lists for example.
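
For discussion with them, the fake STUN workflow quoted above could be sketched roughly as follows. This is a minimal Python sketch under several assumptions: IPv4 only, no sockets, no attributes other than XOR-MAPPED-ADDRESS, and the port passed through unchanged (the proposal only talks about the IP2->IP3 translation). The `register_translation` table and all function names are hypothetical; only the STUN message layout follows RFC 5389.

```python
import ipaddress
import struct

MAGIC_COOKIE = 0x2112A442           # fixed value from RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_XOR_MAPPED_ADDRESS = 0x0020

# Hypothetical table filled in when the Spice client connects:
# VM address (IP2) -> client address (IP3).
translations = {}


def register_translation(vm_ip, client_ip):
    translations[vm_ip] = client_ip


def build_binding_response(request, source_ip, source_port):
    """Answer a Binding request, advertising IP3 instead of the real source."""
    if len(request) < 20:
        raise ValueError("truncated STUN header")
    msg_type, _length, cookie = struct.unpack("!HHI", request[:8])
    if msg_type != BINDING_REQUEST or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding request")
    transaction_id = request[8:20]
    # Unknown sources get an honest answer, like an ordinary STUN server.
    advertised_ip = translations.get(source_ip, source_ip)
    x_port = source_port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(advertised_ip)) ^ MAGIC_COOKIE
    attr = struct.pack("!HHBBHI", ATTR_XOR_MAPPED_ADDRESS, 8, 0, 0x01, x_port, x_addr)
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


def parse_mapped_address(response):
    """Decode the address an ICE-capable softphone would read from the reply.

    Assumes XOR-MAPPED-ADDRESS is the first attribute, as produced above.
    """
    _family, x_port, x_addr = struct.unpack("!xBHI", response[24:32])
    ip = str(ipaddress.IPv4Address(x_addr ^ MAGIC_COOKIE))
    return ip, x_port ^ (MAGIC_COOKIE >> 16)
```

This only shows the address substitution itself. It does not address the weak spots listed above (other applications using the same server, no NAT at all, softphones without ICE).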

> *VMC Ultimate*
> This is the last step forward: covering softphones that aren't based
> on a common media engine (things like Skype can probably be covered).
> The main idea is to detect what media action a softphone is performing
> and actually perform it in the remote VMC Engine. The approach assumes
> the signaling issue is solved somehow. Amended virtual audio, video
> and network devices/drivers are needed. For example, let's look at the
> incoming audio call scenario in which the user picks up the call. The
> VMC Engine at the client detects the presence of an incoming audio
> stream, parses out the codec (pcap?) and makes the virtual network
> send a sample stream to the softphone. Once the softphone starts
> playing out, the virtual audio device detects the decoded sample in
> the output. This is the signal for the VMC Engine to start playing out
> the incoming audio stream. And once the softphone starts reading a
> sample from the virtual microphone and sending an encoded audio stream
> out, this can be detected at the virtual network level, and the VMC
> Engine would start the actual outgoing audio stream.
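
To check my understanding of the incoming call scenario above, here it is as a small event-driven sketch in Python. The event names, the `CallRelay` class and the action strings are invented for illustration. How each event would actually be detected (virtual network device, virtual audio device) is exactly the open part of the proposal and is not modelled here.

```python
from enum import Enum, auto


class Event(Enum):
    INCOMING_STREAM_SEEN = auto()   # engine at the client sees incoming audio
    PLACEHOLDER_PLAYED = auto()     # virtual audio device sees the decoded sample
    OUTGOING_STREAM_SEEN = auto()   # virtual network sees encoded audio from the softphone


class CallRelay:
    """Tracks which real media actions the VMC Engine should have started."""

    def __init__(self):
        self.feeding_placeholder = False
        self.playing = False
        self.sending = False
        self.actions = []

    def handle(self, event):
        if event is Event.INCOMING_STREAM_SEEN and not self.feeding_placeholder:
            self.feeding_placeholder = True
            self.actions.append("send sample stream to softphone")
        elif (event is Event.PLACEHOLDER_PLAYED
              and self.feeding_placeholder and not self.playing):
            # The softphone is playing the sample, so the user picked up.
            self.playing = True
            self.actions.append("start playout of real incoming stream at client")
        elif event is Event.OUTGOING_STREAM_SEEN and not self.sending:
            self.sending = True
            self.actions.append("start real outgoing stream at client")
        return list(self.actions)
```

Repeated or out-of-order events are ignored, so a playout detection that arrives before any incoming stream was seen does not start anything.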
> 
> The same can be done for video: we need to detect the presence of a
> decoded video stream sample and notify the VMC Engine to start
> rendering, and the stream should be rendered over the appropriate
> window using the VMC OverlayRenderer. Samples need to be really simple
> (e.g. a black picture with a label as the video stream), so that they
> can be easily generated (or even pre-loaded) and detected, and so that
> processing them takes as little server CPU as possible (this holds not
> for all codecs - only for those whose encoding/decoding complexity
> depends on the incoming data parameters; for adaptive codecs like
> H.264 SVC, the samples can be smaller than the actual video stream
> decoded in the VMC Engine...)
> 
> Does this make any sense? I was inspired by the current Spice
> detection of video streams... It probably looks too complicated and
> risky.
> 
> An obvious hard case for all these VMC schemes is encrypted media
> streams (usually SRTP).

Indeed, that's what I was going to ask ;)