[Spice-devel] RFC: Virtual Media Controller (VMC) concept: enhancing IP telephony systems in Spice VDI

Mon Oct 14 23:18:33 CEST 2013

Hello Spice developers,

I would like to introduce my idea of a Virtual Media Controller (VMC),
which enhances support for IP telephony in Spice-based VDI. I hope for
your feedback!
This is only a concept: very high level, with no proof of concept
implemented yet. It is divided into 3 levels: VMC API, VMC Advanced
and VMC Ultimate. The number of unknowns grows with each level. VMC
API seems fairly straightforward and doable; the others are riskier
and already have known open issues that call into question whether
they can be implemented at all (and probably many more that I'm not
aware of yet).

*Why is VMC needed?*
The main problem with IP telephony software running in VDI is media
stream hairpinning at the VDI server. Suppose Alice and Bob work in
office O2 and connect to their virtual desktops on a VDI server in the
main office O1. They use softphones running in their VMs to make a
p2p audio call (video behaves similarly and would only complicate
this example). Let's follow the route of Alice's outgoing audio
stream. Audio is captured from the microphone at her PC/thin client,
encoded by the Spice client and sent over the Spice channel to the VM,
where it is decoded to PCM and presented as the source for the virtual
microphone. The softphone then encodes it once again and sends it to
its peer, Bob's softphone, running in another VM on the same VDI
server. There the stream is decoded once again and played into the
virtual speaker, which in turn encodes it and sends it over the Spice
channel to the Spice client at Bob's PC, where it is decoded and
finally played out in a real headset.

We can see 2 major issues with this scheme:
1. The media stream travels via the VDI server, not p2p. So even if 2
people in office O2 are making a call, the traffic goes through the
VDI server at office O1. This adds delay to the conversation,
potentially increases jitter and packet loss (depending on the
network), and puts extra load on the network.
2. The media stream is transcoded (decoded and then encoded) at the
VDI server twice (4 times if we count Bob's stream as well!). This
means extra CPU usage on the VDI server, effectively reducing VM
density. It also degrades quality if lossy codecs are used.
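
To make the counting in the two issues above concrete, here is a
minimal Python sketch of Alice's outgoing audio path. The Stage type,
the stage labels and the server_transcodes helper are illustrative
assumptions of this sketch, not anything that exists in Spice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    where: str  # "client" or "server" (hypothetical labels)
    op: str     # "encode" or "decode"
    who: str    # which component performs the step


# One direction (Alice -> Bob) of the hairpinned call described above.
HAIRPINNED = [
    Stage("client", "encode", "Spice client at Alice's PC"),
    Stage("server", "decode", "Alice's VM, virtual microphone"),
    Stage("server", "encode", "Alice's softphone"),
    Stage("server", "decode", "Bob's softphone"),
    Stage("server", "encode", "Bob's VM, virtual speaker"),
    Stage("client", "decode", "Spice client at Bob's PC"),
]

# The same direction if media flowed p2p between client-side engines.
P2P = [
    Stage("client", "encode", "media engine at Alice's PC"),
    Stage("client", "decode", "media engine at Bob's PC"),
]


def server_transcodes(path):
    """Count decode-then-encode pairs performed on the VDI server."""
    count = 0
    for prev, cur in zip(path, path[1:]):
        on_server = prev.where == "server" and cur.where == "server"
        if on_server and prev.op == "decode" and cur.op == "encode":
            count += 1
    return count


one_way = server_transcodes(HAIRPINNED)
print(one_way, 2 * one_way, server_transcodes(P2P))
```

The three printed numbers are the server-side transcodes for one
direction, for both directions, and for the p2p path.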

*VMC Solution*
The most adequate solution to both issues is to make the conversation
p2p and remove the VDI server from the route entirely. So the only
question is how to actually do it.
The first part of the VMC idea is to introduce a media engine on the
client side, and an API for softphone developers to control this
engine. We may think about the following components:
1. VMC Agent - a component providing a media-handling API for
applications running in the virtual machine. [It would probably need
to work through the Spice agent, or via similar means - adding a new
virtual device to qemu]
2. VMC Engine - a media engine running on the user's client/PC. It
does the actual media handling and is controlled by commands from the
VMC Agent.
3. VMC Transport - a "component" implementing the connection between
the agent and the engine. The actual design is TBD. This is some sort
of RPC over the Spice connection.
4. VMC OverlayRenderer - this advanced component is needed for video
support only. It integrates local video rendering into the virtual
session window.

Softphone developers would need to use VMC Agent API as a media engine
for their application - so changes in the softphone are required.
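
A rough sketch of how the Agent, Engine and Transport could fit
together, purely to illustrate the split of responsibilities. All of
it is hypothetical: the class names, the open_stream/close_stream
methods, the JSON message layout and the in-process transport are
assumptions of this sketch, since the actual design is TBD.

```python
import json


class VmcEngine:
    """Client-side media engine (hypothetical): tracks requested streams."""

    def __init__(self):
        self.streams = {}
        self._next_id = 1

    def handle(self, request: bytes) -> bytes:
        # Decode one RPC request and return the encoded reply.
        msg = json.loads(request.decode("utf-8"))
        method, params = msg["method"], msg["params"]
        if method == "open_stream":
            stream_id = self._next_id
            self._next_id += 1
            self.streams[stream_id] = params  # real engine would start RTP here
            result = {"stream_id": stream_id}
        elif method == "close_stream":
            removed = self.streams.pop(params["stream_id"], None)
            result = {"closed": removed is not None}
        else:
            result = {"error": "unknown method"}
        return json.dumps(result).encode("utf-8")


class VmcAgent:
    """Guest-side API (hypothetical) that a softphone would call."""

    def __init__(self, transport):
        # transport: callable taking request bytes and returning reply bytes.
        self._send = transport

    def _call(self, method, **params):
        request = json.dumps({"method": method, "params": params})
        return json.loads(self._send(request.encode("utf-8")).decode("utf-8"))

    def open_stream(self, codec, remote_host, remote_port):
        reply = self._call("open_stream", codec=codec,
                           remote_host=remote_host, remote_port=remote_port)
        return reply["stream_id"]

    def close_stream(self, stream_id):
        return self._call("close_stream", stream_id=stream_id)["closed"]


engine = VmcEngine()
# In-process stand-in for the VMC Transport (RPC over the Spice connection).
agent = VmcAgent(transport=engine.handle)
sid = agent.open_stream("opus", "192.0.2.20", 40000)
print(sid, engine.streams[sid]["codec"], agent.close_stream(sid))
```

The point of passing the transport in as a plain callable is that the
softphone-facing API does not need to know whether commands travel
through the Spice agent or through a new virtual device.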

*VMC Advanced*
A more general problem can be posed: make arbitrary softphones running
in VMs work without VDI hairpinning. Arbitrary means without code
changes in these softphones. Solving this problem adds much more value
to Spice VDI, as any third-party application, including commercial
ones like Skype, would be supported. Skype may be a bad example... But
modern enterprise SIP or H.323-based softphones may be a good one (MS
Lync, to name one).

But first of all, let's look at an interesting yet mostly
non-commercial case - a Linux VM. For this case there is a chance of
implementing the Agent API so that it follows the APIs of widespread
media engines - GStreamer, VLC (what else?). This way we'd be able to
support arbitrary media apps based on these engines.
[Notes:
1. If we add Google WebRTC media engine bindings, softphone developers
who use this API should be able to add support for our system fairly
straightforwardly.
2. GStreamer, VLC, WebRTC are cross-platform, so implementing their
API may help with enabling support of some softphones at Windows VMs
as well]

One major part that needs additional work is signaling. The issue is
as follows: when a communication channel is established between 2
parties, they exchange their IP addresses in the signaling messages.
A softphone in a VM will advertise its virtual IP address in this
situation - but we need the client to be the receiving end, so we
need the client's IP address to appear in the softphone's message. And
we want our solution to be as signaling-protocol-agnostic as possible,
i.e. parsing and changing IP addresses in signaling messages isn't an
option (and signaling traffic is usually protected by TLS anyway).
Dealing with this is a big open question (one for a networking guru!).
I'd love any comments / possible solutions for this!
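
As an illustration of where the address shows up, take SIP as one
example protocol: there the media address is carried in an SDP body,
in the "c=" (connection) line, with the port in the "m=" line. The
sketch below only builds a minimal SDP offer to show what the softphone
advertises today versus what we would need it to advertise. The
function name and both addresses are made up for the example.

```python
def minimal_sdp_offer(media_ip: str, media_port: int) -> str:
    """Build a bare-bones SDP offer for one PCMU audio stream."""
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {media_ip}",       # origin line
        "s=-",
        f"c=IN IP4 {media_ip}",             # where the peer should send media
        "t=0 0",
        f"m=audio {media_port} RTP/AVP 0",  # payload type 0 is PCMU
    ]
    return "\r\n".join(lines) + "\r\n"


VM_ADDRESS = "10.0.0.5"          # stands in for the VM's virtual address
CLIENT_ADDRESS = "198.51.100.7"  # stands in for the user's client/PC

advertised_today = minimal_sdp_offer(VM_ADDRESS, 5004)
needed_for_p2p = minimal_sdp_offer(CLIENT_ADDRESS, 5004)
print(advertised_today)
print(needed_for_p2p)
```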

How I see this problem:
1. On a VDI server with real address IP1 there is a VM with some
address IP2 (NAT or not - not specified)
2. An application (the softphone) is running in the VM
3. The user connects via a Spice client, from a client/PC with address
IP3
4. We need to trick the softphone into thinking it is running on the
machine with IP3
5. The softphone signaling should continue to work normally otherwise
6. All other applications in the VM should continue to work normally
using address IP2
Variables to play with: NAT, the virtual network driver, configuration
of the softphone (we can expect it to use particular ports for media)

The only idea of mine that seems implementable works only for
softphones that support ICE (NAT traversal), and only when the NAT is
STUN-traversable (i.e. not symmetric). The workflow:
1. Once the Spice client connects, it also establishes a connection
with a fake STUN server and instructs it about the translation
IP2->IP3.
2. Once the softphone attempts to make or receive a call, it asks the
STUN server for a candidate IP address - and receives IP3.
The fake STUN server implementation is TBD.
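
A minimal sketch of the core of such a fake STUN server: answering a
Binding request with an XOR-MAPPED-ADDRESS that carries the translated
address instead of the observed one. The message layout (magic cookie,
message types, attribute type and XOR encoding) follows RFC 5389; the
translation table, the function names and the example addresses are
assumptions of this sketch. The port is passed through unchanged,
which is also an assumption - a real setup would have to report a port
that is actually open on the client.

```python
import ipaddress
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020
FAMILY_IPV4 = 0x01


def build_binding_response(request, source_ip, source_port, translation):
    """Answer a STUN Binding request, reporting translation[source_ip]."""
    if len(request) < 20:
        raise ValueError("too short for a STUN header")
    msg_type, _length, cookie = struct.unpack("!HHI", request[:8])
    if msg_type != BINDING_REQUEST or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding request")
    transaction_id = request[8:20]

    # The "fake" part: report IP3 when the request came from IP2.
    reported_ip = translation.get(source_ip, source_ip)

    x_port = source_port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(reported_ip)) ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, FAMILY_IPV4, x_port, x_addr)
    attr = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


def parse_mapped_address(response):
    """Decode the XOR-MAPPED-ADDRESS, assuming it is the first attribute."""
    _pad, _family, x_port, x_addr = struct.unpack("!BBHI", response[24:32])
    port = x_port ^ (MAGIC_COOKIE >> 16)
    ip = str(ipaddress.IPv4Address(x_addr ^ MAGIC_COOKIE))
    return ip, port


# IP2 -> IP3, as instructed by the Spice client when it connects.
translation = {"10.0.0.5": "198.51.100.7"}
request = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + bytes(range(12))
response = build_binding_response(request, "10.0.0.5", 5004, translation)
print(parse_mapped_address(response))
```

A source address with no entry in the table is reported as-is, so the
server would still behave like an ordinary STUN server for it.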

Weak spots:
a) What about other applications relying on this 'STUN' server? They
probably won't work.
b) What if there is no NAT at all? The softphone will detect P2P
connectivity and probably won't use ICE...
c) What about softphones which do not support ICE?
I also thought about a solution involving custom virtual network
drivers, but it seems impossible to split the behavior between the
softphone and the rest of the system at such a low level...

*VMC Ultimate*
This is the last step forward, to cover softphones which aren't based
on a common media engine (things like Skype can probably be covered).
The main idea is to detect what media action a softphone is performing
- and actually perform it at the remote VMC Engine. The approach
assumes the signaling issue is solved somehow. Amended virtual audio,
video and network devices/drivers are needed. For example, let's look
at an incoming audio call scenario where the user picks up the call.
The VMC Engine at the client detects the presence of an incoming audio
stream, parses out the codec (pcap?) and makes the virtual network
send a sample stream to the softphone. Once the softphone starts
playing out, the virtual audio device detects the decoded sample in
the output. This is the signal for the VMC Engine to start playing out
the incoming audio stream. And once the softphone starts reading a
sample from the virtual microphone and sending an encoded audio stream
out, this can be detected at the virtual network level, and the VMC
Engine would start the actual outgoing audio stream.
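
The playout-detection step could look roughly like the sketch below:
the virtual audio device compares each output frame against the known
placeholder sample and notifies the engine once. Everything here is
hypothetical (names, frame representation as lists of PCM integers,
the tolerance value). The tolerance is there because a sample that
went through a lossy codec will not come back bit-exact; whether a
simple per-sample comparison is robust enough is an open question.

```python
def matches_placeholder(frame, placeholder, tolerance=0):
    """True if frame equals the placeholder within +/- tolerance per sample."""
    if len(frame) != len(placeholder):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(frame, placeholder))


class PlayoutDetector:
    """Fires a callback the first time the placeholder is seen in the output."""

    def __init__(self, placeholder, tolerance, on_playout_started):
        self.placeholder = placeholder
        self.tolerance = tolerance
        self.on_playout_started = on_playout_started
        self.started = False

    def feed(self, frame):
        # Called by the (hypothetical) virtual audio device per output frame.
        if self.started:
            return
        if matches_placeholder(frame, self.placeholder, self.tolerance):
            self.started = True
            self.on_playout_started()


events = []
detector = PlayoutDetector(
    placeholder=[0, 100, -100, 0],
    tolerance=3,
    on_playout_started=lambda: events.append("start real playout"),
)
detector.feed([0, 0, 0, 0])        # silence: softphone not playing yet
detector.feed([1, 98, -101, 2])    # decoded placeholder, slightly distorted
detector.feed([1, 98, -101, 2])    # already started, no second notification
print(events)
```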

The same can be done for video - we need to detect the presence of a
decoded video stream sample and notify the VMC Engine to start
rendering - and the stream should be rendered over the appropriate
window using VMC OverlayRenderer. Samples need to be really simple
(e.g. a black picture with a label as a video stream) so that they can
be easily generated (or even pre-loaded) and detected, and processing
them should take as little server CPU as possible (this holds not for
all codecs - only for those where the complexity of encoding/decoding
depends on the incoming data; for adaptive codecs like H.264 SVC, the
samples can be smaller than the actual video stream decoded in the VMC
Engine...)

Does this make any sense? I was inspired by the current Spice
detection of video streams... It probably looks too complicated and
risky.

An obvious hard case for all these VMC schemes is encrypted media
streams (usually SRTP).

-- 
Best regards,
Fedor