[Spice-devel] RFC: Virtual Media Controller (VMC) concept: enhancing IP telephony systems in Spice VDI

Tue Oct 15 21:27:59 CEST 2013

----- Original Message -----
> Marc-André,
> 
> Thanks for the comments! I'll certainly follow your advice. About the wiki:
> how do I create an account there? I tried
> http://www.spice-space.org/wiki/index.php?title=Special:UserLogin&returnto=Main_Page

Same for me. spice-space.org was recently moved to a different server, which might explain it.

Anyway, you can use other tools (Google Docs); the result can later be referenced from or copied into the spice-space.org wiki.

> but it returns an empty page for me...
> 
> A few comments below.
> 
> 
> On Tue, Oct 15, 2013 at 2:09 PM, Marc-André Lureau <mlureau at redhat.com>
> wrote:
> >
> >
> > ----- Original Message -----
> >> Hello Spice developers,
> >>
> >> I want to introduce my idea of a Virtual Media Controller (VMC) that
> >> enhances support for IP telephony in Spice-based VDI. I hope for your
> >> feedback!
> >> This is a concept only, very high level and without any proof of
> >> concept implemented yet. The concept is divided into 3 levels: VMC
> >> API, VMC Advanced and VMC Ultimate. The number of unknowns grows with
> >> each level. VMC API seems fairly straightforward and doable; the
> >> others are riskier and already have known open issues that call their
> >> feasibility into question (and probably many more that I'm not aware
> >> of yet).
> >>
> >> *Why is VMC needed?*
> >> The main problem of IP telephony software running in VDI is media stream
> >> hairpinning at the VDI server. Suppose Alice and Bob work in office
> >> O2 and connect to their virtual desktops on a VDI server in the
> >> main office O1. They use softphones running in the VMs to make a
> >> p2p audio call (video behaves similarly and would only add unnecessary
> >> complication to this example). Let's look at the route of Alice's
> >> outgoing audio stream. Audio is captured from the microphone at her
> >> PC/thin client, then encoded by the Spice client and sent over the Spice
> >> channel to the VM, where it is decoded to PCM and presented as the source
> >> for a virtual microphone. The softphone then encodes it once again and
> >> sends it to its peer, Bob's softphone running in another VM on the same
> >> VDI server. There the stream is decoded once again and played into a
> >> virtual speaker; Spice in turn encodes it and sends it over the Spice
> >> channel to the Spice client on Bob's PC, where it is decoded and
> >> finally played out in a real headset.
> >>
> >> We can see 2 major issues with this scheme:
> >> 1. The media stream travels via the VDI server, not p2p. So even if 2
> >> people in office O2 are making a call, the traffic goes through the VDI
> >> server at office O1. This introduces extra delay into the
> >> conversation, potentially increases jitter and packet loss (depending
> >> on the network), and adds extra network load.
> >> 2. The media stream is transcoded (decoded and then encoded) at the VDI
> >> server twice (4 times if we count Bob's stream too!). This means extra
> >> CPU usage on the VDI server, effectively reducing VM density. It also
> >> means quality degradation if lossy codecs are used.
> >>
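
To make the count in issue 2 concrete, here is a minimal sketch that models Alice's outgoing audio path as a list of stages and counts the decode-then-encode pairs that run on the VDI server. The stage list and the function name are illustrative assumptions taken from the description above, not anything measured in Spice.

```python
# Hypothetical model of one direction of the call (Alice -> Bob).
# Each stage is (where it runs, operation); names are illustrative only.
ALICE_TO_BOB = [
    ("client", "encode"),  # Spice client encodes the microphone capture
    ("server", "decode"),  # Spice decodes to PCM for the virtual microphone
    ("server", "encode"),  # Alice's softphone encodes for the call
    ("server", "decode"),  # Bob's softphone decodes for the virtual speaker
    ("server", "encode"),  # Spice encodes playback for Bob's client
    ("client", "decode"),  # Bob's Spice client decodes for the headset
]


def server_transcodes(path):
    """Count decode-then-encode pairs that both run on the VDI server."""
    count = 0
    for (where_a, op_a), (where_b, op_b) in zip(path, path[1:]):
        on_server = where_a == "server" and where_b == "server"
        if on_server and (op_a, op_b) == ("decode", "encode"):
            count += 1
    return count


one_way = server_transcodes(ALICE_TO_BOB)
both_ways = 2 * one_way  # assumes Bob -> Alice mirrors the same path
print(one_way, both_ways)  # prints "2 4"
```

With a p2p path only the first and last stage would remain, so the server-side count drops to zero.
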
> >> *VMC Solution*
> >> The most adequate solution to both issues is to make the conversation
> >> p2p and remove the VDI server from the route entirely. So the only
> >> question is how to actually do it.
> >> The first part of the VMC idea is to introduce a media engine on the
> >> client side, and an API for softphone developers to control this engine.
> >> We may think about the following components:
> >> 1. VMC Agent - a component providing media-handling API for
> >> applications running at this virtual machine. [Probably would need to
> >> work through Spice agent, or via similar means - adding new virtual
> >> device to qemu]
> >> 2. VMC Engine - media engine running at user's client/PC. It provides
> >> actual media handling and is controlled with commands from VMC Agent.
> >> 3. VMC Transport - a "component" implementing connection between the
> >> agent and the engine. Actual design is TBD. This is some sort of RPC
> >> over Spice connection.
> >> 4. VMC OverlayRenderer - this advanced component is needed for video
> >> support only. It integrates local video rendering inside virtual
> >> session window.
> >>
> >> Softphone developers would need to use VMC Agent API as a media engine
> >> for their application - so changes in the softphone are required.
> >
> > Although I don't know Telepathy in detail, it looks like what you
> > describe, except that the audio/video stream is proxied and decoded
> > directly in the client. Is that correct?
> >
> 
> What do you mean by 'media stream is proxied'? We need P2P
> connections, avoiding any proxy servers (within an IP network, that
> is). So the stream is delivered directly and decoded in the client,
> totally bypassing the VDI server.

This is not always possible, so I suggest starting with proxying before doing p2p.

> > 1. agent: telepathy session & API
> 
> I have a generic agent in mind, not tied to particular softphone...
> What about D-Bus-based common API and GStreamer, VLC and Google WebRTC
> VoE&ViE wrappers as shared libraries? At start of softphone/media app,
> it just links to our .so instead of its normal media engine - and gets
> everything working, not even knowing that media is processed at the
> client...

I think Telepathy was supposed to be very generic (in fact, it was supposed to be just an interface spec, iirc), but given the complexity of the VoIP stack, that's just a dream. But feel free to propose something else; I was basically making an analogy.

> 
> > 2. engine: gstreamer
> > 3. tbd (rtp?), dedicated spice channel
> 
> Certainly not RTP, as this channel isn't for media transfer but for
> RPC: the agent calling functions of the remote media engine.
> 
> > 4. internal of spice client
> >
> Agree.
> 
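
To illustrate what "RPC over a Spice channel" could look like on the wire, here is a minimal sketch of length-prefixed JSON framing for agent-to-engine calls. Everything in it is an assumption for illustration: the framing, the field names and the `start_stream` / `stop_stream` method names are hypothetical, and nothing here is an existing Spice API.

```python
import json
import struct


def encode_call(call_id, method, params):
    """Frame one RPC call: 4-byte big-endian length, then a JSON body."""
    body = json.dumps(
        {"id": call_id, "method": method, "params": params}
    ).encode("utf-8")
    return struct.pack("!I", len(body)) + body


def decode_calls(buffer):
    """Return (complete messages, unconsumed remainder) from a byte buffer."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack("!I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # partial message, wait for more channel data
        messages.append(json.loads(buffer[4:4 + length].decode("utf-8")))
        buffer = buffer[4 + length:]
    return messages, buffer


# Hypothetical calls the agent might send to the engine.
wire = encode_call(1, "start_stream", {"codec": "opus", "peer": "192.0.2.10:5004"})
wire += encode_call(2, "stop_stream", {"stream": 1})
calls, rest = decode_calls(wire)
```

The point is only that the channel carries control messages; the media itself would never pass through it.
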
> >> *VMC Advanced*
> >> A more general problem can be posed: make arbitrary softphones running
> >> in VMs work without VDI hairpinning. "Arbitrary" means without code
> >> changes in these softphones. Solving this problem adds much more value
> >> to Spice VDI, as any third-party application, including commercial ones
> >> like Skype, would be supported. Skype may be a bad example... But modern
> >> enterprise SIP- or H.323-based softphones may be a good one (MS Lync, to
> >> name one).
> >>
> >> But first of all, let's look into an interesting yet mostly
> >> non-commercial case - Linux VM. For this case there is a chance of
> >> implementing Agent API to follow APIs of widespread media engines -
> >> GStreamer, VLC (what else?). This way we'd be able to support
> >> arbitrary media apps based on these engines.
> >
> > Yes, we discussed this for video passthrough. Having a GStreamer
> > passthrough would be quite awesome, although limited to very few use cases
> > unfortunately, since most of the time the decoded video is post-processed,
> > and there are relatively few GStreamer apps among all the media apps.
> > Also, you have issues on client side, like codec support (which can be
> > discarded by saying that spice doesn't ship the problematic codecs itself,
> > but then the story is not fun for windows and mac users).
> >
> >> [Notes:
> >> 1. If we add Google WebRTC media engine bindings, softphone developers
> >> who use this API should be able to add support for our system fairly
> >> straightforwardly.
> >> 2. GStreamer, VLC, WebRTC are cross-platform, so implementing their
> >> API may help with enabling support of some softphones at Windows VMs
> >> as well]
> >>
> >> One major part that needs additional work is signaling. The issue is
> >> the following: when a communication channel is established between 2
> >> parties, they exchange their IP addresses in the signaling messages.
> >> A softphone in a VM will advertise its virtual IP address in this
> >> situation, but we need the client to be the receiving end, so
> >> the client's IP address must appear in the softphone's message. And we
> >> want our solution to be as signaling-protocol agnostic as possible,
> >> i.e. parsing and changing IP addresses in signaling messages isn't an
> >> option (and signaling traffic is usually protected by TLS
> >> anyway). Dealing with this is a big open question (one for a networking
> >> guru!). I'd love any comments / possible solutions for this!
> >>
> >> How I see this problem:
> >> 1. In VDI server with real address IP1 there is a VM with some address
> >> IP2 (NAT or not - not specified)
> >> 2. At the VM, an application is running (softphone)
> >> 3. User connects via Spice client, from a client/PC with address IP3
> >> 4. Need to trick the softphone into thinking it is running at the
> >> machine with IP3
> >> 5. The softphone signaling should continue to work normally otherwise
> >> 6. All other applications at the VM should continue to work normally
> >> using address IP2
> >> Variables to play with: NAT, virtual network driver, configuration of
> >> softphone (we can expect it uses particular ports for media)
> >>
> >> My only seemingly implementable idea works only for a softphone
> >> that supports ICE (NAT traversal), and only when there is a
> >> STUN-traversable NAT (i.e. not a symmetric one). The workflow:
> >> 1. Once the Spice client connects, it also establishes a connection to
> >> a fake STUN server and instructs it to translate IP2->IP3.
> >> 2. Once the softphone attempts to make or receive a call, it asks the
> >> STUN server for a candidate IP address and receives IP3.
> >> The fake STUN server implementation is TBD.
> >>
> >> Weak spots:
> >> a) What about other applications relying on this 'STUN' server? They
> >> probably won't work.
> >> b) What if there is no NAT at all? The softphone will detect P2P
> >> connectivity and probably won't use ICE...
> >> c) What about softphones which do not support ICE?
> >> I also thought about a solution involving custom virtual network
> >> drivers, but it seems impossible to split the behavior for the softphone
> >> and for the rest of the system at such a low level...
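
The fake STUN idea above can be sketched roughly as follows: a translation table that maps the VM address (IP2) to the client address (IP3), plus the encoding of the address a STUN Binding response would carry. The `FakeStunMapper` class and its methods are hypothetical; the XOR-MAPPED-ADDRESS layout and magic cookie follow RFC 5389, and the addresses are made-up examples. This is not a complete STUN server (no message header, transaction ID or socket handling).

```python
import ipaddress
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389


class FakeStunMapper:
    """Hypothetical translation table: VM address (IP2) -> client address (IP3)."""

    def __init__(self):
        self._table = {}

    def register(self, vm_ip, client_ip):
        # Called when the Spice client connects (step 1 of the workflow).
        self._table[vm_ip] = client_ip

    def reflexive_address(self, source_ip):
        # Unregistered sources get their real address, like a normal STUN server.
        return self._table.get(source_ip, source_ip)


def xor_mapped_address(ip, port):
    """Encode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(ip)) ^ MAGIC_COOKIE
    # Layout: reserved byte, family (0x01 = IPv4), X-Port, X-Address.
    return struct.pack("!BBHI", 0, 0x01, x_port, x_addr)


mapper = FakeStunMapper()
mapper.register("10.0.0.5", "192.168.1.20")    # IP2 -> IP3, example values
candidate = mapper.reflexive_address("10.0.0.5")
attribute = xor_mapped_address(candidate, 5004)
```

Note that the table is keyed on the VM's address only, so every application in the VM that queries this server gets IP3 back. That is exactly weak spot a).
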
> >
> > It looks like you have thought about this a lot. You should start
> > documenting this on a spice-space wiki
> > http://www.spice-space.org/page/PlannedFeatures.
> >
> > Also, since this feature is quite specific to voip-domain, I think it is
> > best to ask voip people about the tricks you can do, on Telepathy or Ekiga
> > mailing list for example.
> >
> >> *VMC Ultimate*
> >> This is the last step forward, to cover softphones that aren't based on
> >> a common media engine (things like Skype can probably be covered).
> >> The main idea is to detect what media action a softphone is performing
> >> and actually perform it in the remote VMC Engine. The approach assumes
> >> the signaling issue is solved somehow. Amended virtual audio, video and
> >> network devices/drivers are needed. For example, let's look at the
> >> incoming audio call scenario, where the user picks up the call. The VMC
> >> Engine at the client detects the presence of an incoming audio stream,
> >> parses out the codec (pcap?) and makes the virtual network send a sample
> >> stream to the softphone. Once the softphone starts playing out, the
> >> virtual audio device detects the decoded sample in the output. This is
> >> the sign for the VMC Engine to start playing out the incoming audio
> >> stream. And once the softphone starts reading a sample from the virtual
> >> microphone and sending an encoded audio stream in the output, this can
> >> be detected at the virtual network level, and the VMC Engine would start
> >> the actual outgoing audio stream.
> >>
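
The audio scenario above boils down to a small event-to-action mapping, sketched here under stated assumptions. All event and action names, and the `CallTracker` class, are hypothetical; the sketch covers only the decision logic and leaves out how the virtual devices would actually detect anything.

```python
from enum import Enum, auto


class Event(Enum):
    INCOMING_STREAM_SEEN = auto()       # engine sees the incoming stream at the client
    DECODED_SAMPLE_AT_SPEAKER = auto()  # virtual audio device sees the known sample
    ENCODED_SAMPLE_ON_NETWORK = auto()  # virtual network sees the softphone sending


class Action(Enum):
    FEED_SAMPLE_TO_SOFTPHONE = auto()   # virtual network injects the sample stream
    START_LOCAL_PLAYOUT = auto()        # engine plays the real incoming audio
    START_OUTGOING_STREAM = auto()      # engine sends the real microphone audio


# One action per detected event, following the scenario in the text.
TRANSITIONS = {
    Event.INCOMING_STREAM_SEEN: Action.FEED_SAMPLE_TO_SOFTPHONE,
    Event.DECODED_SAMPLE_AT_SPEAKER: Action.START_LOCAL_PLAYOUT,
    Event.ENCODED_SAMPLE_ON_NETWORK: Action.START_OUTGOING_STREAM,
}


class CallTracker:
    """Records which actions have been triggered for one call."""

    def __init__(self):
        self.actions = []

    def on_event(self, event):
        action = TRANSITIONS[event]
        if action not in self.actions:  # ignore repeated detections
            self.actions.append(action)
        return action


tracker = CallTracker()
for seen in (Event.INCOMING_STREAM_SEEN,
             Event.DECODED_SAMPLE_AT_SPEAKER,
             Event.ENCODED_SAMPLE_ON_NETWORK):
    tracker.on_event(seen)
```

The hard part, of course, is the detection itself, which this sketch takes for granted.
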
> >> The same can be done for video: we need to detect the presence of a
> >> decoded video stream sample and notify the VMC Engine to start
> >> rendering, and the stream should be rendered over the appropriate
> >> window using the VMC OverlayRenderer. Samples need to be really simple
> >> (e.g. a black picture with a label as a video stream) so that they can
> >> be easily generated (or even pre-loaded) and detected, and processing
> >> them should take as little server CPU as possible (this holds not for
> >> all codecs, only for those whose encoding/decoding complexity depends
> >> on the incoming data; for adaptive codecs like H.264 SVC, samples
> >> smaller than the actual video stream decoded in the VMC Engine can be
> >> used...)
> >>
> >> Does this make any sense? I was inspired by Spice's current detection
> >> of video streams... It probably looks too complicated and risky.
> >>
> >> Obvious hard case for all these VMC schemes - encrypted media streams
> >> (usually SRTP).
> >
> > Indeed, that's what I was going to ask ;)
> 
> AFAIK, most open-source softphones do not have SRTP yet... It's more of
> an enterprise feature, and fairly limited, as it protects only the "first
> hop", not the whole conversation.
> We need to think about an API-based solution for SRTP, e.g. an
> encrypt/verify plug-in at the client, with an agent API to provide the keys.
> 
> --
> Best regards,
> Fedor
>