Q: Is [collaboration] of some interest? [yes!]

Sat Mar 24 09:59:47 PDT 2012

On Mon, Mar 12, 2012 at 1:26 PM, Eike Rathke <erack at redhat.com> wrote:
> Hi Riccardo,
>
> On Thursday, 2012-03-08 10:39:47 +0000, Michael Meeks wrote:
>
>> > For several reasons (let me skip them), at my University we are
>> > thinking about starting a project involving ODF and we would like to
>> > know if there is already something similar to what we would like to
>> > produce and if the LibreOffice community could have some interest in
>> > it.
>>
>>       Wonderful - there is already some similar work underway that Eike is
>> looking into, it would be great to have you work with him.
>
> Right, let's coordinate things a bit. Seems this topic comes up more
> frequently recently, we should really avoid conflicting approaches.
>
>
>> > In our vision, each user has control about a "section" of the text
>> > (for simplicity, we are aiming mainly to text documents) and every
>> > changes made to the document are propagated to all the users currently
>> > on-line (with an experience similar to "Google Docs").
>>
>>       Right,
>>
>> >   The difference with Google Docs is that the document is not in some
>> > fuzzy "cloud," but on the user's disk and the user can edit it while
>> > off-line, encrypt it, ...  If the user does some changes while off-line,
>> > the other copies will receive the updates as soon as the user returns
>> > on-line.  Different copies will exchange updates in a peer-to-peer
>> > fashion, without the need of a centralized repository (the bazaar/git
>> > flavor).
>>
>>       Ok - so, (I hope) our focus first would be the on-line co-editing, and
>> then use/fall-back to (and improve) the document merging / comparison
>> functionality to do on-line/off-line merges.
>
> That's what I had in mind as well. My approach would be to use
> a change-tracking enabled document, because (besides that it gives you
> the benefit of being able to display who changed what when) then
> actually only trackable (content, not attributes) features are enabled,
> and the current file based collaboration (aka shared document) uses this
> mode as well, as does the compare/merge document feature. It also
> provides functionality to accept/reject changes. Note that I'm Calc
> biased here, Writer doesn't have the shared document feature yet, though
> it does have change-tracking.
>
>
>> > 1.  Are you aware if this type of capability is already available (I
>> > do not think so) or currently developed?
>>
>>       There is work underway, to bootstrap this via instant messaging, and
>> particularly the Telepathy framework - it would be great to make that
>> more public / visible and get more hands onto playing with it. Eike - do
>> you have something that could go into a feature branch ? :-)
>
> I can commit the remainders of what I have (threw away the first
> unpromising approach) to a feature branch this week.
>
>
>> > 3. Do you have some general suggestions for us?  Especially about
>> > interfacing the rest of the developers.
>>
>>       So - first, talk to Eike (preferably CC'ing the list here). Second -
>> here is what I was trying to persuade Eike was a sensible way of doing
>> it (which he's prolly detected as insane already ;-).
>
> Actually not that much ;-)
>
>> Please bear in
>> mind we're starting with calc here ...
>>
>>       Here are my thoughts:
>>
>>       * It doesn't matter what you do to the document, as long as
>>         everyone's document does the same thing.
>>
>>       * Thus - whatever protocol you use, it needs to enforce hard
>>         ordering, such that edits 'A1', 'B1', 'A2' 'C1' end up in
>>         the same order for A, B, and C regardless of latency /
>>         topology etc.
>
> This is absolutely a must, especially when it comes to edits that move
> things in the document, such as inserting/deleting rows/columns or
> moving cells.
>
>>       * Jabber provides this guarantee :-) and a beautiful way of
>>         bootstrapping communication from an existing communication
>>         tool: telepathy/empathy/IM
>
> Yup, and a hard one to deal with..
>
>>       * Those edits need to do -exactly- the same thing, ie. we'd want
>>         the same major version of LibreOffice at each end.
>
> I'd rather version the collaboration feature, so each end can
> announce/handshake on the minimum collaboration version required,
> instead of tie it to the LibO version.
>
>>       ** But ** - and here is where the work starts
>>
>>       * We need to ensure that all edits to the document are not
>>         applied immediately, but described and dispatched to the
>>         Jabber server, and only the events returned are applied.
>>
>>       * This means we need a -clean- Controller <-> Model split
>>         which we currently don't have ;-) -although- some things
>>         are really quite pleasant, eg. dialogs often tend not to be
>>         instant apply, and to collect up their changes into
>>         abstract SfxItemSets (PropertyBags to you and me) so with
>>         work we can tease out the controller perhaps.
>
> That would be a long run, but yes, at the end that's probably what we
> want.
>
>>       * And of course, some thinking of good ways of managing
>>         cursor locations, and transmitting other people's
>>         movement around documents to maintain sensible editing state
>>         is necessary.
>
> I don't think tracking cursor locations is needed. An edit action would
> be transmitted as "at position (or range) so and so do this and that".
>
> Maybe locking a region to announce "I'm going to edit here" would come
> handy to prevent clashes.
>
> In Calc, the ScDocFunc provides almost what's needed and is already used
> by UI and API (not consistently, but to a great amount), feeding it from
> edit actions as an intermediate layer should be possible. This again
> made me think of reusing the existing API and serialize it through
> online editing, not sure how far we could go there, but once the basics
> were implemented we'd cover a great deal of functionality almost at no
> cost.
>
>  Eike
>
> --
> LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
> GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD

Michael, Eike,
sorry for the long silence.  I wanted to write a thoughtful reply, and
I finally have a time slot long enough (I am on an 8-hour train trip;
that should be plenty).

Reading your mails, I understood that maybe you and I have different
models in mind.  Let me describe mine.  Take a comfortable chair...
:-)

A first difference between my model and yours is that yours has a
"Google doc" flavor where everyone can edit everything  (although Eike
suggests a region locking), while in my model different document
regions are assigned to different editors and each editor can modify
only his own region.  Although this could seem less "elegant" than the
Google approach, my personal experience (I often write documents
[project proposals, papers, ...] in collaboration with others) is that
usually you resort to some form of "informal locking,"  saying, for
example, that you will take care of the introduction, Alice of the
state of the art and Bob will draw the Gantt chart.  So maybe it is
more convenient to transform that informal locking into a true one,
enforced by the editing software.  This would also solve the problem
of serializing the changes, since each region has exactly one writer.

In the locking model that I have in mind, the person who creates the
document becomes the "owner" of the document.  Initially the whole
document is covered by a single region, and the editor of that region
is the owner himself.  Portions of a region can be "given" to other
editors by the editor in charge of the region, but in an emergency
(say, the editor is sick) a region can be "taken away" by the document
owner (i.e., the person who initially created the document).  Note
that this model semi-centralizes the changes to the region layout, and
this could make synchronization simpler.
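
To make the give/take rules above concrete, here is a minimal Python
sketch.  Everything in it is an assumption made for illustration: the
names Region, Document, give and take_away are hypothetical, regions
are modeled as half-open ranges of paragraph indices, and nothing here
reflects LO internals.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int   # first paragraph index covered (inclusive)
    end: int     # one past the last paragraph index covered
    editor: str  # the only user allowed to modify this region


class Document:
    def __init__(self, owner: str, length: int):
        self.owner = owner
        # Initially one region covers everything and the owner edits it.
        self.regions = [Region(0, length, owner)]

    def _index_at(self, pos: int) -> int:
        for i, region in enumerate(self.regions):
            if region.start <= pos < region.end:
                return i
        raise IndexError(f"position {pos} is outside the document")

    def can_edit(self, user: str, pos: int) -> bool:
        return self.regions[self._index_at(pos)].editor == user

    def give(self, requester: str, start: int, end: int, new_editor: str) -> None:
        """The editor in charge hands [start, end) over to new_editor."""
        i = self._index_at(start)
        region = self.regions[i]
        if region.editor != requester:
            raise PermissionError("only the editor in charge can give a region")
        if not (start < end <= region.end):
            raise ValueError("range must lie inside a single region")
        # Split the old region into up to three pieces around [start, end).
        pieces = []
        if region.start < start:
            pieces.append(Region(region.start, start, region.editor))
        pieces.append(Region(start, end, new_editor))
        if end < region.end:
            pieces.append(Region(end, region.end, region.editor))
        self.regions[i:i + 1] = pieces

    def take_away(self, requester: str, pos: int) -> None:
        """Emergency case: the document owner reclaims the region at pos."""
        if requester != self.owner:
            raise PermissionError("only the document owner can take a region away")
        self.regions[self._index_at(pos)].editor = self.owner
```

Because only the editor in charge (or the owner) can change the layout
of a region, each layout change has a single well-defined author, which
is the "semi-centralized" property mentioned above.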

What if someone sees a mistake in a section different from his?  A
(very, very simple) solution could be to write an e-mail to the editor
in charge... If we want something more sophisticated, we can allow
anyone to add "proposed changes" to a non-owned section.  I'm thinking
of something like a comment attached to a section.  Since different
comments do not interact with each other, the problem of serializing
the changes becomes much simpler.  If we can make this a special type
of comment, we can allow the section owner to accept the suggested
changes by just clicking on a button.  (Please note that I do not know
anything about the internals of LO or ODF, so what I am suggesting
could be almost impossible...)

Maybe another important difference is that you are thinking about
showing each editor "in real time" the changes made by the other
editors.  Instead, I am thinking about transferring the changes in
"batch" mode, sending not the "editing commands" but updates to the
actual structure of the document.  A rough description of what would
happen is the following:

  * Each section has a version number that is increased as the
editor makes changes.  (When is a new version number created?  I am
not sure yet: maybe after a given number of changes, maybe after a
given amount of time, maybe when the user saves the document or
explicitly requests a new version.)

  * The editors' PCs form a network of nodes that communicate with
something "multicast-like", in the sense that everything a node puts
on the network is received by every other node.  I put
"multicast-like" in quotes to emphasize that the communication need
not happen over a "true multicast" protocol; it can use many other
solutions, such as multi-unicast, a centralized actor (Jabber?), or
an overlay multicast protocol.

  * When the user goes on-line, the PC tries to join the network of
editors.  If it succeeds, it sends over the network a description of
the version numbers of the sections (e.g., "I have version 12 of
Bob's section, version 42 of Alice's section, ...").

  * Suppose a node (say, Alice's PC) receives a section description
with version number N_remote.  Let N_local be the version number of
the same section on Alice's PC.  Alice's editor acts as follows:

      (a) If N_local == N_remote, the node does nothing.

      (b) If N_local < N_remote (i.e., the local version is
older), the node sends N_local over the network.

      (c) If N_local > N_remote (i.e., the local version is
newer), the node sends "update data" over the network.

Note that if Alice made some changes off-line, when she comes back
on-line the other editors will have an older version.  So, when Alice
broadcasts her version number, the other nodes, having a smaller
version number, will execute (b) and send their own version numbers.
When Alice receives those replies, she will be in case (c) and will
begin sending updates to the group.
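
The three cases (a)-(c) can be written down as a small decision
function.  This is only a sketch in Python: the names Action and react
are hypothetical, and what "update data" actually contains is left
open, exactly as in the description above.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "do nothing"
    SEND_VERSION = "send own version number"
    SEND_UPDATE = "send update data"


def react(n_local: int, n_remote: int) -> Action:
    """Decide what a node does on hearing version n_remote of a section."""
    if n_local == n_remote:
        return Action.NOTHING       # case (a)
    if n_local < n_remote:
        return Action.SEND_VERSION  # case (b): the local copy is older
    return Action.SEND_UPDATE       # case (c): the local copy is newer


# Alice edited off-line up to version 42; Bob still has version 40.
alice, bob = 42, 40
print(react(bob, alice).value)   # Bob hears "42": send own version number
print(react(alice, bob).value)   # Alice hears "40": send update data
```

Note that the function is symmetric in the nodes: nothing in it depends
on who wrote the section, only on who currently holds the newer copy.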

An interesting consequence of this approach is the following:

  * Suppose Alice worked off-line and now her version has number 42,
while Bob and Charlie still have version 40 of Alice's section.

  * Now Alice goes on-line and finds Bob (but not Charlie).  By the
above protocol, Alice sends update data to Bob until Bob has version
42 too.

  * Now Alice leaves and Charlie arrives.  By the same protocol, Bob
now is able to update Charlie with version 42.

In other words, every document is able to update every other document.
In this sense I say that this approach has a "git" (or "bazaar")
flavor.

Things can get a bit more complex if more than one old version is
present.  Consider the following case:

  * Suppose the current version of Alice's section is 38 and that Bob
and Charlie are up-to-date.

  * Suppose Charlie goes off-line, keeping version 38 of Alice's section.

  * Alice and Bob continue editing on-line.  When Alice leaves, her
section (shared with Bob) has version 40.

  * Alice does some editing off-line and reaches version 42.

  * Alice goes on-line again, Bob joins her and Alice begins sending
update data to go from version 40 to version 42.

  * Now Charlie arrives.  Alice must now send updates both to Bob
(40-42) and to Charlie (38-42).  Note that Charlie cannot use the data
sent to Bob until he reaches version 40 too.  Also, since Charlie
arrived late, he missed some of the data sent to Bob, and Alice will
need to resend it.
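
The bookkeeping in this scenario can be sketched as follows, under one
assumption that is not stated above: that update data is kept as
single-step deltas (version n to n+1).  The function name plan_updates
is hypothetical.

```python
def plan_updates(latest: int, peer_versions: dict[str, int]) -> dict[tuple[int, int], list[str]]:
    """Map each single-step delta (n, n + 1) to the peers that still need it."""
    oldest = min(peer_versions.values(), default=latest)
    plan = {}
    for n in range(oldest, latest):
        # A peer at version v needs every delta starting at v or later.
        plan[(n, n + 1)] = sorted(p for p, v in peer_versions.items() if v <= n)
    return plan


# Alice is at version 42, Bob at 40, Charlie at 38.
for step, peers in plan_updates(42, {"bob": 40, "charlie": 38}).items():
    print(step, peers)
# (38, 39) ['charlie']
# (39, 40) ['charlie']
# (40, 41) ['bob', 'charlie']
# (41, 42) ['bob', 'charlie']
```

The first two deltas are useful only to Charlie; Bob and Charlie can
share the last two, but only once Charlie has caught up to version 40.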

An alternative solution that can make things simpler when several
different old versions are present is the following, based on an idea
similar to fountain codes:

  * Alice computes a hash of the content of her section (MD5 would be
fine, since we do not use it for security).

  * Alice computes "linear combinations" of the content and
distributes them to the other nodes

  * It is possible to show that as soon as a node receives "enough
data" it can recover the current version of the section.  "Enough
data" here means, roughly, the minimum required amount of data plus a
small overhead.  It is also possible to show that both Bob and Charlie
can use the same data; Charlie, having the oldest version, will just
need to listen for data a bit longer.

The advantage of this approach is that the same update data can be
used by any node, regardless of the version it holds.  Things get
complex here, and for now I will skip all the (quite gory :-)
details.
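
To give a feeling for the "linear combinations" idea, here is a toy
Python sketch of a random linear fountain code over GF(2), where each
packet is the XOR of a random subset of blocks.  This is a simplified
stand-in, not the scheme itself: it ignores the old version that the
receiver already holds (so it does not show why Charlie needs to
listen only a bit longer than Bob), and the names split_blocks,
encode_packet and Decoder are hypothetical.  It only demonstrates that
any sufficiently large set of packets is enough to rebuild the
section.

```python
import hashlib
import random


def split_blocks(data: bytes, size: int) -> list[int]:
    """Zero-pad data and split it into equal blocks, stored as integers."""
    padded = data + b"\0" * (-len(data) % size)
    return [int.from_bytes(padded[i:i + size], "big")
            for i in range(0, len(padded), size)]


def join_blocks(blocks: list[int], size: int, length: int) -> bytes:
    return b"".join(b.to_bytes(size, "big") for b in blocks)[:length]


def encode_packet(blocks: list[int], rng: random.Random) -> tuple[int, int]:
    """Return (mask, payload): the XOR of a random non-empty subset of blocks."""
    mask = rng.randrange(1, 1 << len(blocks))
    payload = 0
    for i, block in enumerate(blocks):
        if (mask >> i) & 1:
            payload ^= block
    return mask, payload


class Decoder:
    def __init__(self, k: int):
        self.k = k
        self.rows: dict[int, tuple[int, int]] = {}  # highest set bit -> (mask, payload)

    def add(self, mask: int, payload: int) -> bool:
        """Gaussian elimination over GF(2); True if the packet was useful."""
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in self.rows:
                self.rows[pivot] = (mask, payload)
                return True
            row_mask, row_payload = self.rows[pivot]
            mask ^= row_mask
            payload ^= row_payload
        return False  # linearly dependent on packets already held

    def done(self) -> bool:
        return len(self.rows) == self.k

    def solve(self) -> list[int]:
        """Back-substitute from the lowest pivot up; call only when done()."""
        blocks: dict[int, int] = {}
        for pivot in sorted(self.rows):
            mask, payload = self.rows[pivot]
            for i in range(pivot):
                if (mask >> i) & 1:
                    payload ^= blocks[i]
            blocks[pivot] = payload
        return [blocks[i] for i in range(self.k)]


data = b"Alice's section, version 42"
digest = hashlib.md5(data).hexdigest()  # sent along so receivers can verify
blocks = split_blocks(data, 8)
rng = random.Random(0)
decoder = Decoder(len(blocks))
while not decoder.done():
    decoder.add(*encode_packet(blocks, rng))
recovered = join_blocks(decoder.solve(), 8, len(data))
assert hashlib.md5(recovered).hexdigest() == digest
```

The receiver never asks for a specific packet: it just keeps listening
until it holds enough independent combinations, then checks the result
against the hash.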

Finally, a few words about the underlying structure, as I imagine it
(without knowing if it is compatible with LO's internal structure).
Let me do some ASCII art:

      ----------
     /          \
     | Document |
     \__________/
          ^
          |
          v
     +------------+    User
     | Editing    |  commands  +---------+
     | "engine"   |<-----------|   GUI   |
     +------------+            +---------+
         ^   ^
         |   |
         |   +-----------+
         |               |
         v               v
     +----------+    +----------+
     | Section  |    | Update   |
     | Locker   |    | Manager  |
     +----------+    +----------+
        ^                    ^
        |                    |
        |   +------------+   |
        +-->| Para       |<--+
            | Multicast  |
            +------------+
                  ^
                  |
                  v
             To the network

A brief explanation of the blocks:

  * The "Editing engine" is the software part that actually acts on
the document, making changes to its content.  To emphasize this, the
diagram shows the "editing engine" taking requests from the user via
the GUI.

  * The editing engine interfaces with two new blocks:

      + The "section locker" takes care of the protocol used by the
nodes to distribute the changes in the section structure (for example,
when part of a section is assigned to a different editor).  The
section locker receives "section change requests" both from the user
(and distributes them to the other nodes) and from the network (and
passes them on to the engine).

      + The "update manager" takes care of the update protocol
outlined above.  It receives from the editing engine a "description"
(more about this later) of the current content of the section and, if
necessary, distributes updates to the other nodes.  Conversely,
updates received from the network are passed to the engine, which
updates the document accordingly.

  * Finally, the "para multicast" block exposes a multicast-like API
to the other two blocks, hiding from them how that multicast is
achieved (e.g., multi-unicast, overlay multicast, true multicast, ...).
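
As an illustration of what such a multicast-like API could look like,
here is a small Python sketch.  The interface (ParaMulticast with join,
leave and send) and the in-memory MultiUnicast implementation are
hypothetical; a real implementation would sit on top of sockets,
Jabber, or an overlay network.

```python
from abc import ABC, abstractmethod
from typing import Callable

Receiver = Callable[[str, bytes], None]  # called with (sender id, message)


class ParaMulticast(ABC):
    """What the section locker and the update manager get to see."""

    @abstractmethod
    def join(self, node_id: str, on_receive: Receiver) -> None: ...

    @abstractmethod
    def leave(self, node_id: str) -> None: ...

    @abstractmethod
    def send(self, sender: str, message: bytes) -> None: ...


class MultiUnicast(ParaMulticast):
    """In-memory stand-in: one separate delivery per member."""

    def __init__(self) -> None:
        self._members: dict[str, Receiver] = {}

    def join(self, node_id: str, on_receive: Receiver) -> None:
        self._members[node_id] = on_receive

    def leave(self, node_id: str) -> None:
        self._members.pop(node_id, None)

    def send(self, sender: str, message: bytes) -> None:
        # Everything a node sends reaches every other member.
        for node_id, deliver in list(self._members.items()):
            if node_id != sender:
                deliver(sender, message)


inbox: dict[str, list] = {"alice": [], "bob": [], "charlie": []}
net = MultiUnicast()
for name in inbox:
    net.join(name, lambda sender, msg, name=name: inbox[name].append((sender, msg)))
net.send("alice", b"section A is at version 42")
print(inbox["bob"])      # [('alice', b'section A is at version 42')]
print(inbox["alice"])    # []
```

The two upper blocks depend only on ParaMulticast, so the transport can
be swapped without touching them.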

Note that the engine decouples the document structure from the section
locker and the update manager.  The data format used by the engine to
communicate contents to the update manager can be different from the
actual document format.  This could solve (or simplify) the issue with
different software versions, since the only block that needs to
understand the document format is the engine.  I am aware that
defining this "intermediate format" is not a trivial issue, but it
could prove an important tool.  Note that all the nodes in the network
must agree on this intermediate format, since the exchanged updates
will be expressed in it.  Because of this, it is important that the
format be general enough not to require frequent updates.  Also,
compactness will be a plus.  (BTW, nothing prevents us from using ODF,
if it suits us.)

OK, I hope you are still with me and you did not fall (asleep) from
your chair... ;-)

This is the model that I had in mind, a model that I thought up
without knowing anything about LO's structure and knowing just a tiny
bit about the ODF format, so maybe there are some deep
incompatibilities.  I like it because of its simple structure and the
"symmetry" between the nodes (the "git" flavor), but, of course, there
is nothing sacred about it.

Riccardo