Clipboard/Selection Target Types

Wed Dec 31 22:30:21 EET 2003

On Wed, 2003-12-31 at 00:38, Dom Lachowicz wrote:
> Hi David,
> 
> Though what we've been doing to-date might be out of
> standard or even wrong, there is some precedent and
> agreement in this area amongst various application
> developers.
> 
> AbiWord, Gnumeric, Evolution, OpenOffice, Mozilla, et al.
> have been using mime-type based atoms to
> interchange data. Specifically, some of the mime-type
> atoms we've been using include:
> 
> text/html
> application/xhtml+xml
> application/rtf
> image/png
> image/jpeg
> image/svg
> 
> IIRC, Mozilla posts UCS2 HTML fragments under the
> "text/html" atom. AbiWord (and, iirc, Gnumeric and
> Evolution) has been posting UTF8 HTML 4.0 there, and
> XHTML+UTF8 1.0 under "application/xhtml+xml". I know
> that OpenOffice pays attention to the text/html
> atoms, as rich-text cut+paste between Abi, Mozilla,
> and OO is possible.

Thanks for the information.

I wrote some test code in Conglomerate to try and see which of the above
target types are offered by the applications I've got installed (Red Hat
9).

Evolution (1.2.2) offered:
	- "UTF8_STRING" as UTF-8 (format reported as 8)
	- "text/html", appears to be UCS-2, format 16, though I got some
trailing junk characters (could be a bug in my code).  Seems to be a
well-formed subtree of the HTML source tree.  Appears to return
capitalised element names, which would be a problem for me: I'd prefer
strict XHTML, i.e. lowercase.

XEmacs (21.4) wouldn't offer any of the above, only text.

Mozilla offered:
	- "UTF8_STRING" as UTF-8, format 8
	- "text/html", appears to be UCS-2, but with format=8; it seems to be a
fragment of the document source, a well-formed subtree as far as I could
make out.

OpenOffice.org Writer (1.0.2) offered:
	- "UTF8_STRING" as UTF-8, format=8
	- "text/html" as a UTF-8 document, format=8, consisting of a <!DOCTYPE
HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> declaration followed by an <HTML>
element, <HEAD>, and a <BODY> containing the highlighted text; all tags
capitalised, with some stylistic information embedded in comments.
	- "text/plain"; probably as UCS-2, judging by the length, though
format=8

AbiWord (1.0.4) didn't offer any of the above.

So there seems to be some disagreement over "text/html":
(i) What is the encoding?  -  have I missed something here, or is there
no way of telling what encoding the data is in?
(ii) Should the data be a fragment of a document, or a full document?
(iii) Is there a way of insisting on XHTML for those of us who prefer
XML?
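
To illustrate why (i) is a problem: without any encoding information on
the target, a recipient is reduced to guessing from the bytes.  Below is a
minimal sketch of such a guess in Python (used purely for illustration).
The function name guess_html_encoding is hypothetical, and the heuristic
(a byte-order mark, or a NUL in every other byte) is an assumption about
mostly-ASCII markup, not something any of the applications above promise.

```python
import codecs


def guess_html_encoding(data: bytes) -> str:
    """Guess the encoding of bytes received for the "text/html" target.

    Heuristic only: the target itself says nothing about the encoding.
    """
    # A byte-order mark, if present, is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"

    # Mostly-ASCII markup stored as UCS-2/UTF-16 has a NUL in every other byte.
    half = len(data) // 2
    even_nuls = data[0::2].count(0)  # NULs at byte offsets 0, 2, 4, ...
    odd_nuls = data[1::2].count(0)   # NULs at byte offsets 1, 3, 5, ...
    if half and odd_nuls * 2 > half:
        return "utf-16-le"
    if half and even_nuls * 2 > half:
        return "utf-16-be"

    # Fall back to UTF-8, which also covers plain ASCII.
    return "utf-8"
```

A guess like this fails on text that is mostly non-Latin, which is one
more argument for fixing the encoding in the specification rather than
sniffing it.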

> 
> It would probably be beneficial to codify these
> techniques and put them into a freedesktop standard - 
>  especially for a format like HTML. Using HTML on the
> clipboard is a little odd, in that we'd probably have
> to discuss at least:
> 
> 1) A standard encoding
> 2) What to do wrt embedded entities (images,
> stylesheets, etc...)
> 3) Do we allow document fragments or only valid
> documents

I want to be able to support document fragments, rather than requiring
full, valid documents.   Here's the beginnings of a proposal for this:

Invent a new atom: "text/xml-fragment" to signify a fragment of XML
source data.

The data is a fragment of XML source, stored in UTF-8 encoding.  

It is in the "native" document type of the data provider, hence a
DocBook editor would offer a fragment of DocBook, an HTML editor would
offer XHTML, etc; it's up to the recipient of the data to convert it
into another document type if need be.

In order to allow lists of elements to be encoded, we need a "holder"
element that encloses the entire source.  This lets us re-use existing
XML parsing code - simply parse the UTF-8 buffer as a regular XML file,
and strip away the holder element after parsing to yield a list of
top-level nodes.

I suggest that we require this element to be <xml-fragment> to emphasize
the fact that this is a fragment of an XML document, rather than a full
one.  The XML fragment has to be well-formed, but doesn't have to be
valid (and probably won't be, due to this wrapper element).
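
On the provider side, the wrapping step could look roughly like the
sketch below.  It uses Python's xml.dom.minidom purely for illustration;
the names HOLDER and wrap_fragment are hypothetical, and a real
application would serialise from whatever tree it already holds.

```python
from xml.dom import minidom

HOLDER = "xml-fragment"  # holder element name suggested in this proposal


def wrap_fragment(nodes) -> bytes:
    """Serialise DOM nodes inside the holder element, as UTF-8 bytes."""
    # toxml() on each node gives its XML source text (elements, text, comments).
    body = "".join(node.toxml() for node in nodes)
    return f"<{HOLDER}>{body}</{HOLDER}>".encode("utf-8")


# Example: offer the children of a <para> element as a fragment.
doc = minidom.parseString("<para>This is a <emph>simple</emph> example.</para>")
payload = wrap_fragment(doc.documentElement.childNodes)
```

The payload is what would be handed over when another client requests
the "text/xml-fragment" target.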

Examples of legal data:
<xml-fragment>This is a <emph>simple</emph> example.</xml-fragment>
<xml-fragment>This example contains a comment <!-- Hello world -->
which hopefully should be supported sanely.</xml-fragment>
<xml-fragment><h1>Chapter 1</h1></xml-fragment>
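
On the recipient side, parsing the buffer as a regular XML document and
stripping the holder might be sketched as follows.  Again this is Python
with xml.dom.minidom purely for illustration; HOLDER and unwrap_fragment
are hypothetical names, and rejecting a payload with a different root
element is my assumption, not a settled rule.

```python
from xml.dom import minidom

HOLDER = "xml-fragment"  # holder element name suggested in this proposal


def unwrap_fragment(data: bytes):
    """Parse a "text/xml-fragment" payload and return its top-level nodes."""
    # Decode explicitly so that a payload that is not UTF-8 fails early.
    text = data.decode("utf-8")
    # Parse as an ordinary XML document; this raises if it is not well-formed.
    doc = minidom.parseString(text)
    root = doc.documentElement
    if root.tagName != HOLDER:
        raise ValueError(f"expected <{HOLDER}>, got <{root.tagName}>")
    # Strip the holder: what remains is the list of top-level nodes.
    return list(root.childNodes)


nodes = unwrap_fragment(
    b"<xml-fragment>This is a <emph>simple</emph> example.</xml-fragment>"
)
```

For the first example above this yields a text node, an <emph> element
and another text node, ready to be inserted at the paste position.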

Undecided issues:
(i) Can we allow a document type declaration before the xml-fragment
element? (perhaps optionally)
(ii) If so, should the DOCTYPE supply full details of all of the
entities defined in the document?  What do we do about entities?
(iii) should we require an <?xml ?> header?  If we do, can we relax the
UTF-8 requirement and allow normal rules about XML encoding?

Hope the above makes sense.  I'm not convinced about it myself, though
it seems to cover most of the areas I need.

Alternatively, we need to specify _exactly_ what should go into
"text/html" - currently the encoding issue is a blocker for me.  Should
I write up the various issues and behaviours somewhere?

Dave

> 
> I'm willing to help out here as much as is needed.
> Something like "COMPOUND_TEXT" isn't flexible enough
> to describe the large variety of formats these complex
> applications can support, and sniffing clipboard
> contents is both prohibitive and restrictive.
> 
> Best regards,
> Dom
> 
> --- Dave Malcolm <david at davemalcolm.demon.co.uk>
> wrote:
> > Does Freedesktop.org maintain a list of "well-known"
> > X selection target
> > atoms?  Or can someone direct me to an up-to-date
> > one?
> > 
> > The lists I've found so far are: 
> > (i) in section 2.6.2 of the ICCCM here:
> > http://tronche.com/gui/x/icccm/sec-2.html
> > (ii) ftp://ftp.x.org/pub/R6.6/xc/registry
> > 
> > These don't seem particularly up-to-date or
> > comprehensive, for instance,
> > what target should I use for fragments of HTML?  (or
> > am I just being
> > dumb?)
> > 
> > In particular, I'm writing an XML editor
> > (www.conglomerate.org) which
> > can support multiple XML formats, although it's
> > primarily aimed at
> > DocBook editing.
> > 
> > I want to interoperate with other XML editors, and
> > with web browsers. 
> > I'd like to be able to distinguish between different
> > XML types when
> > copying/pasting fragments of XML source.  For
> > example if a user pastes a
> > fragment of Kernel Cousin XML Source into a DocBook
> > document, I'd like
> > the program to be able to convert all the <li> tags
> > into <listitem> tags
> > automatically.  Also, when the user copies a
> > fragment of DocBook source
> > to the clipboard, it would be nice to be able to
> > offer it as HTML in
> > case they want to paste it into an email client.
> > 
> > So:
> > (i) Is there an atom for "fragment of XML source",
> > with some agreed-upon
> > encoding (probably UTF-8)?
> > (ii) Is there a smart target atom that's more than
> > just "XML Source" and
> > that carries some kind of DTD information - perhaps
> > the Public ID of the
> > DTD?
> > (iii) Is there an agreed-upon atom for fragments of
> > HTML source?
> > 
> > Should I write some kind of proposal for (ii) above,
> > similar to the one
> > for UTF8_STRING?
> >
> http://www.pps.jussieu.fr/~jch/software/UTF8_STRING/UTF8_STRING.text
> > 
> > Thoughts?  Links?  Flames?
> > 
> > -- 
> > David Malcolm
> > www.conglomerate.org
> > 
> > _______________________________________________
> > Xdg-list mailing list
> > Xdg-list at freedesktop.org
> > https://www.redhat.com/mailman/listinfo/xdg-list
-- 
David Malcolm
www.conglomerate.org