[xliff-tools] xml to xliff

Tim Foster Tim.Foster at Sun.COM
Mon Apr 4 22:49:17 EST 2005


Hey Josep,

On Mon, 2005-04-04 at 12:31, Josep Condal wrote:
> Paragraph level segmentation is more conservative, sentence level is
> more risky.

Yep, I hear ya - and agree. In the implementations we have for
documentation-like formats, we try to use whatever hints we can from
the source file format to "chunk" sections of translatable text and
then look for clues to determine which areas of text should be
protected from the segmenter, so :

<p>This is a piece of html that has <code>System.out.println("Hi
there!");</code> some java inline and some <b>bold</b> text.</p>

would be treated as one segment - we have a two-layer segmenter : the first
layer is customised to the source file format, and the second is used across
all formats, which gives us consistent segments. Given the same section of
text as plain ASCII, without any markup to guide segmentation, we'd get some
pretty weird segments.
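
Just to make that concrete - this isn't our actual filter code, only a quick
sketch of the idea - the first layer can mask inline regions like that <code>
element with placeholders, so the second-layer segmenter never gets a chance
to split inside them:

import java.util.*;
import java.util.regex.*;

// Quick illustration only, not our real filter. The first layer masks
// inline markup (here just <code>...</code>) with naive numbered
// placeholders so the second-layer segmenter can't split inside it,
// and puts the original text back afterwards.
public class InlineProtector {

    private static final Pattern CODE =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Replace protected regions with placeholders, remembering the originals.
    public static String protect(String text, List<String> saved) {
        Matcher m = CODE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            saved.add(m.group());
            m.appendReplacement(sb,
                    Matcher.quoteReplacement("{{" + (saved.size() - 1) + "}}"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Put the protected regions back after segmentation.
    public static String restore(String text, List<String> saved) {
        for (int i = 0; i < saved.size(); i++) {
            text = text.replace("{{" + i + "}}", saved.get(i));
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> saved = new ArrayList<String>();
        String p = "<p>This has <code>System.out.println(\"Hi there!\");</code>"
                + " some java inline and some <b>bold</b> text.</p>";
        String masked = protect(p, saved);
        System.out.println(masked);                 // placeholder instead of the code
        System.out.println(restore(masked, saved)); // round-trips back to the original
    }
}

Run on the html snippet above, protect() hands the segmenter a single
sentence with one opaque placeholder in it, and restore() reinstates the
java code once segmentation is done.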

Now, that said, the second-level segmentation routines we're using at the
moment are pretty simple (done using javacc - roughly lex/yacc for Java), but
at some stage we could switch over to full-blown NLP techniques to segment
sentences (parse the paragraph into sentences, looking for verbs, subjects,
objects, noun-phrases, that sort of thing) : that's a bit trickier though, and
we just haven't got the resources for it right now.
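
For anyone curious what "pretty simple" means in practice, the second layer
is conceptually not much more than the sketch below (ours is a javacc grammar
rather than a regex, so treat this purely as an illustration):

import java.util.*;
import java.util.regex.*;

// Hugely simplified stand-in for the second-layer segmenter: split after
// sentence-final punctuation followed by whitespace, unless the last
// "word" is a known abbreviation. Anything not on the list gets treated
// as the end of a sentence - which is exactly the failure mode with
// unknown abbreviations mentioned further down.
public class SimpleSegmenter {

    private static final Set<String> ABBREVS =
            new HashSet<String>(Arrays.asList("e.g.", "i.e.", "etc.", "Mr.", "Dr."));

    public static List<String> segment(String paragraph) {
        List<String> segments = new ArrayList<String>();
        Matcher m = Pattern.compile(".*?[.!?]+(?:\\s+|$)").matcher(paragraph);
        StringBuilder current = new StringBuilder();
        int consumed = 0;
        while (m.find()) {
            current.append(m.group());
            consumed = m.end();
            String candidate = current.toString().trim();
            String lastWord = candidate.substring(candidate.lastIndexOf(' ') + 1);
            if (!ABBREVS.contains(lastWord)) {  // don't split after a known abbreviation
                segments.add(candidate);
                current.setLength(0);
            }
        }
        // anything left over (e.g. trailing text with no final punctuation)
        String rest = (current.toString() + paragraph.substring(consumed)).trim();
        if (rest.length() > 0) {
            segments.add(rest);
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segment(
            "See the manual, e.g. chapter 3. It explains the filters. Dr. Foster wrote it."));
    }
}

The hard-coded abbreviation list is where the limits show up: anything the
segmenter hasn't been told about will cause a split mid-sentence.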

> To get an idea of what I mean, if you make a segmentation at character
> level of the Bible, you get a nominal word count with 26 words and a few
> tens of millions of nominal repetitions. If you negotiate the repetitions
> value aggressively, you may get a good price as there are only 26 new
> words ;) 

Nice analogy :-)

> For example, if the writer of the original text uses an unknown
> abbreviation, the segmenter may break the segment in the middle and the
> unity of meaning of the segment is lost.

Yep, absolutely ! In most cases we can rely on other hints in the source
document to spot areas that could cause problems for the segmenter, but we
don't catch all of them - so I understand the concerns; likewise, such areas
aren't always marked up as in the example I gave above...

Now, in contrast, our software message file filters (for .java, .po,
.properties and .msg) use the entire message as a segment : we don't try to
chop those messages up any further since, as you say, often 1 message == 1
segment, though that isn't always the case.
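
By way of contrast, the message-file case really is as simple as it sounds -
something along these lines for .properties (again just a sketch, not the
real filter):

import java.io.*;
import java.util.*;

// Sketch of the "one message == one segment" approach for .properties
// files: every value becomes a single translation unit keyed by its
// property name, and no further segmentation is attempted.
public class PropertiesFilter {

    public static Map<String, String> extractSegments(File file) throws IOException {
        Properties props = new Properties();
        InputStream in = new FileInputStream(file);
        try {
            props.load(in);
        } finally {
            in.close();
        }
        Map<String, String> segments = new LinkedHashMap<String, String>();
        for (Enumeration<?> e = props.propertyNames(); e.hasMoreElements();) {
            String key = (String) e.nextElement();
            segments.put(key, props.getProperty(key)); // whole message as one segment
        }
        return segments;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> segments = extractSegments(new File(args[0]));
        for (Map.Entry<String, String> entry : segments.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}

The .po, .java and .msg filters parse differently, of course, but the
segmentation decision is the same: the whole message, untouched.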

It's a difficult judgement to make, but for documentation, where you have
a "book-like" section of text to translate, we choose sentence-level
segmentation, whereas for message files we leave it at the level of the
individual message.

Hope this is of interest ?

	cheers,
			tim


> 
> 
> -----Original Message-----
> From: xliff-tools-bounces at lists.freedesktop.org
> [mailto:xliff-tools-bounces at lists.freedesktop.org] On behalf of Tim
> Foster
> Sent: Monday, 04 April 2005 13:08
> To: cobaco (aka Bart Cornelis)
> CC: xliff-tools at lists.freedesktop.org
> Subject: Re: [xliff-tools] xml to xliff
> 
> Hi cobaco,
> 
> On Mon, 2005-04-04 at 11:32, cobaco (aka Bart Cornelis) wrote:
> > On Monday 04 April 2005 09:59, Tim Foster wrote:
> > > a segment/msgid out of each paragraph, vs. ours that creates a 
> > > segment/msgid out of every sentence.
> 
> > hm, I'm not at all sure that's a good idea:
> > the smaller the granularity of the to-be-translated bits, the harder 
> > non-literal translation becomes, and especially for documents that can
> > make a large difference in the quality of the translation.
> 
> Define "quality" (only joking!)
> 
>  - but seriously, we've been using sentence-level segmentation at Sun
> for all of our docs material for the past 3 years (with our internal
> tools, no idea what the translation vendors we were using before were
> using wrt. paragraph vs. sentence segmentation) and have found that it's
> really not a problem. Linguistic reviewers have been generally happy
> with the quality of Sun documentation.
> 
> Now, I suspect that part of this could be due to the excellent technical
> writers we have and some style-checking tools which are used in the
> authoring process to catch sentence-structures that may be difficult to
> translate.
> 
> Along with that, since the translators are always shown sentences in
> their correct context wrt. other sentences in the paragraph, and can
> choose multiple different translations for the same sentence (based on
> the book name, product name, part number and other attributes) this
> seems to work okay. Of course, only allowing one possible translation
> per source sentence would result in a very poor quality translation : we
> don't do that.
> 
> 
> I'm not a translator or a linguist, so I can't argue the finer points of
> this, except to say that we haven't found it to be a limiting factor at
> all and customers haven't been complaining about our translation
> quality.
> 
> (docs.sun.com I think has some translated books, if you want to check
> them out)
> 
> 
> 	cheers,
> 			tim
> --
> Tim Foster - Tools Engineer, Software Globalisation
> http://sunweb.ireland/~timf http://blogs.sun.com/timf
> http://www.netsoc.ucd.ie/~timf
> 
> _______________________________________________
> xliff-tools mailing list
> xliff-tools at lists.freedesktop.org
> http://lists.freedesktop.org/cgi-bin/mailman/listinfo/xliff-tools
-- 
Tim Foster - Tools Engineer, Software Globalisation
http://sunweb.ireland/~timf http://blogs.sun.com/timf
http://www.netsoc.ucd.ie/~timf
