[xliff-tools] Another question on PO and XLIFF

Tue May 3 03:38:45 PDT 2005

Hey Asgeir,

On Tue, 2005-05-03 at 10:38, Asgeir Frimannsson wrote:
> Hi Tim, and welcome to the discussion :)

Thanks very much, I'm glad to be here :-) Before I get going, here's a
quick glossary entry :

I'll use '\n' to mean the actual <LF> (newline) character == \u000a and
will use "\n" to represent the string sequence of two characters,
backslash and n == \u005c\u006e.
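
To make the distinction concrete, here is a minimal Python sketch that
lists the code points of each form (the helper name code_points is only
for illustration):

```python
def code_points(text: str) -> list[str]:
    """Return the Unicode code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


newline_char = "\n"      # one character: the real newline
escape_sequence = "\\n"  # two characters: backslash followed by n

print(code_points(newline_char))
print(code_points(escape_sequence))
```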

> On Tue, 3 May 2005 17:44, Tim Foster wrote:
> > I tend to look at message files as containing the string that the
> > programmer wants to have in the message, not the string as it's finally
> > rendered on screen...
> 
> Does this mean that you can never have the actual newline character in a PO 
> fragment in your editor? (What happens if a translator presses <enter> when 
> editing a PO TU?) If not, what happens to the added newline characters on 
> back-conversion? 

Good questions : I suspect that the current behaviour is probably not
what I've outlined. If a user inserts a '\n' in a po entry in the
translation editor, then it'll get saved to the xliff file, but ignored
as part of the string during backconversion and pretty printing. In
order to get a "\n" string into the po entry, the user must actually
enter those two characters.

eg. user enters :

"This is a very very very very very very very very very very very very
very very very
long\n
message
on
many
lines"

we'll get :

msgstr "This is a very very very very very very very very very very very "
       "very very very verylong\n"
       "messageonmanylines"

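The behaviour in the example above can be sketched roughly as follows.
This is a guess at the logic, not the actual backconverter: the function
names and the default wrap width are made up for illustration, and
escaping of quote characters inside the message is left out.

```python
import re


def wrap_keep_spaces(text: str, width: int) -> list[str]:
    """Greedy word wrap that keeps every space, so joining the lines
    gives back exactly the original text."""
    lines, current = [], ""
    for token in re.findall(r"\S+\s*|\s+", text):
        if current and len(current) + len(token) > width:
            lines.append(current)
            current = token
        else:
            current += token
    if current:
        lines.append(current)
    return lines


def backconvert(target: str, width: int = 60) -> str:
    """Pretty-print a translated string as a PO msgstr entry."""
    # Real newline characters typed in the editor are ignored.
    flattened = target.replace("\n", "")
    # A literal backslash-n sequence ends the current PO string line.
    pieces = flattened.split("\\n")
    chunks = [piece + "\\n" for piece in pieces[:-1]]
    if pieces[-1]:
        chunks.append(pieces[-1])
    lines = [ln for chunk in chunks for ln in wrap_keep_spaces(chunk, width)]
    if not lines:
        lines = [""]
    return "msgstr " + "\n       ".join(f'"{ln}"' for ln in lines)


typed = "a very very\nlong\\n\nmessage\non\nmany\nlines"
print(backconvert(typed))
```

Only the two-character "\n" sequence survives into the msgstr; the real
newlines the user typed vanish, which is why the words run together.
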
I suspect that what we could do is have the editor check the
format-type of the XLIFF file being edited and translate '\n'
characters into "\n" strings, but we'd need to give the user some
visual feedback so that they can tell what mode the editor is
in (convert '\n' to "\n" or not)
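
A minimal sketch of that conversion mode, assuming the editor can read
some format-type string from the XLIFF file. The function name and the
set of format names below are hypothetical, not taken from the editor:

```python
# Hypothetical format-type values for which a typed newline should be
# stored as the two-character backslash-n sequence.
ESCAPING_FORMATS = {"po", "properties"}


def normalise_typed_text(typed: str, format_type: str) -> str:
    """Convert real newlines to a literal backslash-n for formats that
    use that escape; leave the text untouched for everything else."""
    if format_type.lower() in ESCAPING_FORMATS:
        return typed.replace("\r\n", "\n").replace("\n", "\\n")
    return typed


print(normalise_typed_text("first line\nsecond line", "po"))
```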

I'm not sure that's an ideal state (from a translator usability
perspective) - any thoughts ?

> What if you have a msgid like:
> msgid ""
> "Here are the options: \n"
> "            -V  displays version information\n"
> "            -X  extracts magic information\n"
> 
> Am I right to assume this would become something like this in your filter 
> (Ignoring the fact that your segmenter would do magic here first):

(actually, we don't do segmentation in software message file formats at
the moment - the thinking being that each software message is a
particular unit of text. This doesn't always work unfortunately,
especially when developers start putting chunks of html-formatted text
in message entries (which pisses me off no end). It's hard to know how
best to manage this, esp. since Solaris .po and many other formats don't
give you a clue as to what type of formatting the message strings
actually contain)

> <source>Here are the options :\n            -V  displays version information\n            
> -X  extracts magic information\n</source>
> 
> ...and then dynamically word-wrapped in your editor? 

Yes, that's the theory, though I'm not sure the editor takes advantage
of the pretty printing functionality (we're definitely using it in our
xliff backconverter though)
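
For the msgid quoted above, the step from PO lines to a single source
string could look something like the sketch below. The function name is
hypothetical, and a real PO parser has more cases to handle (comments,
plural forms, escaped quotes):

```python
def join_po_strings(lines: list[str]) -> str:
    """Concatenate the quoted fragments of a multi-line msgid,
    keeping each backslash-n as two literal characters."""
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith("msgid"):
            line = line[len("msgid"):].strip()
        if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            parts.append(line[1:-1])
    return "".join(parts)


entry = [
    'msgid ""',
    '"Here are the options: \\n"',
    '"            -V  displays version information\\n"',
    '"            -X  extracts magic information\\n"',
]
print(join_po_strings(entry))
```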

> > If you try to 2nd guess how the string is ultimately rendered, where do
> > you stop ? (Does your XLIFF converter need to understand what <b> and
> > <i> tags mean in HTML ?) 
> 
> There is a difference between markup and escaped characters. Newlines are 
> escaped characters, and if the target format supports the native characters, 
> why not use them? The only reason they are presented as characters is because 
> PO use the newline character for something else (and similarly in source 
> code). 

Okay, I'd argue that "\n" is actually a piece of markup as well - I
understand what you're saying, but maybe extend your argument to the
treatment of "<br>" markup in html files : do you wrap that in <g> xliff
elements, or convert it to carriage returns ?

I suppose it doesn't matter which you choose, just so long as you're
consistent across all XLIFF filter implementations. That is, once it's
consistent in the XLIFF file itself, the choices the editor makes are
cosmetic/usability related. What does the spec say about this sort of
thing ? 

We're currently wrapping "\n" in <it> tags (we'll switch to <g> as soon
as we move to XLIFF 1.1), which seems to me to give the best of all
worlds : by marking up those escape sequences, we give fuzzy match
algorithms the ability to do the right thing with the formatting
elements, while ensuring that "\n" and '\n' are treated differently.

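A rough sketch of that markup step, wrapping each two-character "\n" in
an <it> element. The function name is made up and the id and pos
attribute values are illustrative assumptions, not what the filter
actually emits; a backslash that is itself escaped is not handled:

```python
import re
from xml.sax.saxutils import escape


def mark_up_newlines(po_text: str) -> str:
    """Wrap every literal backslash-n in an <it> element and
    XML-escape the text in between."""
    out = []
    next_id = 1
    # The capturing group keeps the matched sequences in the result.
    for part in re.split(r"(\\n)", po_text):
        if part == "\\n":
            out.append(f'<it id="{next_id}" pos="open">\\n</it>')
            next_id += 1
        else:
            out.append(escape(part))
    return "".join(out)


print(mark_up_newlines("Options: \\n -V <x>\\n"))
```
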
> It all comes back to the same question I guess: Should the representation 
> guides provide a common recommendation for handling newline characters in 
> software resource formats like PO and .properties?

:-)

> What we really don't want 
> to see is one open source editor displaying '\n' characters and another using 
> newlines - that would really confuse translators!

Well, perhaps - as I said, I'd be more in favour of having the xliff
file itself be consistent (as a first step) and then let editors choose
how to implement that - though again, having consistent editors would
also be a good thing. I'd just be careful to distinguish between how the
string is ultimately rendered and what it looks like in the message
file. My vote would be with keeping the "\n" strings.

	cheers,
			tim

-- 
Tim Foster - Tools Engineer, Software Globalisation
http://sunweb.ireland/~timf http://blogs.sun.com/timf
http://www.netsoc.ucd.ie/~timf