[AppStream] The AppStream XML locale apocalypse
Matthias Klumpp
matthias at tenstral.net
Thu Apr 6 16:15:41 UTC 2023
Hi!
While investigating why zh_TW/zh_CN translations were not showing up
in AppStream-based software centers, we found out that this was due to
the locale being listed as "zh-TW" in the XML. This was first noticed
at KDE, which was going to edit their tools to use POSIX locale in the
XML instead[1].
I looked into the matter and found out that in XML, the contents of
the xml:lang attribute are neither arbitrary nor UNIX locale, but
need to follow the IETF BCP47 specification[2].
So by assuming POSIX locale, AppStream was doing the wrong thing for a
really long time!
We didn't notice this until now for two reasons: One, AppStream's
locale matching is quite good and will fall back to the bare language
code if a dedicated translation wasn't found, which coincidentally is
where POSIX and BCP47 locale are pretty much identical. So the issue
wasn't noticeable unless your language depends on the territory
specifier (as with zh_TW and zh_CN).
Secondly though, GNOME's tooling is also wrong in many cases and seems
to be using POSIX locale while BCP47 locale should be used.
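To illustrate the first reason, here is a minimal Python sketch of such a fallback lookup. It is not AppStream's actual matching code; the function name pick_translation and the sample data are made up for this example, and "C" is assumed to be the key of the untranslated string.

```python
def pick_translation(translations: dict, wanted: str) -> str:
    """Return the best string for a POSIX locale such as 'de_DE'."""
    # Try the full locale first, then the bare language, then untranslated.
    language = wanted.split("_", 1)[0]
    for candidate in (wanted, language, "C"):
        if candidate in translations:
            return translations[candidate]
    raise KeyError("no untranslated fallback present")


# Hypothetical data as it might be read from xml:lang attributes.
translations = {"C": "Files", "de": "Dateien", "zh-TW": "檔案"}

german = pick_translation(translations, "de_DE")    # matches via "de"
taiwanese = pick_translation(translations, "zh_TW")  # "zh-TW" never matches
```

With this data the German lookup succeeds through the language-only fallback, while the zh_TW lookup silently ends up at the untranslated string because the stored key uses a hyphen.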
I thought quite a while about this, and I think the best worst thing
to do here is to make AppStream use BCP47 locale, with hopefully not
too much breakage.
I hate making this change: it complicates locale handling in AppStream
quite a bit, especially since BCP47 and POSIX locale can't easily be
mapped 1:1 (see [3]). But I think doing it is the sanest option.
Just ignoring the IETF specification and having POSIX locale there
seemed attractive at first, but a lot of tools that translate XML do
not handle this well and will continue to output BCP47 locale, such as
the ones used by KDE and itstool. So even if KDE switched, we
would set a trap for many other projects using alternative translation
solutions, with no easy way for projects to solve this.
Making this change will cause problems for projects which also did it
"the wrong way", but on the other hand it will fix a real bug for
projects which were using BCP47 all along.
Since there is a specification for this, I think not following
established practices for XML is a pretty bad choice.
So, I implemented a locale mapping algorithm in AppStream that
translates POSIX to BCP47 based on the same rules that itstool[4] uses
to do the same task. That will work for pretty much all cases, I hope.
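A rough Python sketch of what such a POSIX to BCP47 mapping can look like follows. This is neither the AppStream implementation nor a copy of the itstool rules; the function name posix_to_bcp47 and the small modifier tables are assumptions chosen for illustration, and special locale names such as "C" are not handled.

```python
# Illustrative tables only: a real mapping needs many more entries.
SCRIPT_MODIFIERS = {"latin": "Latn", "cyrillic": "Cyrl"}
VARIANT_MODIFIERS = {"valencia": "valencia"}


def posix_to_bcp47(locale: str) -> str:
    """Convert e.g. 'sr_RS.UTF-8@latin' to 'sr-Latn-RS'."""
    # Split off the modifier ("@latin") first, then drop the codeset (".UTF-8").
    base, _, modifier = locale.partition("@")
    base, _, _codeset = base.partition(".")
    language, _, territory = base.partition("_")
    modifier = modifier.lower()

    # BCP47 orders subtags as language-script-region-variant.
    subtags = [language.lower()]
    if modifier in SCRIPT_MODIFIERS:
        subtags.append(SCRIPT_MODIFIERS[modifier])
    if territory:
        subtags.append(territory.upper())
    if modifier in VARIANT_MODIFIERS:
        subtags.append(VARIANT_MODIFIERS[modifier])
    # Modifiers without a known counterpart (e.g. "@euro") are dropped.
    return "-".join(subtags)
```

For example, posix_to_bcp47("zh_TW") gives "zh-TW" and posix_to_bcp47("ca_ES@valencia") gives "ca-ES-valencia". The script modifier changes position relative to the territory, which is one reason the two formats can't be converted by a plain character substitution.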
AppStream will also do its best to find a good locale in case someone
did use POSIX in xml:lang, but for performance reasons I can't
implement anything that just accepts both locale styles - that would
not only be a lot of engineering work for little gain, but it would
also slow down parsing for everyone. I may adjust appstream-compose
to correct wrong locale though, which should also fix the issue
before it reaches users.
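As a sketch of what correcting wrong locale at compose time could mean, the following Python snippet rewrites POSIX-looking xml:lang values in a document. It is only an assumption about the general approach, not what appstream-compose does; the function name fix_xml_lang is hypothetical, and modifiers such as "@latin" are deliberately left untouched here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def fix_xml_lang(xml_text: str) -> str:
    """Rewrite values like 'zh_TW' or 'zh_TW.UTF-8' to 'zh-TW'."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        lang = elem.get(XML_LANG)
        if lang and "_" in lang:
            # Drop any codeset, then use the BCP47 subtag separator.
            elem.set(XML_LANG, lang.split(".", 1)[0].replace("_", "-"))
    return ET.tostring(root, encoding="unicode")


sample = (
    "<component>"
    "<name>Files</name>"
    '<name xml:lang="zh_TW">檔案</name>'
    "</component>"
)
fixed = fix_xml_lang(sample)
```

Doing this once when the catalog data is generated keeps the client-side parser simple, since it then only ever has to deal with one locale style.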
None of these changes are merged into master yet, but I intend to do
that soon, unless there are objections or feedback that I haven't
considered.
So, what do you think? It's a pretty bad situation to be in, but it
needs to be addressed somehow.
Cheers,
Matthias
P.S.: I think POSIX locale are better than BCP47: their format is just
a lot more versatile and expressive and can easily be split into
individual language/territory/modifier parts, which is less clear
for BCP47. I wish we had only one format to identify locale,
but that ship sailed in 1995/2001.
[1]: https://invent.kde.org/sysadmin/l10n-scripty/-/merge_requests/61
[2]: https://en.wikipedia.org/wiki/IETF_language_tag
[3]: https://wiki.openoffice.org/wiki/LocaleMapping
[4]: https://itstool.org/
--
I welcome VSRE emails. See http://vsre.info/