optimising OUString for space

Mon Oct 1 04:47:59 PDT 2012

On 01/10/12 13:02, Noel Grandin wrote:
> 
> On 2012-10-01 12:38, Michael Meeks wrote:
>> We could do some magic there; of course - space is a bit of an issue - 
>> we already pointlessly bloat bazillions of ascii strings into UCS-2 
>> (nominally UTF-16) representations and nail a ref-count and length on 
>> the beginning. If you turn on the lifecycle diagnostics in 
>> sal/rtl/source/strimp.hxx with the #ifdef and re-build sal, you can 
>> start to see the scale of the problem when you launch libreoffice ;-)
> 
> Changing subject because I'm changing the topic.
> 
> That was something I was thinking about the other day - given that the 
> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile 
> optimisation to store a bit that says "this string is 7-bit ASCII", and 
> then store the string as a sequence of bytes.
> 
> The latest Java VM does this trick internally - it pretends that String 
> is stored with an array of 16-bit values, but actually it stores them as 
> UTF-8.

it does that?  impressive that they could dig their way out of the
utf-16 hole... but whatever they are doing won't be possible with our
OUStrings that directly expose the internal sal_Unicode array.

> Even in an app running in a language other than US-English, strings are 
> used for so many internal things that >90% of the strings are 7-bit ASCII.

space overhead is one problem with UTF-16 strings, but there are other
problems as well: they are very error-prone to use in an application
like LO that really must be 100% i18n-able.  with UTF-16 it's all too
easy to write loops over the 16-bit code units without taking into
account that some Unicode code points are represented by not one but
two UTF-16 code units.  this leads to real i18n bugs that are very
difficult to detect because they only happen with rather obscure
languages; i.e. UTF-16 manages to combine the size overhead of UCS-4
and the variable length of UTF-8 into the worst of both worlds.
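
to illustrate the kind of bug meant here, a minimal sketch that uses
std::u16string as a stand-in for OUString; the function names
countCodeUnits and countCodePoints are made up for this example and are
not part of sal:

```cpp
#include <cstddef>
#include <string>

// Naive count: treats every 16-bit code unit as one character.
std::size_t countCodeUnits(const std::u16string& s)
{
    return s.size();
}

// Surrogate-aware count: a high surrogate (0xD800..0xDBFF) followed by a
// low surrogate (0xDC00..0xDFFF) encodes a single code point.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
            ++i; // the low surrogate belongs to the same code point
        }
        ++n;
    }
    return n;
}
```

for pure BMP text the two functions agree, which is why the naive loop
looks fine in testing; they only diverge once a character outside the
BMP (such as a Cuneiform sign) shows up.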

with a UTF-8 string these i18n bugs would be very easy to detect since
they happen in pretty much every non-English language; you don't need to
be able to write Cuneiform to see the problem.  iteration should be done
with a dedicated method that returns the next code point as an int32_t.
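
such an iteration method might look roughly like the following sketch.
nextCodePoint is a hypothetical free function operating on a std::string
that is assumed to hold UTF-8; it is not an existing sal API, and it
deliberately skips some validation (overlong forms, encoded surrogates):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes the code point that starts at byte offset
// `pos` and advances `pos` past it.  Precondition: pos < s.size().
// Returns -1 on a malformed or truncated sequence and then advances by a
// single byte, so a calling loop always makes progress.
// Not checked here: overlong encodings and encoded surrogates.
std::int32_t nextCodePoint(const std::string& s, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra = 0;   // number of continuation bytes expected
    std::int32_t cp = 0;

    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else { ++pos; return -1; }      // stray continuation or invalid lead

    if (pos + extra >= s.size()) { ++pos; return -1; } // truncated

    for (std::size_t i = 1; i <= extra; ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) { ++pos; return -1; }
        cp = (cp << 6) | (b & 0x3F);
    }
    pos += extra + 1;
    return cp;
}
```

a caller would loop with "while (pos < s.size())" and never index
individual bytes; code that indexes bytes anyway already breaks on
something as common as an e-acute, which takes two bytes.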

also a UTF-8 string could be really constant: just write an ordinary
string literal in C++ and wrap a value class around it, no memory
allocation needed.
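
as a rough sketch of what such a wrapper could look like, assuming
C++11 constexpr is available; Utf8Literal and its members are
hypothetical names, not an existing class:

```cpp
#include <cstddef>

// Hypothetical value class: a non-owning view of a string literal.
// It stores only a pointer and a length, so constructing it neither
// allocates nor copies.  Nothing here verifies that the bytes are
// valid UTF-8, and the constructor accepts any char array, not only
// literals; the caller must make sure the array outlives the view.
class Utf8Literal
{
public:
    // N includes the terminating NUL, so the length is N - 1.
    template <std::size_t N>
    constexpr Utf8Literal(const char (&literal)[N])
        : m_data(literal), m_length(N - 1) {}

    constexpr const char* data() const { return m_data; }
    constexpr std::size_t length() const { return m_length; }

private:
    const char* m_data;
    std::size_t m_length;
};
```

usage would be along the lines of Utf8Literal s("hello"); the length is
deduced from the array type at compile time, so there is no strlen call
at run time either.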

... which brings me to another point: in a hypothetical future when we
could efficiently create a UTF8String from a string literal in C++
without copying the darn thing, what should hypothetical operations to
mutate the string's buffer do?