optimising OUString for space

Michael Stahl mstahl at redhat.com
Mon Oct 1 04:58:24 PDT 2012

On 01/10/12 13:25, Michael Meeks wrote:
> On Mon, 2012-10-01 at 13:02 +0200, Noel Grandin wrote:
>> That was something I was thinking about the other day - given that the 
>> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile 
>> optimisation to store a bit that says "this string is 7-bit ASCII", and 
>> then store the string as a sequence of bytes.
> 	Optimisation ? :-) IMHO the ideal is to store all strings as UTF-8
> underneath the hatches anyway. All the people I've discussed this with
> that objected to that, turned out (after some discussion) to have a weak
> understanding of UTF-8, UTF-16 and of rendering complex text ;-) Of
> course, perhaps I should discuss with more people.
> 	The only problem with a change there is our ABI - which explicitly
> exposes the encoding of that.
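
to make the flag idea quoted above concrete, here is a minimal sketch of
"one bit says pure 7-bit ASCII, then store bytes". everything in it is an
assumption for illustration: CompactString and its members are
hypothetical names, nothing like this exists in sal/rtl, and it ignores
the ABI problem entirely. note it is the flag variant, not the UTF-8
variant; non-ASCII strings simply stay UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical class, not part of sal/rtl: keeps pure 7-bit ASCII text as
// one byte per character and falls back to UTF-16 code units otherwise.
class CompactString {
public:
    explicit CompactString(const std::u16string& s) : ascii_(true) {
        // One pass to decide the representation.
        for (char16_t c : s) {
            if (c >= 0x80) { ascii_ = false; break; }
        }
        if (ascii_) {
            narrow_.reserve(s.size());
            for (char16_t c : s) narrow_.push_back(static_cast<std::uint8_t>(c));
        } else {
            wide_ = s;
        }
    }

    bool isAscii() const { return ascii_; }

    // Length in UTF-16 code units, so callers see the same indices either way.
    std::size_t length() const { return ascii_ ? narrow_.size() : wide_.size(); }

    // Indexing stays O(1) in both representations.
    char16_t charAt(std::size_t i) const {
        assert(i < length());
        return ascii_ ? static_cast<char16_t>(narrow_[i]) : wide_[i];
    }

    // Bytes used by the character data alone (no header, no allocator overhead).
    std::size_t payloadBytes() const {
        return ascii_ ? narrow_.size() : wide_.size() * sizeof(char16_t);
    }

private:
    bool ascii_;
    std::vector<std::uint8_t> narrow_;  // used when ascii_ is true
    std::u16string wide_;               // used when ascii_ is false
};
```

with this, CompactString(u"hello") needs 5 payload bytes instead of 10,
and callers that index by UTF-16 code unit see no difference. a single
non-ASCII character pushes the whole string back to two bytes per unit.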

the right time to do it is for LO4.  sadly nobody has signed up for that
yet :( ... (while there are volunteers for far sillier proposals, like
getting rid of com.sun.star...)

of course this would only affect the C++ binding (and possibly Python -- am
not up to date on how that handles Unicode; there are differences between 2
and 3 iirc; of course we should migrate to Python 3 as well...). the
Java binding would still use UTF-16, but i assume we have to copy strings
passed over the Java UNO bridge anyway.

>> The latest Java VM does this trick internally - it pretends that String 
>> is stored with an array of 16-bit values, but actually it stores them as 
>> UTF-8.
> 	Interesting - for all strings ? is there a pointer to the code / docs
> for that detail somewhere ? :-) Last I looked Java also stored partial

i would expect they take advantage of JVM's tendency to generate code at
runtime to some non-trivial extent :)

> strings chained to its parent; so 'substring' takes a reference on the
> parent (be it ever so large), and can return a single character string
> out of it without re-allocation. IIRC that can cause huge grief when
> parsing big files into little ones ;-)

that is a potential advantage of immutable string buffers that afaik we
don't take advantage of in LO so far.
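
for illustration, a rough sketch of the substring-shares-parent scheme
described in the quote, and of why it can hurt. SharedString and all of
its members are hypothetical; this is not how OUString or any JVM is
actually implemented, just the general technique under the assumption
of an immutable, reference-counted buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type for illustration only: an immutable string whose
// substrings share (and keep alive) the parent's buffer.
class SharedString {
public:
    explicit SharedString(std::u16string s)
        : buf_(std::make_shared<const std::u16string>(std::move(s))),
          off_(0), len_(buf_->size()) {}

    // No allocation and no copy: just a new (offset, length) window
    // plus one more reference on the parent buffer.
    SharedString substring(std::size_t begin, std::size_t count) const {
        assert(begin <= len_ && count <= len_ - begin);
        return SharedString(buf_, off_ + begin, count);
    }

    // Copies the visible characters into a fresh buffer, dropping the
    // reference that would otherwise pin the (possibly huge) parent.
    SharedString compact() const { return SharedString(toString()); }

    std::u16string toString() const { return buf_->substr(off_, len_); }
    std::size_t length() const { return len_; }

    // Size of the underlying buffer this string keeps alive.
    std::size_t bufferSize() const { return buf_->size(); }

private:
    SharedString(std::shared_ptr<const std::u16string> b,
                 std::size_t off, std::size_t len)
        : buf_(std::move(b)), off_(off), len_(len) {}

    std::shared_ptr<const std::u16string> buf_;
    std::size_t off_;
    std::size_t len_;
};
```

the sharing is only safe because the buffer is never modified after
construction. the cost is the one mentioned in the quote: a one-character
substring of a 1000-unit string still holds all 1000 units until it is
explicitly copied out, which is what compact() does here.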
