optimising OUString for space
mstahl at redhat.com
Mon Oct 1 04:58:24 PDT 2012
On 01/10/12 13:25, Michael Meeks wrote:
> On Mon, 2012-10-01 at 13:02 +0200, Noel Grandin wrote:
>> That was something I was thinking about the other day - given that the
>> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile
>> optimisation to store a bit that says "this string is 7-bit ASCII", and
>> then store the string as a sequence of bytes.
> Optimisation ? :-) IMHO the ideal is to store all strings as UTF-8
> underneath the hatches anyway. All the people I've discussed this with
> that objected to that, turned out (after some discussion) to have a weak
> understanding of UTF-8, UTF-16 and of rendering complex text ;-) Of
> course, perhaps I should discuss with more people.
> The only problem with a change there is our ABI - which explicitly
> exposes the encoding of that.
the right time to do it is for LO4. sadly nobody has signed up for that
yet :( ... (while there are volunteers for far sillier proposals, like
getting rid of com.sun.star...)
of course this would only affect the C++ binding (and possibly Python;
i am not up to date on how that handles Unicode, iirc there are
differences between 2 and 3, and of course we should migrate to
Python 3 as well...). the Java binding would still use UTF-16, but i
assume we have to copy strings passed over the Java UNO bridge anyway.
>> The latest Java VM does this trick internally - it pretends that String
>> is stored with an array of 16-bit values, but actually it stores them as
> Interesting - for all strings ? is there a pointer to the code / docs
> for that detail somewhere ? :-)
i would expect they take advantage of the JVM's tendency to generate
code at runtime to some non-trivial extent :)
> Last I looked Java also stored partial
> strings chained to its parent; so 'substring' takes a reference on the
> parent (be it ever so large), and can return a single character string
> out of it without re-allocation. IIRC that can cause huge grief when
> parsing big files into little ones ;-)
sharing the parent's buffer like that is a potential benefit of
immutable string buffers that afaik we don't exploit in LO so far.
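
A rough sketch of that substring trick, to show both the benefit and the grief: the slice shares the parent's immutable buffer, so taking it is O(1), but even a one-character slice keeps the whole parent alive. SharedSlice and its members are hypothetical names invented for this example; this is not how rtl::OUString works today.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type for illustration; not rtl::OUString or any LO class.
class SharedSlice
{
public:
    explicit SharedSlice(std::u16string text)
        : m_buf(std::make_shared<std::u16string>(std::move(text)))
        , m_off(0)
        , m_len(m_buf->size())
    {
    }

    // O(1): no characters are copied, but the whole parent buffer is
    // kept alive for as long as this slice exists.
    SharedSlice substring(std::size_t pos, std::size_t len) const
    {
        if (pos > m_len)
            pos = m_len;
        if (len > m_len - pos)
            len = m_len - pos;
        return SharedSlice(m_buf, m_off + pos, len);
    }

    // O(n): copies just the slice so the parent buffer can be released.
    SharedSlice detached() const { return SharedSlice(str()); }

    std::u16string str() const { return m_buf->substr(m_off, m_len); }
    std::size_t length() const { return m_len; }
    std::size_t bufferSize() const { return m_buf->size(); }
    long bufferUsers() const { return m_buf.use_count(); }

private:
    SharedSlice(std::shared_ptr<const std::u16string> buf,
                std::size_t off, std::size_t len)
        : m_buf(std::move(buf)), m_off(off), m_len(len)
    {
    }

    std::shared_ptr<const std::u16string> m_buf; // immutable, shared
    std::size_t m_off;
    std::size_t m_len;
};
```

Calling substring(10, 1) on a 1000-character string yields a slice of length 1 whose bufferSize() is still 1000; detached() trades a copy for letting the big buffer go.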