optimising OUString for space

Michael Meeks michael.meeks at suse.com
Mon Oct 1 04:25:04 PDT 2012


On Mon, 2012-10-01 at 13:02 +0200, Noel Grandin wrote:
> That was something I was thinking about the other day - given that the 
> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile 
> optimisation to store a bit that says "this string is 7-bit ASCII", and 
> then store the string as a sequence of bytes.

	Optimisation ? :-) IMHO the ideal is to store all strings as UTF-8
under the hood anyway. Everyone I've discussed this with who objected
turned out (after some discussion) to have a weak understanding of
UTF-8, UTF-16 and of rendering complex text ;-) Of course, perhaps I
should discuss it with more people.

	The only problem with a change there is our ABI, which explicitly
exposes the string's encoding as UTF-16 code units.

> The latest Java VM does this trick internally - it pretends that String 
> is stored with an array of 16-bit values, but actually it stores them as 
> UTF-8.

	Interesting - for all strings ? is there a pointer to the code / docs
for that detail somewhere ? :-) Last I looked Java also stored partial
strings chained to its parent; so 'substring' takes a reference on the
parent (be it ever so large), and can return a single character string
out of it without re-allocation. IIRC that can cause huge grief when
parsing big files into little ones ;-)
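
	To make the retention problem concrete, here is a minimal C++ sketch
of a substring that shares its parent's buffer. SharedSubstring and its
members are hypothetical names made up for illustration; this is neither
Java's actual implementation nor anything in our tree.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type: a substring that holds a reference on its parent's
// buffer instead of copying the characters it covers.
class SharedSubstring {
public:
    SharedSubstring(std::shared_ptr<const std::u16string> parent,
                    std::size_t start, std::size_t length)
        : parent_(std::move(parent)), start_(start), length_(length) {
        assert(start_ + length_ <= parent_->size());
    }

    // Number of UTF-16 code units this view covers.
    std::size_t length() const { return length_; }

    // Payload bytes kept alive by this object, however short the view is.
    std::size_t retainedBytes() const {
        return parent_->size() * sizeof(char16_t);
    }

    // Copying the slice out is what breaks the link to the parent.
    std::u16string copy() const { return parent_->substr(start_, length_); }

private:
    std::shared_ptr<const std::u16string> parent_;
    std::size_t start_;
    std::size_t length_;
};
```

	A one-character SharedSubstring taken from a megabyte-sized parent
keeps the whole parent allocated until the view itself goes away, which
is the grief described above; calling copy() and dropping the view
trades one small allocation for releasing the large one.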

> Even in an app running in a language other than US-English, strings are 
> used for so many internal things that >90% of the strings are 7-bit ASCII.

	Sure - so define the define, see what it prints, and do the quick
calculation of how much time/space we save by doing it :-)
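
	As a rough sketch of that quick calculation, assuming the strings can
be collected as UTF-16 values: StringStats, isAscii7 and measure are
hypothetical names for illustration, not existing code, and the tally
counts payload bytes only, ignoring per-string header overhead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tally of what a "this string is 7-bit ASCII" flag would save.
struct StringStats {
    std::size_t total = 0;        // strings seen
    std::size_t ascii = 0;        // strings whose code units are all < 0x80
    std::size_t utf16Bytes = 0;   // payload today: 2 bytes per code unit
    std::size_t compactBytes = 0; // payload if ASCII strings used 1 byte each
};

// True if every UTF-16 code unit fits in 7 bits.
bool isAscii7(const std::u16string& s) {
    for (char16_t c : s) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}

// Walk the sample once and accumulate both storage costs.
StringStats measure(const std::vector<std::u16string>& strings) {
    StringStats st;
    for (const std::u16string& s : strings) {
        ++st.total;
        st.utf16Bytes += 2 * s.size();
        if (isAscii7(s)) {
            ++st.ascii;
            st.compactBytes += s.size();
        } else {
            st.compactBytes += 2 * s.size();
        }
    }
    return st;
}
```

	Comparing ascii against total checks the ">90%" claim, and the gap
between utf16Bytes and compactBytes is the space at stake; the time side
would still need real profiling.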

	Then again - last I looked we still had some real dumbness that needed
hunting down relating to many (tens of?) thousands of allocations and
frees of the "/" string at startup ;-)

	ATB,

		Michael.

-- 
michael.meeks at suse.com  <><, Pseudo Engineer, itinerant idiot
