optimising OUString for space

Tue Oct 2 07:35:15 PDT 2012

On Mon, Oct 01, 2012 at 01:58:24PM +0200, Michael Stahl wrote:
> On 01/10/12 13:25, Michael Meeks wrote:
>> On Mon, 2012-10-01 at 13:02 +0200, Noel Grandin wrote:

>>> That was something I was thinking about the other day - given that
>>> the bulk of our strings are pure 7-bit ASCII, it might be a
>>> worthwhile optimisation to store a bit that says "this string is
>>> 7-bit ASCII", and then store the string as a sequence of bytes.

>> 	Optimisation ? :-) IMHO the ideal is to store all strings as UTF-8
>> underneath the hatches anyway.

>> 	The only problem with a change there is our ABI - which explicitly
>> exposes the encoding of that.

> of course this would only affect C++ binding (and possibly Python -- am
> not up to date how that does Unicode; there are differences between 2
> and 3 iirc; of course we should migrate to Python 3 as well...)

How the Python 2 and Python 3.2 C ABIs deal with strings is ... a
compile-time option!  The internal representation can be UCS2 or UCS4,
and the actual type (Py_UNICODE) can be a typedef for wchar_t,
unsigned short or unsigned long.
http://docs.python.org/c-api/unicode.html
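
As a rough illustration, the build variant can be detected from Python
itself via sys.maxunicode. The helper name unicode_build_kind is made
up for this sketch; only sys.maxunicode is a real interpreter
attribute.

```python
import sys

def unicode_build_kind(maxunicode=sys.maxunicode):
    """Classify an interpreter by its largest representable code point.

    Hypothetical helper for illustration; only sys.maxunicode is real.
    """
    if maxunicode == 0xFFFF:
        # Narrow build: Py_UNICODE is 16 bits wide.
        return "narrow (UCS2)"
    if maxunicode == 0x10FFFF:
        # Wide build, or any 3.3+ interpreter, which always reports this.
        return "wide (UCS4) or 3.3+ flexible"
    return "unknown"

print(unicode_build_kind())
```

On 3.3 and later the value is always 0x10FFFF, so this check only
distinguishes the older narrow and wide builds from each other.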

Python 3.3 and later, on the other hand, switch between ASCII, UCS1,
UCS2 and UCS4 on the fly, depending on the contents of each particular
string.
http://docs.python.org/py3k/c-api/unicode.html
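
A minimal sketch of that selection rule, assuming the choice is driven
only by the highest code point in the string. The function name
storage_kind is made up, and this mimics the idea rather than
CPython's actual implementation.

```python
def storage_kind(s):
    """Return (name, bytes per character) of the narrowest form that fits s.

    Hypothetical helper; the thresholds are the code point limits of
    ASCII, UCS1 (Latin-1), UCS2 and UCS4.
    """
    highest = max(map(ord, s), default=0)  # empty string counts as ASCII
    if highest < 0x80:
        return ("ASCII", 1)
    if highest < 0x100:
        return ("UCS1", 1)
    if highest < 0x10000:
        return ("UCS2", 2)
    return ("UCS4", 4)

for sample in ("OUString", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(sample), storage_kind(sample))
```

The point for the 7-bit ASCII idea above is that a pure-ASCII string
costs one byte per character, and only strings that actually contain
wider characters pay for them.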

-- 
Lionel