[Libreoffice] [Crazy Ideas] Discuss

Mon Nov 29 16:34:31 PST 2010

On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:
>
>...
> However, looking at textsearch.cxx in Open Grok --
> http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
> --  can see this comment before the various types of calls to a search
> routine:
> // use transliteration here, but only if not RegEx, which does it different
>
> One can also see other exclusion of the regexp search algorithm from the
> transliteration search prep and search result code in textsearch.cxx around
> the calls to the search routines, but I'm not absolutely sure that exclusion
> is complete.  If the regexp search truly *never* uses transliteration then
> the swap out will be simpler and the change-over may actually enable
> transliteration.  I haven't looked at the internal code of the regexp -
> perhaps it 'does its own thing' internally for transliteration...

Right. I have only a vague idea what "transliteration" means here. From 
a web search I can see that it must be an attempt to deal with things 
like accented characters (Is "a" the same as "ä", or not? Is "ss" the 
same as "ß"?), but I couldn't find any clear description of exactly what 
the transliteration was doing.

There is a letter-case filter applied to the text before a regex search, 
changing all characters to one single case, lower case for English text. 
If the user indicates that case is significant, the filter is not applied.
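
To make that concrete, here is a minimal sketch of the filter-then-search idea. It is not the OOo code: it uses std::regex as a stand-in engine, handles ASCII only, and the names fold_case and find_all are hypothetical. It also assumes the pattern is already written in lower case when case is ignored, since blindly lower-casing a pattern would change escapes such as \S.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the letter-case filter: lower-cases ASCII only.
std::string fold_case(const std::string& s) {
    std::string out(s);
    for (char& c : out)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return out;
}

// Search text[first, last) and return (start, end) offsets into `text`.
// When case_sensitive is false the text is folded first; the pattern is
// assumed to be lower case already (an assumption of this sketch).
std::vector<std::pair<std::size_t, std::size_t>>
find_all(const std::string& text, const std::string& pattern,
         std::size_t first, std::size_t last, bool case_sensitive) {
    assert(first <= last && last <= text.size());
    const std::string haystack = case_sensitive ? text : fold_case(text);
    const std::regex re(pattern);

    std::vector<std::pair<std::size_t, std::size_t>> matches;
    const auto begin = haystack.begin() + static_cast<std::ptrdiff_t>(first);
    const auto end = haystack.begin() + static_cast<std::ptrdiff_t>(last);
    for (std::sregex_iterator it(begin, end, re), stop; it != stop; ++it) {
        // position() is relative to `begin`, so shift it back by `first`.
        const std::size_t start = first + static_cast<std::size_t>(it->position());
        matches.emplace_back(start, start + static_cast<std::size_t>(it->length()));
    }
    return matches;
}
```

Because ASCII case folding never changes the length of the text, the offsets found in the folded copy are valid in the original buffer without any adjustment.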

The actual searches get a text buffer and a pair of indices (first, 
last) indicating the region to search. The results are returned as a 
list of matches, also with indices into the text buffer. The code does a 
lot of adjusting of the indices, I suppose to account for 
character-level changes due to the transliteration, but again, I can't 
really tell what the adjustment code is supposed to do.
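
My guess at why such adjustment would be needed, as a sketch only: if a transliteration changes the length of the text (say "ß" becomes "ss"), then offsets found in the transliterated copy no longer line up with the original buffer, and something has to map them back. The code below is an assumption about the general technique, not a description of what textsearch.cxx does; Folded, fold, Range and find_in_original are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr char32_t kSharpS = 0x00DF;  // U+00DF LATIN SMALL LETTER SHARP S

struct Folded {
    std::u32string text;              // text after folding
    std::vector<std::size_t> origin;  // origin[i]: original index of folded char i
};

// Toy transliteration: expand sharp s to "ss" and record where each
// output character came from.
Folded fold(const std::u32string& in) {
    Folded f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kSharpS) {
            f.text += U"ss";
            f.origin.push_back(i);
            f.origin.push_back(i);
        } else {
            f.text.push_back(in[i]);
            f.origin.push_back(i);
        }
    }
    return f;
}

struct Range { std::size_t start, end; };  // half-open, in original indices

// Search the folded text, then translate the hit back to original offsets.
bool find_in_original(const std::u32string& original,
                      const std::u32string& folded_needle, Range& out) {
    if (folded_needle.empty()) return false;
    const Folded f = fold(original);
    assert(f.origin.size() == f.text.size());
    const std::size_t pos = f.text.find(folded_needle);
    if (pos == std::u32string::npos) return false;
    const std::size_t last = pos + folded_needle.size() - 1;
    out.start = f.origin[pos];
    out.end = f.origin[last] + 1;  // one past the last original char touched
    return true;
}
```

Searching "Straße ist" for "ist" finds it at offset 8 in the folded text but at offset 7 in the original, which is the kind of shift the index bookkeeping would have to undo.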

I was also having a lot of trouble learning anything from running OOo 
under gdb. It was acting weird, and I couldn't step through the code or 
poke around. I ended up doing it the slow way: add a printf, rebuild, 
run, rinse, repeat. No fun; less progress.

My thought was maybe to just avoid all that and start out with an 
extension testbed that uses the Boost regexp. I'm sure I can get access 
to paragraphs of text without any transliteration or filtering, and see 
how well the Boost functions work. If that goes well, then move on to 
replacing code.

I think Boost looks like the way to go, since it has a lot of 
functionality, supports Unicode (16- or 32-bit chars), and OOo already 
uses it.

Performance could be a problem. I saw a comment in the code somewhere 
saying that performance is critical for some spreadsheets--I assume 
because Calc's lookups default to using regular expression matching.

As far as I can see, that's a faulty design: the lookups should not use 
regexp matching unless it is specifically requested. It may be too late 
to change that now, though.

I've seen benchmarks indicating that the Boost regexp is fairly fast 
compared to other regexp engines, but I'm guessing that it's still 
slower than the current primitive engine.
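
Rather than guess, it would be easy enough to measure. A rough harness along the following lines could time a plain substring scan against a regex engine on the same text. This is only a sketch: std::regex stands in for whichever engine is under test, the literal scan stands in for the simplest possible baseline (not the current OOo engine), and time_search, count_literal and count_regex are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

struct Timing {
    std::size_t matches;   // match count from the last repetition
    double milliseconds;   // wall-clock time for all repetitions
};

// Run count_matches() `repetitions` times and report the elapsed time.
template <typename Fn>
Timing time_search(Fn count_matches, int repetitions) {
    assert(repetitions > 0);
    const auto start = std::chrono::steady_clock::now();
    std::size_t matches = 0;
    for (int i = 0; i < repetitions; ++i) matches = count_matches();
    const auto stop = std::chrono::steady_clock::now();
    return {matches,
            std::chrono::duration<double, std::milli>(stop - start).count()};
}

// Baseline: count non-overlapping occurrences of a literal string.
std::size_t count_literal(const std::string& text, const std::string& needle) {
    assert(!needle.empty());
    std::size_t n = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size()))
        ++n;
    return n;
}

// Engine under test: count regex matches over the whole text.
std::size_t count_regex(const std::string& text, const std::regex& re) {
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), re),
        std::sregex_iterator()));
}
```

Both counters should report the same number of matches for a literal pattern, which gives a cheap sanity check before comparing the timings. The numbers only mean something on text that resembles real documents and spreadsheets.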

<Joe