[Libreoffice] [Crazy Ideas] Discuss
jes at martnet.com
Mon Nov 29 16:34:31 PST 2010
On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:
> However, looking at textsearch.cxx in OpenGrok, one can see this
> comment before the various types of calls to a search:
> // use transliteration here, but only if not RegEx, which does it different
> One can also see other exclusion of the regexp search algorithm from the
> transliteration search prep and search result code in textsearch.cxx around
> the calls to the search routines, but I'm not absolutely sure that exclusion
> is complete. If the regexp search truly *never* uses transliteration then
> the swap out will be simpler and the change-over may actually enable
> transliteration. I haven't looked at the internal code of the regexp -
> perhaps it 'does its own thing' internally for transliteration...
Right. I have only a vague idea what "transliteration" means here. From
a web search I can see that it must be an attempt to deal with things
like accented characters (Is "a" the same as "ä", or not? Is "ss" the
same as "ß"?), but I couldn't find any clear description of exactly what
the transliteration was doing.
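To make the question concrete, here is a toy sketch of what such an equivalence could look like. It is not the OOo transliteration code: the function names `transliterate` and `loosely_equal` are made up for illustration, and the two mappings (ä to a, ß to ss) are just the examples from the paragraph above, not a claim about which rules OOo actually applies.

```cpp
#include <string>

// Toy "transliteration": rewrite a few characters into a plainer form so
// that loosely equivalent spellings compare equal. Only two hypothetical
// rules are modelled; every other code unit is copied unchanged.
std::u16string transliterate(const std::u16string& in) {
    std::u16string out;
    for (char16_t c : in) {
        switch (c) {
            case u'\u00E4': out += u'a';  break;  // a-umlaut -> a
            case u'\u00DF': out += u"ss"; break;  // sharp s  -> ss
            default:        out += c;     break;
        }
    }
    return out;
}

// Two strings are "loosely equal" if they match after transliteration.
bool loosely_equal(const std::u16string& a, const std::u16string& b) {
    return transliterate(a) == transliterate(b);
}
```

Note that the second rule changes the length of the string, which matters as soon as match positions have to be reported against the original text.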
There is a letter-case filter applied to the text before a regex search,
changing all characters to one single case, lower case for English text.
If the user indicates that case is significant, the filter is not applied.
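As a rough illustration of that filter-then-search arrangement, here is a minimal sketch. It uses `std::regex` as a stand-in for the OOo engine, handles ASCII only, and assumes the pattern itself is already written in lower case; `to_lower_ascii` and `regex_found` are hypothetical names, not functions from textsearch.cxx.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Lower-case every byte of an ASCII string; a stand-in for the real filter.
std::string to_lower_ascii(std::string text) {
    for (char& c : text) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return text;
}

// Search `text` for `pattern`. Unless the caller says case is significant,
// the text is folded to lower case first. The pattern is NOT folded here;
// it is assumed to be lower case already when case_sensitive is false.
bool regex_found(const std::string& text, const std::string& pattern,
                 bool case_sensitive) {
    const std::regex re(pattern);
    const std::string haystack = case_sensitive ? text : to_lower_ascii(text);
    return std::regex_search(haystack, re);
}
```

With `std::regex` one could instead pass `std::regex::icase` and leave the text alone; the sketch folds the text only because that is the arrangement described above.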
The actual searches get a text buffer and a pair of indices (first,
last) indicating the region to search. The results are returned as a
list of matches, also with indices into the text buffer. The code does a
lot of adjusting of the indices, I suppose to account for
character-level changes due to the transliteration, but again, I can't
really tell what the adjustment code is supposed to do.
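Here is one guess at why such adjustments might be needed, as a sketch only. If a filter changes the length of the text (say ß becomes "ss"), then match positions found in the filtered copy no longer line up with the original buffer and have to be mapped back. Everything below is hypothetical: the names `Filtered`, `filter_region` and `find_in_region`, the single ß rule, and the half-open (first, last) convention are my assumptions. A plain substring search stands in for the regex engine, because `std::regex` has no `char16_t` support.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, last)

// Filtered copy of a region plus, for every filtered code unit, the index
// of the original code unit it was produced from.
struct Filtered {
    std::u16string text;
    std::vector<std::size_t> origin;
};

// Toy filter over original[first, last): expands U+00DF (sharp s) to "ss".
Filtered filter_region(const std::u16string& original,
                       std::size_t first, std::size_t last) {
    assert(first <= last && last <= original.size());
    Filtered out;
    for (std::size_t i = first; i < last; ++i) {
        if (original[i] == u'\u00DF') {
            out.text += u"ss";
            out.origin.push_back(i);  // both 's' units come from index i
            out.origin.push_back(i);
        } else {
            out.text += original[i];
            out.origin.push_back(i);
        }
    }
    return out;
}

// Search the filtered region, then report each match as indices into the
// ORIGINAL buffer by mapping both ends back through `origin`.
std::vector<Range> find_in_region(const std::u16string& original,
                                  std::size_t first, std::size_t last,
                                  const std::u16string& needle) {
    assert(!needle.empty());
    const Filtered f = filter_region(original, first, last);
    std::vector<Range> hits;
    for (std::size_t pos = f.text.find(needle);
         pos != std::u16string::npos;
         pos = f.text.find(needle, pos + 1)) {
        const std::size_t end = pos + needle.size();  // one past the match
        hits.emplace_back(f.origin[pos], f.origin[end - 1] + 1);
    }
    return hits;
}
```

Searching "Die Straße" for "asse" finds the match at filtered positions that span four code units, but the range handed back covers only the three original code units "aße".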
I was also having a lot of trouble learning anything from running OOo
under gdb. Gdb was acting weird and I couldn't step through the code and
poke around. I ended up doing it the hard way: add a printf, rebuild,
run, rinse, repeat. No fun; less progress.
My thought was maybe to just avoid all that and start out with an
extension testbed that uses the Boost regexp. I'm sure I can get access
to paragraphs of text without any transliteration or filtering, and see
how well the Boost functions work. If that goes well, then move on to
I think Boost looks like the way to go, since it has a lot of
functionality, supports Unicode (16- or 32-bit chars), and OOo already
Performance could be a problem. I saw a comment in the code somewhere
saying that performance is critical for some spreadsheets--I assume
because Calc's lookups default to using regular expression matching.
As far as I can see, that's a faulty design: the lookups should not use
regexp matching unless it is specifically requested, but it may be too
late to change that now.
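A small sketch of why regex-by-default can surprise people. This is not Calc's lookup code: `find_cell` is a hypothetical helper, `std::regex` stands in for the real engine, and whole-cell matching is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Return the index of the first cell that matches `key`, or -1 if none.
// With use_regex the key is compiled as a regular expression and must
// match the whole cell; otherwise the key is compared literally.
int find_cell(const std::vector<std::string>& cells, const std::string& key,
              bool use_regex) {
    if (!use_regex) {
        for (std::size_t i = 0; i < cells.size(); ++i) {
            if (cells[i] == key) return static_cast<int>(i);
        }
        return -1;
    }
    const std::regex re(key);  // compiled once, reused for every cell
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (std::regex_match(cells[i], re)) return static_cast<int>(i);
    }
    return -1;
}
```

Given the cells "abc", "a.c", "aab", "a+b", a literal lookup of "a.c" lands on the second cell, while the regex lookup lands on "abc", because "." matches any character. Someone who only wanted to look up the string "a.c" gets a different answer, and pays for the regex machinery as well.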
I've seen benchmarks indicating that the Boost regexp is fairly fast
compared to other regexp engines, but I'm guessing that it's still
slower than the current primitive engine.