Using a real parser generator to parse numbers (and dates)
Lionel Elie Mamane
lionel at mamane.lu
Wed Feb 29 02:49:38 PST 2012
In the context of fdo#41166 (a bug in how our numbers/date parser
handles ambiguities), I thought "wouldn't it be better if we used a
proper parser generator and wrote a proper grammar instead of a
hand-coded parser"? It would be easier to debug, easier to maintain,
easier to read, ... and thus probably less buggy in the long run.
Well, I've a good "proof of concept" start on making it a
Boost::Spirit grammar. I've chosen Boost::Spirit because:
1) It is a *dynamic* parser generator, that is the grammar can be
constructed at runtime. That's a requirement for us, for i18n
reasons.
2) It is in Boost, so no new external dependency: we already require
Boost.
The patch is attached so that you can get an idea. Alas, it is
polluted by the fact I converted everything to rtl::OUString from
tools::String. It is against libreoffice-3-5 simply because that's my
current main "dogfood / development" branch these days. I'm *not*
suggesting we do the change in libreoffice-3-5, only in master!
Highlights:
1) Yes, it works on the basis of UCS4 codepoints, so no problem with
UTF-16 surrogate pairs.
2) I add a const_iterator interface to rtl::OUString. Instead of
coding it manually to use iterateCodePoints, I again reuse the
power of boost. But I have in my git history a manually-coded
version based on iterateCodePoints if for some reason it is
desirable (I don't think it is).
That's also more generally a more "C++ natural" interface to
OUStrings, so we can use it throughout our codebase instead of
iterateCodePoints.
I don't define an iterator interface, as this could be *extremely*
slow: writing one character in the middle of a string is O(n)
(linear in the length of the string) when replacing a
surrogates-encoded codepoint by a non-surrogate encoded one and
vice-versa.
3) For now, it does not replace the old parser, but just does a
parallel parse and displays the result of the parse to stderr for
demonstration / debugging purposes.
4) Yes, this is heavily template / metaprogramming based and
compilation time is significant. But it is only this one file.
5) I've stuck the LibreOffice<->Spirit integration in
svl/source/numbers/spirit/, but possibly we could move it to a more
general location (sal/rtl?) so that it can be reused in other parts
of LibreOffice.
For now, I'm using only the "teach rtl::OUString to spirit" part of
the integration, so only that one is in the patch.
But there is also a "use the character classes notions of
<unotools/charclass.hxx> instead of the spirit ones", which I
attach as separate files. A "character class" is something like
"Alpha", "AlphaNumeric", "Letter", ... In particular, the notions
in unotools are language-dependent. However, I find the coverage
rather spotty. For example, there is no "is this character a
whitespace" predicate... So we would have to complement from the
native spirit notions anyway?
I'm not sure we will need it, since the current parser basically
does not use these character classifications at all: it only uses
toUpper to make things case-insensitive.
5) The grammar itself is in struct
ImpSvNumberInputScan::number_grammar. That's where you should
look.
Let's look for example at a simple, yet internationalised example,
booleans:
const ImpSvNumberformatScan* pFS = parent.pFormatter->GetFormatScanner();
::rtl::OUString sTrue (pFS->GetTrueString());
::rtl::OUString sFalse(pFS->GetTrueString());
rBoolean = (qi::lit(sTrue) >> qi::eoi) [ ref(fOutNumber) = 1, ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]
| (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1, ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ];
A boolean is a either a lit(eral) sTrue or a literal sFalse, and
sTrue and sFalse are runtime values. For sTrue, "1" is written in
fOutNumber, for sFalse, -1 is written. In Both cases,
NUMBERFORMAT_LOGICAL is written in eScannedType. Hmm... I could
have factorised that, as in:
rBoolean = ( (qi::lit(sTrue) >> qi::eoi) [ ref(fOutNumber) = 1 ]
| (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1 ] )
eoi [ ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]
For a more complicated example, see dates:
rDate = -dow >>
( ( eps [ reset_date(pCal) ] >> rISO8601 >> eoi )
| rPlainDate )
>> eoi [ ref(parent.eScannedType) = NUMBERFORMAT_DATE,
ref(fOutNumber) = make_date(parent.pNullDate, ref(pCal)) ];
A date is an optional day-of-week (the "-" means "optional")
followed by either a ISO8601 date or a (localised) "plain date".
A plain date is:
if ( true /* DMY format */ )
rPlainDate =
( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
| ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> -rDateSep >> rYear >> eoi );
else /* MDY format */
rPlainDate =
( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> -rDateSep >> rYear >> eoi );
That is, either a day-of-month, a month and a year or only two of
those, with "day-of-month and month" taking precedence over "month
and year" so that "5/3" is fifth of march (or third of may,
depending on language), not may of year 3 (or 2003).
In turn a month is either an unsigned short integer between 1 and
number of months in a year or the name of month:
const int nMonths = pCal->getNumberOfMonthsInYear();
rMonth = ushort_ [ _pass = _1 <= nMonths,
bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 - 1 ) ]
| month [ bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 ) ];
"_pass = _1 <= nMonths" means this rule should fail if the read
unsigned short is not <= nMonths. "month" is the rule for a "name
of month as a string":
for ( int i = 0; i < nMonths; ++i)
{
month.add(parent.pUpperMonthText[i], i)(parent.pUpperAbbrevMonthText[i], i);
}
Meaning $\forall i$, pUpperMonthText[i] and
pUpperAbbrevMonthText[i] map to month number i.
6) The grammar is constructed and invoked in IsNumberFormatMain:
std::cerr << "***** DBG before parse ****** r=(undefined)\tfOutNumber=" << fOutNumber << "\teScannedType=" << eScannedType << std::endl;
std::cerr << "To parse: '" << ::rtl::OUStringToOString( rString, RTL_TEXTENCODING_UTF8 ).getStr() << "'" << std::endl;
bool r = qi::phrase_parse(rString.begin(),
rString.end(),
number_grammar<rtl::OUString::const_iterator>(*this, fOutNumber),
enc::space);
std::cerr << "***** DBG after parse ****** r=" << r << "\tfOutNumber=" << fOutNumber << "\teScannedType=" << eScannedType << std::endl;
So, this is not feature-complete yet (e.g. does not recognise
fractions, mantissa/exponent notation such as "1.52225E14", dates are
not fully i18ned), and especially the date part will probably need
some cooperation / guidance from an i18n expert :)
But I think it shows we can do that, IMO to great long-term
advantage. Do I have some buy-in from you on that?
Questions you'll ask or should ask ;-)
- Hey, this constructs a grammar for each parse. We should do that
only when the grammar (language) changes.
Yes, in ChangeIntl. TODO.
P.S.: I'm in vacation from Fri 2 march 2012 to Sun 11 march 2012.
--
Lionel
More information about the LibreOffice
mailing list