Using a real parser generator to parse numbers (and dates)

Wed Feb 29 02:49:38 PST 2012

In the context of fdo#41166 (a bug in how our numbers/date parser
handles ambiguities), I thought "wouldn't it be better if we used a
proper parser generator and wrote a proper grammar instead of a
hand-coded parser"? It would be easier to debug, easier to maintain,
easier to read, ... and thus probably less buggy in the long run.

Well, I've a good "proof of concept" start on making it a
Boost::Spirit grammar. I've chosen Boost::Spirit because:

1) It is a *dynamic* parser generator, that is the grammar can be
   constructed at runtime. That's a requirement for us, for i18n
   reasons.

2) It is in Boost, so no new external dependency: we already require
   Boost.

The patch is attached so that you can get an idea. Alas, it is
polluted by the fact I converted everything to rtl::OUString from
tools::String. It is against libreoffice-3-5 simply because that's my
current main "dogfood / development" branch these days. I'm *not*
suggesting we do the change in libreoffice-3-5, only in master!

Highlights:

1) Yes, it works on the basis of UCS4 codepoints, so no problem with
   UTF-16 surrogate pairs.

2) I add a const_iterator interface to rtl::OUString. Instead of
   coding it manually to use iterateCodePoints, I again reuse the
   power of boost. But I have in my git history a manually-coded
   version based on iterateCodePoints if for some reason it is
   desirable (I don't think it is).

   That's also more generally a more "C++ natural" interface to
   OUStrings, so we can use it throughout our codebase instead of
   iterateCodePoints.

   I don't define an iterator interface, as this could be *extremely*
   slow: writing one character in the middle of a string is O(n)
   (linear in the length of the string) when replacing a
   surrogates-encoded codepoint by a non-surrogate encoded one and
   vice-versa.

3) For now, it does not replace the old parser, but just does a
   parallel parse and displays the result of the parse to stderr for
   demonstration / debugging purposes.

4) Yes, this is heavily template / metaprogramming based and
   compilation time is significant. But it is only this one file.

5) I've stuck the LibreOffice<->Spirit integration in
   svl/source/numbers/spirit/, but possibly we could move it to a more
   general location (sal/rtl?) so that it can be reused in other parts
   of LibreOffice.

   For now, I'm using only the "teach rtl::OUString to spirit" part of
   the integration, so only that one is in the patch.

   But there is also a "use the character classes notions of
   <unotools/charclass.hxx> instead of the spirit ones", which I
   attach as separate files. A "character class" is something like
   "Alpha", "AlphaNumeric", "Letter", ... In particular, the notions
   in unotools are language-dependent. However, I find the coverage
   rather spotty. For example, there is no "is this character a
   whitespace" predicate... So we would have to complement from the
   native spirit notions anyway?

   I'm not sure we will need it, since the current parser basically
   does not use these character classifications at all: it only uses
   toUpper to make things case-insensitive.

5) The grammar itself is in struct
   ImpSvNumberInputScan::number_grammar. That's where you should
   look.

   Let's look for example at a simple, yet internationalised example,
   booleans:

   const ImpSvNumberformatScan* pFS = parent.pFormatter->GetFormatScanner();

   ::rtl::OUString sTrue (pFS->GetTrueString());
   ::rtl::OUString sFalse(pFS->GetTrueString());
   rBoolean =   (qi::lit(sTrue)  >> qi::eoi) [ ref(fOutNumber) =  1, ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]
              | (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1, ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ];

   A boolean is a either a lit(eral) sTrue or a literal sFalse, and
   sTrue and sFalse are runtime values. For sTrue, "1" is written in
   fOutNumber, for sFalse, -1 is written. In Both cases,
   NUMBERFORMAT_LOGICAL is written in eScannedType. Hmm... I could
   have factorised that, as in:

   rBoolean = (   (qi::lit(sTrue)  >> qi::eoi) [ ref(fOutNumber) =  1 ]
                | (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1 ] )
              eoi [ ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]

   For a more complicated example, see dates:

   rDate = -dow >>
       (   ( eps [ reset_date(pCal) ] >> rISO8601 >> eoi )
         | rPlainDate )
        >> eoi [ ref(parent.eScannedType) = NUMBERFORMAT_DATE,
                 ref(fOutNumber) =  make_date(parent.pNullDate, ref(pCal)) ];

   A date is an optional day-of-week (the "-" means "optional")
   followed by either a ISO8601 date or a (localised) "plain date".

   A plain date is:

   if ( true /* DMY format */ )
       rPlainDate =
             ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
           | ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth  >> -rDateSep >> rYear >> eoi );
   else /* MDY format */
       rPlainDate =
             ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> -rDateSep >> rYear >> eoi );

   That is, either a day-of-month, a month and a year or only two of
   those, with "day-of-month and month" taking precedence over "month
   and year" so that "5/3" is fifth of march (or third of may,
   depending on language), not may of year 3 (or 2003).

   In turn a month is either an unsigned short integer between 1 and
   number of months in a year or the name of month:

   const int nMonths =  pCal->getNumberOfMonthsInYear();
   rMonth = ushort_ [ _pass = _1 <= nMonths,
                      bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 - 1 ) ]
       | month      [ bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 ) ];

   "_pass = _1 <= nMonths" means this rule should fail if the read
   unsigned short is not <= nMonths. "month" is the rule for a "name
   of month as a string":

   for ( int i = 0; i < nMonths; ++i)
   {
       month.add(parent.pUpperMonthText[i], i)(parent.pUpperAbbrevMonthText[i], i);
   }

   Meaning $\forall i$, pUpperMonthText[i] and
   pUpperAbbrevMonthText[i] map to month number i.

6) The grammar is constructed and invoked in IsNumberFormatMain:

   std::cerr << "***** DBG before parse ****** r=(undefined)\tfOutNumber=" << fOutNumber << "\teScannedType=" << eScannedType << std::endl;
   std::cerr << "To parse: '" << ::rtl::OUStringToOString( rString, RTL_TEXTENCODING_UTF8 ).getStr() << "'" << std::endl;

   bool r = qi::phrase_parse(rString.begin(),
                             rString.end(),
                             number_grammar<rtl::OUString::const_iterator>(*this, fOutNumber),
                             enc::space);

   std::cerr << "***** DBG after  parse ****** r=" << r << "\tfOutNumber=" << fOutNumber << "\teScannedType=" << eScannedType << std::endl;

So, this is not feature-complete yet (e.g. does not recognise
fractions, mantissa/exponent notation such as "1.52225E14", dates are
not fully i18ned), and especially the date part will probably need
some cooperation / guidance from an i18n expert :)

But I think it shows we can do that, IMO to great long-term
advantage. Do I have some buy-in from you on that?

Questions you'll ask or should ask ;-)

 - Hey, this constructs a grammar for each parse. We should do that
   only when the grammar (language) changes.

   Yes, in ChangeIntl. TODO.

P.S.: I'm in vacation from Fri 2 march 2012 to Sun 11 march 2012.

-- 
Lionel