[Libreoffice] [PATCH] refactoring gendict

Kenneth Venken kenneth.venken at gmail.com
Sun Jan 30 05:35:02 PST 2011


Norbert,

thanks for the feedback.

2011/1/30 Norbert Thiebaud <nthiebaud at gmail.com>

> On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken
> <kenneth.venken at gmail.com> wrote:
> > Hi,
> >
> > i came across gendict.cxx while fixing a possible memleak. It took me
> some
> > time to figure out what the code did.
> > I notice a lot of very very long function bodies in LO-code, gendict was
> no
> > exception. So i refactored the code, did some google searches on gendict
> and
> > was able to fix the memleak.
> > I ended up submitting only the fix, not the refactoring, because i didn't
> > want to break any de facto coding style guideliness and i am still a
> fairly
> > new contributer to LO.
> >
> > I submit these patches now, so you guys can decide if you push them or
> not.
> > In case it could save some other new contributor some time in
> understanding
> > gendict ;)
> > BTW, i tested the code on ja.dic and output is still the same.
>
> Even though the program is not _that_ long, I think the intent is
> good, but I do have few questions, remarks:
>
> Preamble:
> - Some of the remarks below concern code-choice you made, some of them
> concern
> things that were already like that in the original code. I did not
> distinguished the two because
> a/ the remarks are on the resulting code, not on who wrote it nor when
> b/ since this is a complete re factoring, minimizing changes are not a
> priority; So, existing
> oddities should be fixed along the way.
>
> - This is by now means an exhaustive review.
>
> 1/ implement before use.
> I'd strongly prefer if you implemented-before-use rather than
> prototype + use + implementation.
> in other words: main() at the end and the function above it, in order
> such as they are implemented before they are used.
>
Ok, i'll do it like that from now on.


> 2/ keep private function private: in C that means qualify the print*
> and init* functions you created to break-down
> the big bad main function, as 'static'.

Good point.

3/ patch 0002: you factor printFunctions() but I can't see where it is
> being called. ( oh, it 're-appear' in patch 0004, but still)
>
> yeah, sorry about that, i noticed it was gone after a few commits.


> 4/ the indentation of the original was horrible, but if you are going
> to refactor, you may want to fix that.
> i.e
> +        if (*u != current) {
> +        if (*u < current)
> +        printf("u %x, current %x, count %d, lenArray.size() %d\n", *u,
> current,
> +                    sal::static_int_cast<int>(count),
> sal::static_int_cast<int>(lenArray.size()));
> +        current = *u;
> +        charArray[current] = lenArray.size();
> +        }
>
> +        if (*u != current)
> +        {
> +            if (*u < current)
> +                printf("u %x, current %x, count %d, lenArray.size()
> %d\n", *u, current,
> +                         sal::static_int_cast<int>(count),
> sal::static_int_cast<int>(lenArray.size()));
> +            current = *u;
> +            charArray[current] = lenArray.size();
> +        }
>
> Note that *I* prefer systematic {} even for one-liner if/for/while,
> but that is not universally shared... so I won't insist on it.
> Note2: while we are at it the
> sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid.
> printf(" %d %d", count, lenArray.size()) works just fine. and btw
> there is no reason for count (or in general any local counter) to be a
> sal_int32. int is just fine)
>
the if statement used here should probably be an ASSERT( *u < current )
because the input dictionary file should be a sorted list of words.
if *u < current is true for the last word in the dictionary, the line
charArray[current+1] = lenArray.size(); (after loop) could produce
unintended behaviour.

>
> 5/ patch 004:
> +void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray,
> +                   sal_Int32 count)
> +{
> +    fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t");
> +    count = 1;
>
> what's the point passing cont as a parameter if you are going to
> override it's value right away ? (note: ok so, patch 0005 actually fix
> that...)
>
these patches should be viewed as a wholel. The refactoring was a process.
But you're wright.

>
> +    fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping
> 0 in index2 array.
> +    for (size_t k = 0; k < lenArray.size(); k++)
> +    {
> +        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
> int>(lenArray[k]));
>
> lenArray, in the generated code, is declared as sal_int32[] so why
> casting it's element to long unsigned int ?
>
> +        if (count == 0xf)
> +        {
> +            count = 0;
> +            fprintf(source_fp, "\n\t");
> +        }
> +            else count++;
>         if( (k & 0xe) == 0xe)
>         {
>            fprintf(source_fp, "\n\t");
>         }
> achieve the same result without the need to maintain 'count'
> It is still a bit odd, due to the shift introduce by the special element 0.
>
> I 'd prefer
> +    for (size_t k = 0; k < lenArray.size(); k++)
> +    {
> +        if (!(k &0xf)
> +        {
> +            fprintf(source_fp, "\n\t");
> +        }
> +        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
> int>(lenArray[k]));
> +    }
>
> that change a bit the output (by having the first special 0 entry
> standing out, but actually that is a good thing since it _is_ special
>
> +void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set)
> +{
> +    fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t");
> +    sal_Int32 prev = 0;
> +    for (sal_Int32 i = 0; i < 0x100; i++) {
> +        if (set[i] != 0xff) {
> +        for (sal_Int32 j = 0; j < 0x100; j++) {
> +            sal_Int32 k = (i*0x100) + j;
>
> the compiler should be smart enough to detect that multiply by 256 is
> left-shift by 8, but still why take a risk:
>             k = (i << 8 | j);


> +            if (prev != 0 && charArray[k] == 0) {
> +            for (k++; k < 0x10000; k++)
> +                if (charArray[k] != 0)
> +                break;
> +            }
> +            prev = charArray[(i*0x100) + j];
> +            fprintf(
> +                source_fp, "0x%lx, ",
> +                sal::static_int_cast< unsigned long >(
> +                    k < 0x10000 ? charArray[k] + 1 : 0));
> +            if ((j+1) % 0x10 == 0)
>
>             if (j & 0x0f == 0x0f)
> does the same thing without an addition AND a modulo operation, note
> that you are in the inner loop here. this will
> be called up to 65K times. granted that since we use fprintf() to push
> constant string to a file, performance is clearly not
> very high on the list of concern.
> fputs("\n\t", source_fp) is at least an order of magnitude faster than
> fprintf(source_cp, "\n\t");
>
>
> +void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count)
> +{
> +    count = 0;
> +    fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t");
> +    for (sal_Int32 i = 0; i < 0x1FFF; i++) {
> +        sal_uInt8 bit = 0;
> +        for (sal_Int32 j = 0; j < 8; j++)
> +        if (exists[i * 8 + j])
> +            bit |= 1 << j;
> +        fprintf(source_fp, "0x%02x, ", bit);
> +        if (count == 0xf) {
> +        count = 0;
> +        fprintf(source_fp, "\n\t");
> +        } else count++;
> +    }
> +    fprintf (source_fp, "\n};\n");
>  }
> since exist[] is really an array of bool that is init to false and the
> set to true when needed, only to be then packed and dump as a
> bitfield, you could do
>
> static inline set_exist(int index)
> {
>    exist[index>>3] |= 1 << (index & 0x07);
> }
>
> then replace the two place that set exist
> exist[u[0]] = sal_True; ->set_exist(u[0]);
> and
> exist[u[i]] = sal_True; -> set_exist(u[i]);
>
> now exist[] is then defined as
> static unsigned char exist[0x2000];
> (note declaring it static means that you don't need to initialize it
> anymore since the C standard guarantee that it will be initialized
> with 0, much more efficiently than the loop that currently do that in
> the code, and the code is not meant to be re-entrant anyway)
>
> and you can simplify printExistMark in a simple byte dump.
>
Nice.

>
> 6/ patch 006
>
> -    if (argc < 3) exit(-1);
> +    if (argc < 3)
> +    {
> +        printf("2 arguments required: dictionary_file_name
> source_file_name");
> +        exit(-1);
> +    }
>
> The convention is to send error message to stderr.
> (and the output -- here a source file -- should default to stdout)
>
> Ok, i'll stick to that convention from now on.

> 7/ patch 008
> /* FIXME?: what happens if in every range i there is at least one charArray
> != 0
> +       => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff }
> +       => then in index2, the last range will be ignored incorrectly */
>
> luckily this is not possible since utf16 characters in the range
> D800-DBFF are not supported/valid
> see http://en.wikipedia.org/wiki/UTF-16/UCS-2 for a reason why.
>
> so you will have, at least, 4 ranges without any hit.
>
> So "It is not possible to encode these code points in UTF-16. The Unicode
standard permanently reserves these values for UTF-16 encoding only, so a
single 16-bit code unit in the range 0xD800 to 0xDFFF never represents a
character in Plane 0."
But since we are using the UTF-16 encoding, it could be possible that a
sal_Unicode value (a unicode code unit)  is in this range.
See section about Code points U+10000..U+10FFFF in the link you sent me.

Norbert
>
> > -- Kenneth
> >
> >
> >
> > _______________________________________________
> > LibreOffice mailing list
> > LibreOffice at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/libreoffice
> >
> >
>

So, should i make the changes to the code or have you already done them?

-- Kenneth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20110130/02e7f88f/attachment-0001.html>


More information about the LibreOffice mailing list