[Libreoffice] [PATCH] refactoring gendict

Norbert Thiebaud nthiebaud at gmail.com
Sun Jan 30 03:22:14 PST 2011


On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken
<kenneth.venken at gmail.com> wrote:
> Hi,
>
> i came across gendict.cxx while fixing a possible memleak. It took me some
> time to figure out what the code did.
> I notice a lot of very very long function bodies in LO-code, gendict was no
> exception. So i refactored the code, did some google searches on gendict and
> was able to fix the memleak.
> I ended up submitting only the fix, not the refactoring, because i didn't
> want to break any de facto coding style guideliness and i am still a fairly
> new contributer to LO.
>
> I submit these patches now, so you guys can decide if you push them or not.
> In case it could save some other new contributor some time in understanding
> gendict ;)
> BTW, i tested the code on ja.dic and output is still the same.

Even though the program is not _that_ long, I think the intent is
good, but I do have few questions, remarks:

Preamble:
- Some of the remarks below concern code-choice you made, some of them concern
things that were already like that in the original code. I did not
distinguished the two because
a/ the remarks are on the resulting code, not on who wrote it nor when
b/ since this is a complete re factoring, minimizing changes are not a
priority; So, existing
oddities should be fixed along the way.

- This is by now means an exhaustive review.

1/ implement before use.
I'd strongly prefer if you implemented-before-use rather than
prototype + use + implementation.
in other words: main() at the end and the function above it, in order
such as they are implemented before they are used.

2/ keep private function private: in C that means qualify the print*
and init* functions you created to break-down
the big bad main function, as 'static'.

3/ patch 0002: you factor printFunctions() but I can't see where it is
being called. ( oh, it 're-appear' in patch 0004, but still)

4/ the indentation of the original was horrible, but if you are going
to refactor, you may want to fix that.
i.e
+        if (*u != current) {
+        if (*u < current)
+        printf("u %x, current %x, count %d, lenArray.size() %d\n", *u, current,
+                    sal::static_int_cast<int>(count),
sal::static_int_cast<int>(lenArray.size()));
+        current = *u;
+        charArray[current] = lenArray.size();
+        }

+        if (*u != current)
+        {
+            if (*u < current)
+                printf("u %x, current %x, count %d, lenArray.size()
%d\n", *u, current,
+                         sal::static_int_cast<int>(count),
sal::static_int_cast<int>(lenArray.size()));
+            current = *u;
+            charArray[current] = lenArray.size();
+        }

Note that *I* prefer systematic {} even for one-liner if/for/while,
but that is not universally shared... so I won't insist on it.
Note2: while we are at it the
sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid.
printf(" %d %d", count, lenArray.size()) works just fine. and btw
there is no reason for count (or in general any local counter) to be a
sal_int32. int is just fine)

5/ patch 004:
+void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray,
+                   sal_Int32 count)
+{
+    fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t");
+    count = 1;

what's the point passing cont as a parameter if you are going to
override it's value right away ? (note: ok so, patch 0005 actually fix
that...)

+    fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping
0 in index2 array.
+    for (size_t k = 0; k < lenArray.size(); k++)
+    {
+        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
int>(lenArray[k]));

lenArray, in the generated code, is declared as sal_int32[] so why
casting it's element to long unsigned int ?

+        if (count == 0xf)
+        {
+            count = 0;
+            fprintf(source_fp, "\n\t");
+        }
+            else count++;
         if( (k & 0xe) == 0xe)
         {
            fprintf(source_fp, "\n\t");
         }
achieve the same result without the need to maintain 'count'
It is still a bit odd, due to the shift introduce by the special element 0.

I 'd prefer
+    for (size_t k = 0; k < lenArray.size(); k++)
+    {
+        if (!(k &0xf)
+        {
+            fprintf(source_fp, "\n\t");
+        }
+        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
int>(lenArray[k]));
+    }

that change a bit the output (by having the first special 0 entry
standing out, but actually that is a good thing since it _is_ special

+void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set)
+{
+    fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t");
+    sal_Int32 prev = 0;
+    for (sal_Int32 i = 0; i < 0x100; i++) {
+        if (set[i] != 0xff) {
+        for (sal_Int32 j = 0; j < 0x100; j++) {
+            sal_Int32 k = (i*0x100) + j;

the compiler should be smart enough to detect that multiply by 256 is
left-shift by 8, but still why take a risk:
             k = (i << 8 | j);

+            if (prev != 0 && charArray[k] == 0) {
+            for (k++; k < 0x10000; k++)
+                if (charArray[k] != 0)
+                break;
+            }
+            prev = charArray[(i*0x100) + j];
+            fprintf(
+                source_fp, "0x%lx, ",
+                sal::static_int_cast< unsigned long >(
+                    k < 0x10000 ? charArray[k] + 1 : 0));
+            if ((j+1) % 0x10 == 0)

             if (j & 0x0f == 0x0f)
does the same thing without an addition AND a modulo operation, note
that you are in the inner loop here. this will
be called up to 65K times. granted that since we use fprintf() to push
constant string to a file, performance is clearly not
very high on the list of concern.
fputs("\n\t", source_fp) is at least an order of magnitude faster than
fprintf(source_cp, "\n\t");


+void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count)
+{
+    count = 0;
+    fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t");
+    for (sal_Int32 i = 0; i < 0x1FFF; i++) {
+        sal_uInt8 bit = 0;
+        for (sal_Int32 j = 0; j < 8; j++)
+        if (exists[i * 8 + j])
+            bit |= 1 << j;
+        fprintf(source_fp, "0x%02x, ", bit);
+        if (count == 0xf) {
+        count = 0;
+        fprintf(source_fp, "\n\t");
+        } else count++;
+    }
+    fprintf (source_fp, "\n};\n");
 }
since exist[] is really an array of bool that is init to false and the
set to true when needed, only to be then packed and dump as a
bitfield, you could do

static inline set_exist(int index)
{
    exist[index>>3] |= 1 << (index & 0x07);
}

then replace the two place that set exist
exist[u[0]] = sal_True; ->set_exist(u[0]);
and
exist[u[i]] = sal_True; -> set_exist(u[i]);

now exist[] is then defined as
static unsigned char exist[0x2000];
(note declaring it static means that you don't need to initialize it
anymore since the C standard guarantee that it will be initialized
with 0, much more efficiently than the loop that currently do that in
the code, and the code is not meant to be re-entrant anyway)

and you can simplify printExistMark in a simple byte dump.

6/ patch 006

-    if (argc < 3) exit(-1);
+    if (argc < 3)
+    {
+        printf("2 arguments required: dictionary_file_name source_file_name");
+        exit(-1);
+    }

The convention is to send error message to stderr.
(and the output -- here a source file -- should default to stdout)

7/ patch 008
/* FIXME?: what happens if in every range i there is at least one charArray != 0
+       => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff }
+       => then in index2, the last range will be ignored incorrectly */

luckily this is not possible since utf16 characters in the range
D800-DBFF are not supported/valid
see http://en.wikipedia.org/wiki/UTF-16/UCS-2 for a reason why.

so you will have, at least, 4 ranges without any hit.

Norbert

> -- Kenneth
>
>
>
> _______________________________________________
> LibreOffice mailing list
> LibreOffice at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/libreoffice
>
>


More information about the LibreOffice mailing list