Norbert,<br><br>thanks for the feedback. <br><br><div class="gmail_quote">2011/1/30 Norbert Thiebaud <span dir="ltr"><<a href="mailto:nthiebaud@gmail.com">nthiebaud@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div><div></div><div class="h5">On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken<br>
<<a href="mailto:kenneth.venken@gmail.com">kenneth.venken@gmail.com</a>> wrote:<br>
> Hi,<br>
><br>
> i came across gendict.cxx while fixing a possible memleak. It took me some<br>
> time to figure out what the code did.<br>
> I notice a lot of very very long function bodies in LO-code, gendict was no<br>
> exception. So i refactored the code, did some google searches on gendict and<br>
> was able to fix the memleak.<br>
> I ended up submitting only the fix, not the refactoring, because i didn't<br>
> want to break any de facto coding style guideliness and i am still a fairly<br>
> new contributer to LO.<br>
><br>
> I submit these patches now, so you guys can decide if you push them or not.<br>
> In case it could save some other new contributor some time in understanding<br>
> gendict ;)<br>
> BTW, i tested the code on ja.dic and output is still the same.<br>
<br>
</div></div>Even though the program is not _that_ long, I think the intent is<br>
good, but I do have few questions, remarks:<br>
<br>
Preamble:<br>
- Some of the remarks below concern code-choice you made, some of them concern<br>
things that were already like that in the original code. I did not<br>
distinguished the two because<br>
a/ the remarks are on the resulting code, not on who wrote it nor when<br>
b/ since this is a complete re factoring, minimizing changes are not a<br>
priority; So, existing<br>
oddities should be fixed along the way.<br>
<br>
- This is by now means an exhaustive review.<br>
<br>
1/ implement before use.<br>
I'd strongly prefer if you implemented-before-use rather than<br>
prototype + use + implementation.<br>
in other words: main() at the end and the function above it, in order<br>
such as they are implemented before they are used. <br></blockquote><div>Ok, i'll do it like that from now on. <br> <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
2/ keep private function private: in C that means qualify the print*<br>
and init* functions you created to break-down<br>
the big bad main function, as 'static'. </blockquote><div>Good point. <br></div><div><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
3/ patch 0002: you factor printFunctions() but I can't see where it is<br>
being called. ( oh, it 're-appear' in patch 0004, but still)<br>
<br></blockquote><div>yeah, sorry about that, i noticed it was gone after a few commits. <br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
4/ the indentation of the original was horrible, but if you are going<br>
to refactor, you may want to fix that.<br>
i.e<br>
+ if (*u != current) {<br>
+ if (*u < current)<br>
+ printf("u %x, current %x, count %d, lenArray.size() %d\n", *u, current,<br>
+ sal::static_int_cast<int>(count),<br>
sal::static_int_cast<int>(lenArray.size()));<br>
+ current = *u;<br>
+ charArray[current] = lenArray.size();<br>
+ }<br>
<br>
+ if (*u != current)<br>
+ {<br>
+ if (*u < current)<br>
+ printf("u %x, current %x, count %d, lenArray.size()<br>
%d\n", *u, current,<br>
+ sal::static_int_cast<int>(count),<br>
sal::static_int_cast<int>(lenArray.size()));<br>
+ current = *u;<br>
+ charArray[current] = lenArray.size();<br>
+ }<br>
<br>
Note that *I* prefer systematic {} even for one-liner if/for/while,<br>
but that is not universally shared... so I won't insist on it.<br>
Note2: while we are at it the<br>
sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid.<br>
printf(" %d %d", count, lenArray.size()) works just fine. and btw<br>
there is no reason for count (or in general any local counter) to be a<br>
sal_int32. int is just fine)<br></blockquote><div>the if statement used here should probably be an ASSERT( *u < current ) because the input dictionary file should be a sorted list of words.<br>if *u < current is true for the last word in the dictionary, the line charArray[current+1] = lenArray.size(); (after loop) could produce unintended behaviour.<br>
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
5/ patch 004:<br>
+void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray,<br>
+ sal_Int32 count)<br>
+{<br>
+ fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t");<br>
+ count = 1;<br>
<br>
what's the point passing cont as a parameter if you are going to<br>
override it's value right away ? (note: ok so, patch 0005 actually fix<br>
that...)<br></blockquote><div>these patches should be viewed as a wholel. The refactoring was a process. But you're wright. <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
+ fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping<br>
0 in index2 array.<br>
+ for (size_t k = 0; k < lenArray.size(); k++)<br>
+ {<br>
+ fprintf(source_fp, "0x%lx, ", static_cast<long unsigned<br>
int>(lenArray[k]));<br>
<br>
lenArray, in the generated code, is declared as sal_int32[] so why<br>
casting it's element to long unsigned int ?<br>
<br>
+ if (count == 0xf)<br>
+ {<br>
+ count = 0;<br>
+ fprintf(source_fp, "\n\t");<br>
+ }<br>
+ else count++;<br>
if( (k & 0xe) == 0xe)<br>
{<br>
fprintf(source_fp, "\n\t");<br>
}<br>
achieve the same result without the need to maintain 'count'<br>
It is still a bit odd, due to the shift introduce by the special element 0.<br>
<br>
I 'd prefer<br>
+ for (size_t k = 0; k < lenArray.size(); k++)<br>
+ {<br>
+ if (!(k &0xf)<br>
+ {<br>
+ fprintf(source_fp, "\n\t");<br>
+ }<br>
+ fprintf(source_fp, "0x%lx, ", static_cast<long unsigned<br>
int>(lenArray[k]));<br>
+ }<br>
<br>
that change a bit the output (by having the first special 0 entry<br>
standing out, but actually that is a good thing since it _is_ special<br>
<br>
+void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set)<br>
+{<br>
+ fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t");<br>
+ sal_Int32 prev = 0;<br>
+ for (sal_Int32 i = 0; i < 0x100; i++) {<br>
+ if (set[i] != 0xff) {<br>
+ for (sal_Int32 j = 0; j < 0x100; j++) {<br>
+ sal_Int32 k = (i*0x100) + j;<br>
<br>
the compiler should be smart enough to detect that multiply by 256 is<br>
left-shift by 8, but still why take a risk:<br>
k = (i << 8 | j); </blockquote><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
+ if (prev != 0 && charArray[k] == 0) {<br>
+ for (k++; k < 0x10000; k++)<br>
+ if (charArray[k] != 0)<br>
+ break;<br>
+ }<br>
+ prev = charArray[(i*0x100) + j];<br>
+ fprintf(<br>
+ source_fp, "0x%lx, ",<br>
+ sal::static_int_cast< unsigned long >(<br>
+ k < 0x10000 ? charArray[k] + 1 : 0));<br>
+ if ((j+1) % 0x10 == 0)<br>
<br>
if (j & 0x0f == 0x0f)<br>
does the same thing without an addition AND a modulo operation, note<br>
that you are in the inner loop here. this will<br>
be called up to 65K times. granted that since we use fprintf() to push<br>
constant string to a file, performance is clearly not<br>
very high on the list of concern.<br>
fputs("\n\t", source_fp) is at least an order of magnitude faster than<br>
fprintf(source_cp, "\n\t");<br>
<br>
<br>
+void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count)<br>
+{<br>
+ count = 0;<br>
+ fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t");<br>
+ for (sal_Int32 i = 0; i < 0x1FFF; i++) {<br>
+ sal_uInt8 bit = 0;<br>
+ for (sal_Int32 j = 0; j < 8; j++)<br>
+ if (exists[i * 8 + j])<br>
+ bit |= 1 << j;<br>
+ fprintf(source_fp, "0x%02x, ", bit);<br>
+ if (count == 0xf) {<br>
+ count = 0;<br>
+ fprintf(source_fp, "\n\t");<br>
+ } else count++;<br>
+ }<br>
+ fprintf (source_fp, "\n};\n");<br>
}<br>
since exist[] is really an array of bool that is init to false and the<br>
set to true when needed, only to be then packed and dump as a<br>
bitfield, you could do<br>
<br>
static inline set_exist(int index)<br>
{<br>
exist[index>>3] |= 1 << (index & 0x07);<br>
}<br>
<br>
then replace the two place that set exist<br>
exist[u[0]] = sal_True; ->set_exist(u[0]);<br>
and<br>
exist[u[i]] = sal_True; -> set_exist(u[i]);<br>
<br>
now exist[] is then defined as<br>
static unsigned char exist[0x2000];<br>
(note declaring it static means that you don't need to initialize it<br>
anymore since the C standard guarantee that it will be initialized<br>
with 0, much more efficiently than the loop that currently do that in<br>
the code, and the code is not meant to be re-entrant anyway)<br>
<br>
and you can simplify printExistMark in a simple byte dump.<br></blockquote><div>Nice. <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
6/ patch 006<br>
<br>
- if (argc < 3) exit(-1);<br>
+ if (argc < 3)<br>
+ {<br>
+ printf("2 arguments required: dictionary_file_name source_file_name");<br>
+ exit(-1);<br>
+ }<br>
<br>
The convention is to send error message to stderr.<br>
(and the output -- here a source file -- should default to stdout)<br>
<br></blockquote><div>Ok, i'll stick to that convention from now on. <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
7/ patch 008<br>
/* FIXME?: what happens if in every range i there is at least one charArray != 0<br>
+ => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff }<br>
+ => then in index2, the last range will be ignored incorrectly */<br>
<br>
luckily this is not possible since utf16 characters in the range<br>
D800-DBFF are not supported/valid<br>
see <a href="http://en.wikipedia.org/wiki/UTF-16/UCS-2" target="_blank">http://en.wikipedia.org/wiki/UTF-16/UCS-2</a> for a reason why.<br>
<br>
so you will have, at least, 4 ranges without any hit.<br>
<br></blockquote><div>So "It is not possible to encode these code points in UTF-16. The Unicode
standard permanently reserves these values for UTF-16 encoding only, so a
single 16-bit code unit in the range 0xD800 to 0xDFFF never represents a
character in Plane 0."<br>But since we are using the UTF-16 encoding, it could be possible that a sal_Unicode value (a unicode code unit) is in this range.<br>See section about<span class="mw-headline" id="Code_points_U.2B10000..U.2B10FFFF"> Code points U+10000..U+10FFFF</span> in the link you sent me.<br>
<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Norbert<br>
<br>
> -- Kenneth<br>
><br>
><br>
><br>
> _______________________________________________<br>
> LibreOffice mailing list<br>
> <a href="mailto:LibreOffice@lists.freedesktop.org">LibreOffice@lists.freedesktop.org</a><br>
> <a href="http://lists.freedesktop.org/mailman/listinfo/libreoffice" target="_blank">http://lists.freedesktop.org/mailman/listinfo/libreoffice</a><br>
><br>
><br></blockquote><div><br>So, should i make the changes to the code or have you already done them?<br><br>-- Kenneth <br></div></div>