Norbert, thanks for the feedback. <div class="gmail_quote">2011/1/30 Norbert Thiebaud <<a href="mailto:nthiebaud@gmail.com">nthiebaud@gmail.com</a>> <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div><div></div><div class="h5">On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken <<a href="mailto:kenneth.venken@gmail.com">kenneth.venken@gmail.com</a>> wrote: > Hi, > > i came across gendict.cxx while fixing a possible memleak. It took me some > time to figure out what the code did. > I notice a lot of very very long function bodies in LO-code, gendict was no > exception. So i refactored the code, did some google searches on gendict and > was able to fix the memleak. > I ended up submitting only the fix, not the refactoring, because i didn't > want to break any de facto coding style guideliness and i am still a fairly > new contributer to LO. > > I submit these patches now, so you guys can decide if you push them or not. > In case it could save some other new contributor some time in understanding > gendict ;) > BTW, i tested the code on ja.dic and output is still the same. </div></div>Even though the program is not _that_ long, I think the intent is good, but I do have few questions, remarks: Preamble: - Some of the remarks below concern code-choice you made, some of them concern things that were already like that in the original code. I did not distinguished the two because a/ the remarks are on the resulting code, not on who wrote it nor when b/ since this is a complete re factoring, minimizing changes are not a priority; So, existing oddities should be fixed along the way. - This is by now means an exhaustive review. 1/ implement before use. I'd strongly prefer if you implemented-before-use rather than prototype + use + implementation. in other words: main() at the end and the function above it, in order such as they are implemented before they are used. </blockquote><div>Ok, i'll do it like that from now on. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 2/ keep private function private: in C that means qualify the print* and init* functions you created to break-down the big bad main function, as 'static'. </blockquote><div>Good point. </div><div> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 3/ patch 0002: you factor printFunctions() but I can't see where it is being called. ( oh, it 're-appear' in patch 0004, but still) </blockquote><div>yeah, sorry about that, i noticed it was gone after a few commits. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 4/ the indentation of the original was horrible, but if you are going to refactor, you may want to fix that. i.e + if (*u != current) { + if (*u < current) + printf("u %x, current %x, count %d, lenArray.size() %d\n", *u, current, + sal::static_int_cast<int>(count), sal::static_int_cast<int>(lenArray.size())); + current = *u; + charArray[current] = lenArray.size(); + } + if (*u != current) + { + if (*u < current) + printf("u %x, current %x, count %d, lenArray.size() %d\n", *u, current, + sal::static_int_cast<int>(count), sal::static_int_cast<int>(lenArray.size())); + current = *u; + charArray[current] = lenArray.size(); + } Note that *I* prefer systematic {} even for one-liner if/for/while, but that is not universally shared... so I won't insist on it. Note2: while we are at it the sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid. printf(" %d %d", count, lenArray.size()) works just fine. and btw there is no reason for count (or in general any local counter) to be a sal_int32. int is just fine) </blockquote><div>the if statement used here should probably be an ASSERT( *u < current ) because the input dictionary file should be a sorted list of words. if *u < current is true for the last word in the dictionary, the line charArray[current+1] = lenArray.size(); (after loop) could produce unintended behaviour. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 5/ patch 004: +void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray, + sal_Int32 count) +{ + fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t"); + count = 1; what's the point passing cont as a parameter if you are going to override it's value right away ? (note: ok so, patch 0005 actually fix that...) </blockquote><div>these patches should be viewed as a wholel. The refactoring was a process. But you're wright. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> + fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping 0 in index2 array. + for (size_t k = 0; k < lenArray.size(); k++) + { + fprintf(source_fp, "0x%lx, ", static_cast<long unsigned int>(lenArray[k])); lenArray, in the generated code, is declared as sal_int32[] so why casting it's element to long unsigned int ? + if (count == 0xf) + { + count = 0; + fprintf(source_fp, "\n\t"); + } + else count++; if( (k & 0xe) == 0xe) { fprintf(source_fp, "\n\t"); } achieve the same result without the need to maintain 'count' It is still a bit odd, due to the shift introduce by the special element 0. I 'd prefer + for (size_t k = 0; k < lenArray.size(); k++) + { + if (!(k &0xf) + { + fprintf(source_fp, "\n\t"); + } + fprintf(source_fp, "0x%lx, ", static_cast<long unsigned int>(lenArray[k])); + } that change a bit the output (by having the first special 0 entry standing out, but actually that is a good thing since it _is_ special +void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set) +{ + fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t"); + sal_Int32 prev = 0; + for (sal_Int32 i = 0; i < 0x100; i++) { + if (set[i] != 0xff) { + for (sal_Int32 j = 0; j < 0x100; j++) { + sal_Int32 k = (i*0x100) + j; the compiler should be smart enough to detect that multiply by 256 is left-shift by 8, but still why take a risk: k = (i << 8 | j); </blockquote><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> + if (prev != 0 && charArray[k] == 0) { + for (k++; k < 0x10000; k++) + if (charArray[k] != 0) + break; + } + prev = charArray[(i*0x100) + j]; + fprintf( + source_fp, "0x%lx, ", + sal::static_int_cast< unsigned long >( + k < 0x10000 ? charArray[k] + 1 : 0)); + if ((j+1) % 0x10 == 0) if (j & 0x0f == 0x0f) does the same thing without an addition AND a modulo operation, note that you are in the inner loop here. this will be called up to 65K times. granted that since we use fprintf() to push constant string to a file, performance is clearly not very high on the list of concern. fputs("\n\t", source_fp) is at least an order of magnitude faster than fprintf(source_cp, "\n\t"); +void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count) +{ + count = 0; + fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t"); + for (sal_Int32 i = 0; i < 0x1FFF; i++) { + sal_uInt8 bit = 0; + for (sal_Int32 j = 0; j < 8; j++) + if (exists[i * 8 + j]) + bit |= 1 << j; + fprintf(source_fp, "0x%02x, ", bit); + if (count == 0xf) { + count = 0; + fprintf(source_fp, "\n\t"); + } else count++; + } + fprintf (source_fp, "\n};\n"); } since exist[] is really an array of bool that is init to false and the set to true when needed, only to be then packed and dump as a bitfield, you could do static inline set_exist(int index) { exist[index>>3] |= 1 << (index & 0x07); } then replace the two place that set exist exist[u[0]] = sal_True; ->set_exist(u[0]); and exist[u[i]] = sal_True; -> set_exist(u[i]); now exist[] is then defined as static unsigned char exist[0x2000]; (note declaring it static means that you don't need to initialize it anymore since the C standard guarantee that it will be initialized with 0, much more efficiently than the loop that currently do that in the code, and the code is not meant to be re-entrant anyway) and you can simplify printExistMark in a simple byte dump. </blockquote><div>Nice. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 6/ patch 006 - if (argc < 3) exit(-1); + if (argc < 3) + { + printf("2 arguments required: dictionary_file_name source_file_name"); + exit(-1); + } The convention is to send error message to stderr. (and the output -- here a source file -- should default to stdout) </blockquote><div>Ok, i'll stick to that convention from now on. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 7/ patch 008 /* FIXME?: what happens if in every range i there is at least one charArray != 0 + => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff } + => then in index2, the last range will be ignored incorrectly */ luckily this is not possible since utf16 characters in the range D800-DBFF are not supported/valid see <a href="http://en.wikipedia.org/wiki/UTF-16/UCS-2" target="_blank">http://en.wikipedia.org/wiki/UTF-16/UCS-2</a> for a reason why. so you will have, at least, 4 ranges without any hit. </blockquote><div>So "It is not possible to encode these code points in UTF-16. The Unicode standard permanently reserves these values for UTF-16 encoding only, so a single 16-bit code unit in the range 0xD800 to 0xDFFF never represents a character in Plane 0." But since we are using the UTF-16 encoding, it could be possible that a sal_Unicode value (a unicode code unit) is in this range. See section about Code points U+10000..U+10FFFF in the link you sent me. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Norbert > -- Kenneth > > > > _______________________________________________ > LibreOffice mailing list > <a href="mailto:LibreOffice@lists.freedesktop.org">LibreOffice@lists.freedesktop.org</a> > <a href="http://lists.freedesktop.org/mailman/listinfo/libreoffice" target="_blank">http://lists.freedesktop.org/mailman/listinfo/libreoffice</a> > > </blockquote><div> So, should i make the changes to the code or have you already done them? -- Kenneth </div></div>