[Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

Wed Oct 27 14:42:53 PDT 2010

The following sample from a git patch shows one way in which the current
counting method comes up with fewer words than other methods do:
+1747,9
1.7.0.4
That is 14 characters on two lines, and either 2, 3 or 6 words depending on
how you count.

Gedit says:   2 lines, 6 words, 15 chars, 14 chars (no spaces)
LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (there is no stat line
for lines, though it has paragraph counts)

Gedit takes each number as a word, breaking the words on punctuation.
Gedit also counts the newline as whitespace.
LibOdev counts any block of contiguous characters as one word.
LibOdev's in-node word counter never sees the newline.
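
To make the two counting styles concrete, here is a minimal Python sketch run
on the sample above. The function names are made up for illustration, and the
regular expression is only an assumption about how gedit breaks words on
punctuation. None of this is the actual gedit or LibOdev code.

```python
import re

SAMPLE = "+1747,9\n1.7.0.4"  # the two-line sample from the patch


def words_by_whitespace(text):
    # LibOdev-style (assumed): every run of non-whitespace characters is one word.
    return len(text.split())


def words_by_punctuation(text):
    # gedit-style (assumed): punctuation also breaks words, so each run of
    # letters or digits counts separately.
    return len(re.findall(r"[A-Za-z0-9]+", text))


def chars_total(text):
    # Counts every character, including the newline between the two lines.
    return len(text)


def chars_per_paragraph(text):
    # Counts each line on its own, so the newline itself is never seen.
    return sum(len(line) for line in text.split("\n"))


def chars_no_whitespace(text):
    return sum(1 for ch in text if not ch.isspace())


print(words_by_whitespace(SAMPLE), words_by_punctuation(SAMPLE))
print(chars_total(SAMPLE), chars_per_paragraph(SAMPLE), chars_no_whitespace(SAMPLE))
```

On the sample this gives 2 and 6 words, and 15, 14 and 14 chars, which lines
up with the gedit and LibOdev numbers quoted above.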

Over the diff part (from qgit) of Mattias' part 1 - sw patch file, shown as
gedit / LibOdev:
Words:                2418 / 2414
Chars:                24241 / 24241
Chars (excl. spaces): 16830 / 16830
Now there is a near match in words and a perfect match on chars excl. spaces.

Testing with a different, entire patch file, the major difference is in
words: 1338 versus 1533, or ~200 out of 1400 words. The total char and char
excl. spaces counts agree completely, at 13 459 and 10 157.
Taking into account the different word handling (top) and the way the counts
match in one file and then don't in another, I suspect a second difference in
the counting method between gedit and LibOdev, plus differences in the line
breaks in the files after cut and paste.

So far gedit and LibOdev agree completely ONLY on the non-space counts.  

I didn't check the results on your reference odt against gedit because gedit
won't open odt, and cut and paste just dumps the XML into the text...
Words:                 3997  /  18
Chars:                33429  /  125
Chars (excl. spaces): 28469  /  107
where the second, smaller numbers are one page footer's counts. AFAIR LibOdev
doesn't count the footer content, and that might be the difference. There are
20+ pages, so that's 360+ words and ~2500 chars in the footers.
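
The footer estimate is just a multiplication, sketched here in Python. It
assumes every page carries the same footer text and uses 20 pages as a lower
bound. The names are made up for illustration.

```python
PAGES = 20  # "20+ pages", so this is a lower bound

# Counts for a single page footer, taken from the second column above.
FOOTER = {"words": 18, "chars": 125, "chars_no_spaces": 107}


def footer_totals(pages, per_footer):
    # Assumes every page carries the same footer text.
    return {name: pages * count for name, count in per_footer.items()}


totals = footer_totals(PAGES, FOOTER)
print(totals)
```

That gives 360 words, 2500 chars and 2140 chars excl. spaces for the footers.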

I also saw that the LibOdev count is zero when the odt is loaded. Perhaps the
count is made somewhere else and saved on the doc without this code, or it is
stored in the doc and loaded. Either way the word count is marked clean, so
it is not re-counted when the dialog box calls updateStats, and the excl.
spaces count remains zero. Just clicking in the document causes a full
recount though, and that seems too busy somehow... <-- more than enough
guessing there...
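
The behaviour guessed at above can be sketched with a cached stats object and
a modified flag. This is a toy Python model and not the LibOdev code: the
class and method names are hypothetical, and the load path that stores only
words and chars is an assumption.

```python
class Stats:
    def __init__(self, words=0, chars=0, chars_no_spaces=0):
        self.words = words
        self.chars = chars
        self.chars_no_spaces = chars_no_spaces
        self.modified = True  # True means the cached numbers are stale


class Doc:
    def __init__(self, text):
        self.text = text
        self.stats = Stats()

    @classmethod
    def load(cls, text, stored_words, stored_chars):
        # Hypothetical load path: the file stores words and chars only, and
        # the cache is marked clean so nothing is recounted.
        doc = cls(text)
        doc.stats = Stats(stored_words, stored_chars)
        doc.stats.modified = False
        return doc

    def touch(self):
        # Any edit (or, as observed, a click) marks the cache stale.
        self.stats.modified = True

    def update_stats(self):
        # Recount only when the cache is stale.
        if self.stats.modified:
            self.stats.words = len(self.text.split())
            self.stats.chars = len(self.text)
            self.stats.chars_no_spaces = sum(
                1 for ch in self.text if not ch.isspace())
            self.stats.modified = False
        return self.stats


doc = Doc.load("one two", stored_words=2, stored_chars=7)
after_load = doc.update_stats().chars_no_spaces   # cache was clean, no recount
doc.touch()
after_touch = doc.update_stats().chars_no_spaces  # stale, so recounted
print(after_load, after_touch)
```

In this model the new statistic stays at 0 after load and only becomes 6 once
something marks the stats as modified.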

All these tests are with the aScanner.GetLen() > 1 check in place. With
that Len >= 2 check, the new counting routine has no problem with single
letter words like A, a, 1, -, or just ,
It is puzzling that Mattias removed the check to handle single-char words on
his machine, but a build out of master/LibOdev works (at least for me) with
that same check in...

I will test changing back to Mattias' simpler submission (building now).
I must note that the block immediately after this count area word-counts the
outline numbers (and counts the bullets as words!?!), and it does not have any
such length check at all... I think all the len=1 strings that the scanner
might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
the Scanner's zero-length string. The Scanner's GetEnd points one slot past
the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen()
(no -1 there), and that end spot likely has a break marker.
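
The GetEnd() = GetBegin() + GetLen() relation can be illustrated with a toy
scanner in Python. This is not SwScanner's real logic: the Word class, the
scan function and the choice to split on whitespace and on the 0x01 marker are
assumptions made for the sketch.

```python
from dataclasses import dataclass

CH_TXTATR_BREAKWORD = "\x01"  # value quoted in the message above


@dataclass
class Word:
    begin: int
    length: int

    @property
    def end(self):
        # One slot past the last character: begin + length, with no -1.
        return self.begin + self.length


def is_break(ch):
    return ch.isspace() or ch == CH_TXTATR_BREAKWORD


def scan(text):
    # Toy scanner: splits on whitespace and on the break marker.
    words, i, n = [], 0, len(text)
    while i < n:
        while i < n and is_break(text[i]):
            i += 1
        start = i
        while i < n and not is_break(text[i]):
            i += 1
        if i > start:
            words.append(Word(start, i - start))
    return words


text = "A" + CH_TXTATR_BREAKWORD + "cat"
first = scan(text)[0]
print(repr(text[first.begin:first.end]))  # the word itself
print(repr(text[first.end]))              # the slot just past the word
```

Here the single-letter word "A" has length 1, and the slot at its end index
holds the break marker rather than part of the word.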

Again gedit and LibOdev agree completely ONLY on the non-space counts.  
