[Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

Wed Oct 27 15:37:26 PDT 2010

On 28 October 2010 08:42, LeMoyne <jlc at mail2lee.com> wrote:
>
> Using the following sample from a git patch one can see one way in which the
> current counting method comes up with fewer words than other methods do.
> +1747,9
> 1.7.0.4
> 14 characters on two lines: either 2, 3 or 6 words depending on how you
> count
>
> Gedit says:  2 lines 6 words 15 chars 14 chars(no spaces)
> LibOdev says: 2 words 14 chars 14 chars excl spaces  - (no stat line for
> lines, though it has paragraph counts)
>
> Gedit takes each number as a word, breaking the words on punctuation
> Gedit also counts the newline as whitespace
> LibOdev counts any block of contiguous non-space characters as a word
> LibOdev's per-node word counter never sees the newline
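
To make the two counting rules above concrete, here is a minimal Python
sketch run against the same two-line sample. It is only an illustration of
the rules as described in this thread, not the gedit or LibreOffice code;
all function names are made up for the example.

```python
import re

# The two-line sample quoted above.
SAMPLE = "+1747,9\n1.7.0.4"

def count_whitespace_words(text):
    # Every run of non-whitespace characters is one word.
    return len(text.split())

def count_punctuation_split_words(text):
    # Every run of letters or digits is one word, so punctuation
    # and whitespace both act as separators.
    return len(re.findall(r"[^\W_]+", text))

def count_chars(text):
    # Whole-buffer count: the newline is included.
    return len(text)

def count_chars_per_line(text):
    # Per-line (per-node) count: the newline is never seen.
    return sum(len(line) for line in text.splitlines())

def count_chars_no_spaces(text):
    # Exclude all whitespace, including the newline.
    return sum(1 for ch in text if not ch.isspace())
```

On the sample, the whitespace rule gives 2 words and the punctuation rule
gives 6; the whole-buffer character count is 15, while the per-line count
and the no-spaces count are both 14, matching the figures quoted above.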
>
> Over the diff part (from qgit) of Mattias' part 1 - sw patch file showing
> gedit / LibOdev
> Words: 2418 / 2414
> Chars: 24241 / 24241
> Chars – 16830 / 16830  (excl. spaces)
> Now a near match in words and perfect match on chars excl spaces.
>
> Testing with a different entire patch file, the major difference is in words:
> 1338 to 1533, or ~200 out of 1400 words, but the total chars and chars excl.
> spaces agree completely: 13 459 and 10 157.
> Taking into account the different word handling (top) and the way the counts
> nearly match on one file but not on the other, I suspect a second difference
> in the counting method between gedit and LibOdev, and differences in the line
> breaks in the files after cut and paste.
>
> So far gedit and LibOdev agree completely ONLY on the non-space counts.
>
> I didn't check results on your reference odt because gedit won't open odt and
> cut and paste just dumps the XML into the text...
> Words      3997  /  18
> Chars     33429  /  125
> Chars –  28469  /  107
> Where the second, smaller numbers are a page footer's counts.  AFAIR
> LibOdev doesn't count the footer content and that might be the difference:
> there are 20+ pages, so that's 360+ words and ~2500 chars in the footers.
>
> I also saw that the LibOdev count is zero when the odt loads.  Perhaps the
> count is made somewhere else and saved on the doc without this code, or it is
> stored in the doc and loaded – either way the word count is marked clean, so
> it is not re-counted when the dialog box calls updateStats and the excl.
> spaces count remains zero.   Just clicking in the document causes a full
> recount though, and that seems too busy somehow.. <-- more than enough
> guessing there....
>
> All these tests are with the aScanner.GetLen() > 1 check in place.  With
> that Len >=2 check, the new counting routine has no problem with single
> letter words like A, a, 1, -, or just ,
> It is puzzling that Mattias removed the check to handle single char words on
> his machine but a build out of master/LibOdev works (at least for me) with
> that same check in …

Hmm, I originally left that check in because it was in Norbert's
sketch code, and I figured he knew what was going on. But I definitely
didn't get the right word count with it in place, and I did when I
removed it. I was quite puzzled as to its purpose - your explanation
about the leading spaces and the SwScanner makes sense, though, and I
guess that's the reason it was there.

> I will test changing back to Mattias' simpler submission (building now).
> I must note that the block immediately after this count area word-counts the
> outline numbers (and counts the bullets as words!?!) - it does not have any
> such length check at all... I think all the len=1 strings that the scanner
> might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
> the Scanner's zero-length string.  Scanner's GetEnd points one slot past the
> end of the string – i.e. for SwScanner GetEnd() = GetBegin() + GetLen()    (no
> -1 there).  And that end spot likely has a break marker.
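
A small Python model of that hypothesis may help when checking it. This is
not the SwScanner API: Token, is_countable and count_words are hypothetical
names, and only the end = begin + length relation and the 0x01 value of
CH_TXTATR_BREAKWORD come from the message above. The sketch filters on
token content instead of token length.

```python
from dataclasses import dataclass

# Break marker value as quoted in the message above.
CH_TXTATR_BREAKWORD = "\x01"

@dataclass
class Token:
    begin: int
    length: int

    @property
    def end(self):
        # Half-open range: end is one slot past the last character,
        # i.e. end = begin + length (no -1).
        return self.begin + self.length

def is_countable(text, tok):
    word = text[tok.begin:tok.end]
    # Skip empty tokens and tokens that are only the break marker.
    return len(word) > 0 and word != CH_TXTATR_BREAKWORD

def count_words(text, tokens):
    return sum(1 for tok in tokens if is_countable(text, tok))
```

For example, with text = "a\x01 b" and tokens at (0, 1), (1, 1) and (3, 1),
count_words returns 2: the two single-letter words are kept and the lone
break marker is dropped, whereas a plain length > 1 test would reject all
three tokens in this model.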
>
> Again gedit and LibOdev agree completely ONLY on the non-space counts.

Nice analysis! I'm at work now, but with your explanations I'll look
into things again when I get home, unless you've solved all the
problems by then.

I did notice the problem LO has with counting things like isolated
punctuation as a word (and its deliberate choice to count bullets as
words), but decided not to try and change it, since I figured step 1
was to add the feature without breaking the current behaviour :-P I
also couldn't see a way to make it robust for all languages,
especially those with non-Latin alphabets and weird punctuation
markers.
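
For reference, one possible way to skip isolated punctuation without
hard-coding Latin characters is to test Unicode general categories. The
Python sketch below is only an assumption about how such a filter could
look, not something taken from the patch, and it does nothing for scripts
that do not separate words with spaces.

```python
import unicodedata

def is_punctuation_only(token):
    # Unicode general categories starting with "P" are punctuation
    # (Pd dash, Po other, Pi/Pf quotes, and so on).
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def count_words_skipping_punctuation(text):
    # Split on whitespace, then ignore tokens made only of punctuation.
    return sum(1 for tok in text.split() if not is_punctuation_only(tok))
```

With this rule "A - b , c" counts as 3 words, since the lone dash and comma
are ignored, and an isolated ideographic full stop is ignored as well.
Symbols such as "+" are not in a punctuation category, so they would still
count as words.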