[Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Mattias Johnsson
m.t.johnsson at gmail.com
Wed Oct 27 15:37:26 PDT 2010
On 28 October 2010 08:42, LeMoyne <jlc at mail2lee.com> wrote:
>
> Using the following sample from a git patch one can see one way in which the
> current counting method comes up with fewer words than other methods do.
> +1747,9
> 1.7.0.4
> 14 characters on two lines: either 2, 3 or 6 words depending on how you
> count
>
> Gedit says: 2 lines 6 words 15 chars 14 chars(no spaces)
> LibOdev says: 2 words 14 chars 14 chars excl spaces - (no stat line for
> lines tho it has para counts)
>
> Gedit takes each number as a word, breaking the words on punctuation
> Gedit also counts the newline as whitespace
> LibOdev counts any block of contiguous characters as a single word
> LibOdev's word counter works per text node, so it never sees the newline
>
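To make the two counting rules concrete, here is a minimal Python sketch applied to the two-line sample above. It is neither gedit's nor LibreOffice's actual code, and the function names are made up for illustration. It only shows that splitting on whitespace gives 2 words, while also breaking on punctuation gives 6.

```python
import re

# The two-line sample quoted above, joined by a single newline.
SAMPLE = "+1747,9\n1.7.0.4"


def count_words_whitespace(text):
    """Count every run of non-whitespace characters as one word."""
    return len(text.split())


def count_words_punctuation(text):
    """Count runs of letters/digits, so punctuation also breaks words."""
    return len(re.findall(r"[^\W_]+", text))


def count_chars(text):
    """Count all characters, including the newline."""
    return len(text)


def count_chars_no_spaces(text):
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())
```

With the newline included the sample has 15 characters, and 14 once whitespace is excluded, which lines up with the gedit figures quoted above.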
> Over the diff part (from qgit) of Mattias' part 1 - sw patch file showing
> gedit / LibOdev
> Words: 2418 / 2414
> Chars: 24241 / 24241
> Chars – 16830 / 16830 (excl. spaces)
> Now a near match in words and perfect match on chars excl spaces.
>
> Testing with a different, entire patch file, the major difference is in the
> word count: 1338 versus 1533, or ~200 out of 1400 words. The total char and
> char excl. spaces counts agree completely, at 13 459 and 10 157.
> Taking into account the different word handling (top) and the way the counts
> match and then don't, I suspect a second difference in the counting method
> between gedit and LibOdev, plus differences in the line breaks in the files
> after cut and paste.
>
> So far gedit and LibOdev agree completely ONLY on the non-space counts.
>
> I didn't check results on your reference odt because gedit won't open odt and
> cut and paste just dumps the XML into the text...
> Words 3997 / 18
> Chars 33429 / 125
> Chars – 28469 / 107
> Where the second, smaller numbers are the counts for one page footer. AFAIR,
> LibOdev doesn't count the footer content, and that might be the difference:
> there are 20+ pages, so that's 360+ words and ~2500 chars in the footers.
>
> I also saw how the LibOdev count is zero at load of the odt. Perhaps the
> count is made somewhere else and saved on the doc without this code or it is
> stored in the doc and loaded – either way the word count is marked clean so
> it is not re-counted when the dialog box calls updateStats and the excl.
> spaces count remains zero. Just clicking in the document causes a full
> recount tho and that seems too busy somehow.. <-- more than enough guessing
> there....
>
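The "marked clean, so never recounted" guess above can be sketched in a few lines of Python. Everything here is hypothetical (the class, the field names and the update function are invented for illustration and are not LibreOffice's code). It only shows how a cached count that predates a new field can leave that field at zero until something marks the statistics dirty.

```python
from dataclasses import dataclass


@dataclass
class DocStats:
    """Hypothetical cached statistics guarded by a dirty flag."""
    words: int = 0
    chars: int = 0
    chars_no_spaces: int = 0  # new field: absent from older saved stats
    dirty: bool = True


def update_stats(stats, text):
    """Recount only when the cached statistics are marked dirty."""
    if not stats.dirty:
        return stats  # cached values are trusted as they are
    stats.words = len(text.split())
    stats.chars = len(text)
    stats.chars_no_spaces = sum(1 for ch in text if not ch.isspace())
    stats.dirty = False
    return stats


# Stats as they might arrive from a saved document: clean, but without
# the new field, which therefore keeps its default of zero.
loaded = DocStats(words=2, chars=11, dirty=False)
text = "hello world"
```

Calling `update_stats(loaded, text)` leaves `chars_no_spaces` at zero. Setting `loaded.dirty = True` first (the equivalent of clicking in the document, if the guess is right) makes the same call fill it in.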
> All these tests are with the aScanner.GetLen() > 1 check in place. With
> that Len >=2 check, the new counting routine has no problem with single
> letter words like A, a, 1, -, or just ,
> It is puzzling that Mattias removed the check to handle single char words on
> his machine but a build out of master/LibOdev works (at least for me) with
> that same check in …
Hmm, I originally left that check in because it was in Norbert's
sketch code, and I figured he knew what was going on. But I definitely
didn't get the right word count with it in place, and I did when I
removed it. I was quite puzzled as to its purpose - your explanation
about the leading spaces and the SwScanner makes sense, though, and I
guess that's the reason it was there.
> I will test changing back to Mattias' simpler submission (building now).
> I must note that the block immediately after this count area word-counts the
> outline numbers (and counts the bullets as words!?!) - it does not have any
> such length check at all... I think all the len=1 strings that the scanner
> might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
> the Scanner's zero-length string. The Scanner's GetEnd points one slot past
> the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen()
> (no -1 there), and that end spot likely has a break marker.
>
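A small Python sketch of the half-open range described above, and of the difference between a blanket length check and skipping only break markers. The 0x01 value is the one quoted in the message. The helper names and the sample spans are invented for illustration, and whether the real SwScanner hands back tokens like this is exactly the open question here.

```python
# Value as quoted in the message above; everything else is hypothetical.
CH_TXTATR_BREAKWORD = "\x01"


def token_end(begin, length):
    """Half-open convention: end is one slot past the last character."""
    return begin + length


def slice_token(text, begin, length):
    """Extract the token covering [begin, end) from the text."""
    return text[begin:token_end(begin, length)]


def count_with_length_check(tokens):
    """Mimic a 'length > 1' filter: drops every single-character token."""
    return sum(1 for t in tokens if len(t) > 1)


def count_skipping_markers(tokens):
    """Drop only empty tokens and lone break markers."""
    return sum(1 for t in tokens if t and t != CH_TXTATR_BREAKWORD)


# A one-letter word, a break marker, then a two-letter word.
text = "a" + CH_TXTATR_BREAKWORD + "bc"
spans = [(0, 1), (1, 1), (2, 2)]  # (begin, length) pairs, made up for this example
tokens = [slice_token(text, b, n) for b, n in spans]
```

On this made-up input the blanket length check counts 1 word and the marker-only filter counts 2, so the two approaches only agree if the scanner never returns a genuine one-character word on its own.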
> Again gedit and LibOdev agree completely ONLY on the non-space counts.
Nice analysis! I'm at work now, but with your explanations I'll look
into things again when I get home, unless you've solved all the
problems by then.
I did notice the problem LO has with counting things like isolated
punctuation as a word (and its deliberate choice to count bullets as
words), but decided not to try and change it, since I figured step 1
was to add the feature without breaking the current behaviour :-P I
also couldn't see a way to make it robust for all languages,
especially those with non-Latin alphabets and weird punctuation
markers.