[Libreoffice-bugs] [Bug 125110] CalcSpreadsheet: issues converting .CSV where there are more than 30K rows of data

bugzilla-daemon at bugs.documentfoundation.org
Sun May 5 18:42:41 UTC 2019


https://bugs.documentfoundation.org/show_bug.cgi?id=125110

--- Comment #7 from Mike Kaganski <mikekaganski at hotmail.com> ---
(In reply to Julien Nabet from comment #4)
> Let's take the original line for example:
> 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False

Let me describe my idea about what happens now, and what should happen instead.

IIUC, when LO sees the string separator (double quote) at the beginning of a
field (start of line, or right after a field separator), it enters
"quote-enclosed field" mode. It expects the field to end with another double
quote *exactly at the end-of-field position* (i.e., followed by a newline or a
field separator), and it keeps consuming everything until it finds such a
specially-positioned double quote. For the sample above, this means that the
closing double quote after PINS is consumed without terminating the
"quote-enclosed field" mode (because it is followed by a space, not by
end-of-field), and no suitable double quote is found before EOF. But a
non-escaped double quote in the middle of a field is not valid in a proper
quote-enclosed field!

But let's modify the sample a little:

> 6005347055939,5498,,PUSH "PINS" 1`S, 1`S,14205,Inactive,False,False

The first double quote now isn't at the beginning of the field, so LibreOffice
treats it as a normal field character and properly finds the end of the field,
producing the field 'PUSH "PINS" 1`S'.
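
To make the two cases concrete, here is a minimal Python sketch of the
behaviour as described above. It models my reading of it, not LibreOffice's
actual import code; the names read_field_current and parse_current are made up
for illustration, and treating a doubled double quote as an escaped one is an
assumption.

```python
def read_field_current(text, pos, sep=",", quote='"'):
    """Read one field starting at pos; return (field, next_pos).

    next_pos is the index of the separator/newline that ended the field,
    or len(text) if the field ran to the end of the input.
    """
    n = len(text)
    if pos < n and text[pos] == quote:
        # "Quote-enclosed field" mode: only a quote that sits exactly at the
        # end-of-field position (before sep, newline or EOF) closes the field.
        i = pos + 1
        while i < n:
            if text[i] == quote:
                nxt = text[i + 1] if i + 1 < n else ""
                if nxt == quote:          # assumed: "" is an escaped quote
                    i += 2
                    continue
                if nxt in ("", sep, "\n"):
                    return text[pos + 1:i].replace(quote * 2, quote), i + 1
                # A quote followed by anything else is silently consumed.
            i += 1
        return text[pos + 1:], n          # no closing quote: swallow to EOF
    i = pos
    while i < n and text[i] not in (sep, "\n"):
        i += 1
    return text[pos:i], i


def parse_current(text, sep=","):
    """Split text into rows of fields using read_field_current."""
    rows, row, pos, n = [], [], 0, len(text)
    while pos < n:
        field, pos = read_field_current(text, pos, sep)
        row.append(field)
        if pos < n and text[pos] == sep:
            pos += 1
            if pos == n:
                row.append("")            # trailing separator: empty field
            continue
        rows.append(row)
        row = []
        pos += 1                          # skip the newline
    if row:
        rows.append(row)
    return rows


original = '6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False\n1,2,3\n'
modified = '6005347055939,5498,,PUSH "PINS" 1`S, 1`S,14205,Inactive,False,False\n1,2,3\n'

print(len(parse_current(original)))   # the quoted field swallows the next row
print(parse_current(modified)[0][3])  # the mid-field quotes are kept as-is
```

With the original line, the fourth field runs to the end of the input and
takes the following row with it; with the modified line, the row splits into
nine fields.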

Now recall that CSV was used without any formal description for decades
before RFC 4180 appeared; that document is a best effort to codify
"best practices", but it cannot simply undo all the pre-existing history. So
what is true for standards that were formally defined from the start ("be
strict when generating; be forgiving when consuming") is tenfold true for CSV.

So I suppose that what should be done here is:
1. On seeing the opening double quote at the beginning of the field, start
"quote-enclosed field" mode.
2. If the parser encounters something *invalid* for that mode, it should re-read
the field, this time without the "quote-enclosed field" mode (to properly
re-consume possible field separators that could have been read in the first pass
as quoted field content).

This way, this sample would be read properly, without introducing any
ambiguity.
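
A minimal Python sketch of this two-pass idea follows. It is an illustration
under stated assumptions, not a patch for LibreOffice: the names
read_field_lenient and parse_lenient are hypothetical, "invalid" is taken to
mean a lone double quote that is neither doubled nor at the end-of-field
position, and an unterminated quoted field at EOF is also treated as invalid
(my assumption).

```python
def read_field_lenient(text, pos, sep=",", quote='"'):
    """Read one field starting at pos; return (field, next_pos)."""
    n = len(text)
    if pos < n and text[pos] == quote:
        # Pass 1: try "quote-enclosed field" mode.
        i = pos + 1
        while i < n:
            if text[i] != quote:
                i += 1
                continue
            nxt = text[i + 1] if i + 1 < n else ""
            if nxt == quote:              # assumed: "" is an escaped quote
                i += 2
                continue
            if nxt in ("", sep, "\n"):    # properly positioned closing quote
                return text[pos + 1:i].replace(quote * 2, quote), i + 1
            break                         # lone quote mid-field: invalid
        # Invalid or unterminated: fall through and re-read from pos.
    # Pass 2 (or a plain field): read up to the next separator or newline,
    # keeping any quotes as ordinary characters.
    i = pos
    while i < n and text[i] not in (sep, "\n"):
        i += 1
    return text[pos:i], i


def parse_lenient(text, sep=","):
    """Split text into rows of fields using read_field_lenient."""
    rows, row, pos, n = [], [], 0, len(text)
    while pos < n:
        field, pos = read_field_lenient(text, pos, sep)
        row.append(field)
        if pos < n and text[pos] == sep:
            pos += 1
            if pos == n:
                row.append("")            # trailing separator: empty field
            continue
        rows.append(row)
        row = []
        pos += 1                          # skip the newline
    if row:
        rows.append(row)
    return rows


sample = '6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False\n1,2,3\n'
for parsed_row in parse_lenient(sample):
    print(len(parsed_row), parsed_row)
```

Here the original line yields nine fields, the fourth being '"PUSH PINS" 1`S'
with its quotes kept verbatim, and the following row stays intact. Properly
quoted fields, including ones that contain separators or newlines, still go
through the first pass unchanged.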
