adding autodetection of delimiter character for CSV files

Mon Jul 30 12:13:07 PDT 2012

Hi Ben,

On Sunday, 2012-07-29 14:18:32 -0400, Ben Manashirov wrote:

> The basic idea is to take a sample amount of lines (e.g. 100).
> 
> - For each line
> - - Count the number of times each character occurs
> - Compute the "peakiness" of each character's occurrence count over the lines.
> - Find the character with smallest peakiness.
> 
> The idea is that the delimiter will occur the same number of times on each
> line, and hence its peakiness will be 0 ideally.

Nice idea. Unfortunately it fails as soon as quoted field content is
involved, as all characters within a quoted field are part of the
content, and one field may even wrap over multiple lines. Furthermore,
things are complicated by broken CSV generators that write improperly
quoted fields, in which case the boundaries of a field can only be
determined (or better, guessed) if the field separator is known.
So while the approach will probably deliver usable results for simple
data, it will easily deliver unusable results for complicated data.
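
To illustrate the quoted-field problem, here is a minimal sketch in
Python (not LibreOffice code). It assumes "peakiness" means the
population variance of the per-line counts, which is only one possible
reading of the proposal; the helper name peakiness() and the sample
data are made up for the example.

```python
import statistics

def peakiness(lines, ch):
    """Population variance of the per-line count of ch.

    Assumption: variance stands in for "peakiness"; 0 means the
    character occurs equally often on every line.
    """
    return statistics.pvariance([line.count(ch) for line in lines])

# Three records of three fields each, separated by ';'.
# The second record has a quoted field that contains the separator.
sample = [
    'a;b;c',
    '"x;y";b;c',
    'd;e;f',
]

plain = [sample[0], sample[2]]

print(peakiness(plain, ';'))   # counts are 2 and 2: zero variance
print(peakiness(sample, ';'))  # counts are 2, 3, 2: variance above zero
```

All three records have the same number of fields, yet the quoted ';'
makes the separator look uneven, so another character could easily
score better by accident.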

Btw, I wouldn't evaluate all characters <256, only common separator
characters.

However, in the simple data case that does not involve quoted field
content, your approach could be used to preselect the separator in the
import dialog. I think the " double quote character could be assumed
as the quote character: if it is not present in the sample, the result
can be used.
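
A rough sketch of that preselection, again in Python rather than the
C++ the dialog is written in. Everything here is an assumption made
for illustration: the function name guess_separator(), the candidate
list, the 100-line sample size taken from the proposal, and the use of
variance as the "peakiness" measure. It gives up (returns None) as
soon as a " is seen in the sample.

```python
import statistics

# Hypothetical set of common separators; the real dialog may offer others.
CANDIDATES = [',', ';', '\t', ' ', '|', ':']

def guess_separator(text, max_lines=100):
    """Return a separator to preselect, or None if no safe guess exists."""
    lines = [line for line in text.splitlines()[:max_lines] if line]
    if len(lines) < 2:
        return None
    # Quoted content makes per-line counts unreliable, so do not guess.
    if any('"' in line for line in lines):
        return None
    scored = []
    for ch in CANDIDATES:
        counts = [line.count(ch) for line in lines]
        if min(counts) == 0:
            continue  # a separator should occur on every sampled line
        # Lower variance first; ties go to the more frequent character.
        scored.append((statistics.pvariance(counts), -sum(counts), ch))
    if not scored:
        return None
    return min(scored)[2]

print(guess_separator("a;b;c\nd;e;f\n"))
print(guess_separator('a;"b;c";d\ne;f;g\n'))
```

The first call has no quotes and an even count of ';' on every line,
so ';' is returned; the second contains a quoted field, so the function
returns None and the dialog would keep its default.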

> I'm just presenting this so perhaps someone will add this feature.

Ah, pity, I thought you'd like to implement it :)
That would go into sc/source/ui/dbgui/scuiasciiopt.cxx for the
mbFileImport case.

  Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD