Trash Can Question
Alexander Larsson
alexl at redhat.com
Sat Aug 22 04:38:17 PDT 2009
On Fri, 2009-08-21 at 21:23 +0200, Andrea Francia wrote:
>
>
> 2009/8/21 Alexander Larsson <alexl at redhat.com>
> On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
> > 2009/8/5 David Faure <faure at kde.org>
> > In practice I would recommend using utf8 everywhere
> and
> > getting rid
> > of the whole "filesystem encoding" mess in the first
> place.
> >
> >
> > Who is interested to work a new draft (a draft) of the spec
> which
> > solves this and the other problems emerged?
>
>
> This is not a "problem" that should be "solved". It was very
> deliberately added to the spec in order to allow all files to
> be
> trashed. How would you trash a file named some non-utf8 string
> if only
> utf8 is allowed in the format?
>
> Filenames on linux are zero terminated arrays of bytes. If you
> treat it
> like anything else you will just fail in some corner cases.
>
>
> For me filenames are a list of unicode characters. The way those
> filenames are represented using array of bytes is a different issue.
> As far as I know, on the filesystem it is possible to create filenames
> with the zero character '\0' or the newline ('\n') in them.
I don't know what operating system you are running, but I'm running
UNIX, and the unix APIs define filenames as a list of bytes, terminated
by a zero byte, where only byte 47 ("/" in ascii) is treated specially.
What you believe filenames are does not really affect things. On the
filesystem a file has a name consisting of an array of bytes, and only
if you can give this exact array of bytes can you open the file. There
is no implicit encoding or character set; there are only bytes.
A filename may be a string of bytes that are not valid in any existing
encoding of any existing character set. You still need to be able to
represent this in the trashinfo file. And even if the filename *is* in
some valid encoding you don't know which one it is. The various locale
settings can give you a hint about what it may be, and should be used
when you *create* a new filename from a known unicode string. However,
what's on the disk is what's on the disk and has no strict connection to
unicode.
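
To make the point concrete, here is a minimal Python sketch. The
filenames in it are made up for illustration, and the encoding used
when creating a new name is assumed to be UTF-8 (in practice it would
come from the locale settings).

```python
# A hypothetical on-disk name: a byte array that is not valid UTF-8.
raw_name = b"report-\xff\xfe.txt"


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte array happens to decode as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Reading: the name on disk is just bytes and may not decode at all.
print(is_valid_utf8(raw_name))

# Creating: going from a known unicode string to bytes needs an
# encoding. UTF-8 is assumed here; the locale would normally decide.
new_name = "r\u00e9sum\u00e9.txt".encode("utf-8")
print(is_valid_utf8(new_name))
```

The asymmetry is the point: encoding a known unicode string always
works once you pick an encoding, but decoding whatever is on disk may
not.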
> Of course, we should all move towards all filenames being in
> UTF8, avoid
> creating non-UTF8 filenames, etc.
>
>
> This sound strange to me, UTF-8 is about encoding not about character
> set.
> May be there is a little misunderstanding about utf8, unicode and
> encoding system.
I'm not confused about this.
> It seems to me that you are using the term utf-8 as character set.
>
>
> I see two different aspects:
> 1) which character set the trash system should be able to handle?
The trash system is not about handling any character set at all. It is
about storing the identifier used on the operating system to access the
file. Anything else and there are valid files that the trash system
wouldn't handle.
> 2) how the trash system handle it?
>
> I think that the trash system should be able to manage filenames and
> path expressed in unicode.
> One way to encode unicode characters is UTF-8, but there also UTF-16,
> and others.
So, say you have a filename that is basically a random set of bytes, not
valid utf8, not valid utf16. It's e.g. valid latin-1, because all byte
strings are, but if you view it in latin-1 it's full of unprintable
characters and gobbledygook. This is a valid filename in UNIX, and you must give
exactly this array of bytes in order to access the file (to open it,
rename, delete, etc). How do you propose to "encode this in UTF-8"?
> I don't see any problem with filesystems whose filenames aren't encoded
> in utf8.
> All the pre-unicode character set are part of unicode and all
> character of unicode can be represented in utf8.
>
> That is a different issue, and should
> not make us limit our specifications to only work on a subset
> of the
> valid filenames.
>
>
> That's true but currently I see the following problems:
> - the subset of valid filenames doesn't contains filenames with '\n'
> or '\0' in it
Oh, the string itself in the file is URI-style encoded, so it can
contain any sort of bytes. (In fact, it can even contain encoded \0s
I guess, but that is unlikely to work well, as you can't e.g. pass
such a filename to the kernel, since it thinks supplied filenames stop at
the first zero.)
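
As a rough illustration of the URI-style encoding, the following
Python sketch round-trips an arbitrary byte array through percent
escapes using the standard urllib.parse module. The path is invented
for the example, and this is not meant as the exact escaping rules any
particular trash implementation uses.

```python
from urllib.parse import quote, unquote_to_bytes

# A hypothetical path containing a non-UTF-8 byte and a newline.
raw_path = b"/home/user/bad-\xff\n-name"

# quote() accepts bytes and escapes everything outside a small safe
# set; "/" is left alone by default. The result is plain ASCII text.
encoded = quote(raw_path)
print(encoded)

# unquote_to_bytes() gives back the exact original byte array,
# without assuming any encoding or character set.
decoded = unquote_to_bytes(encoded)
print(decoded == raw_path)
```

The encoded form is ASCII, so it can sit inside a text file in any
ASCII-compatible encoding without losing a single byte of the name.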
> - isn't clear (probably only be for me) which encoding should be used
> for reading .trashinfo files.
In general utf8 is the encoding of the whole file, but the name
key, when unescaped, is to be treated as an array of bytes representing
the actual filename, which may be in any encoding or none.
> - the uses of character set like latin1 for encoding .trashinfo files
> contents could lead to a loss of information
Since the filename is URI-encoded we can store whatever bytes we want in
the filename. However, once that is decoded we can't expect it to be in
any encoding or character set, because that would, as you say, lead to a
loss of information.
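
A minimal sketch of what a reader of a .trashinfo file might do,
assuming the key holding the original location is called Path and the
file has the simple key=value layout shown in the sample. The helper
name and the sample contents are made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical file contents: the file itself is UTF-8 text, and the
# value of the Path key is URI-style escaped.
SAMPLE = (
    "[Trash Info]\n"
    "Path=/home/user/bad-%FF-name\n"
    "DeletionDate=2009-08-22T04:38:17\n"
)


def original_path(trashinfo_text: str) -> bytes:
    """Return the original filename as raw bytes, not as a string."""
    for line in trashinfo_text.splitlines():
        if line.startswith("Path="):
            # Unescape to bytes; do NOT decode the result further.
            return unquote_to_bytes(line[len("Path="):])
    raise ValueError("no Path key found")


path = original_path(SAMPLE)
print(path)
```

The return type is bytes on purpose: the result goes straight back to
the filesystem APIs, and is only decoded (with some guess at the
encoding) when it has to be displayed.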