Trash Can Question

Sun Aug 23 05:16:58 PDT 2009

I reviewed the messages from Larsson and PCMag and I think I was wrong about
what constitutes an identifier for a file in a filesystem. Thanks to you
both for helping me to review my opinions.
There are still some things that I don't get; I'll write about my
doubts after I have reviewed some more documentation.

2009/8/22 Alexander Larsson <alexl at redhat.com>

> On Fri, 2009-08-21 at 21:23 +0200, Andrea Francia wrote:
> >
> >
> > 2009/8/21 Alexander Larsson <alexl at redhat.com>
> >         On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
> >         > 2009/8/5 David Faure <faure at kde.org>
> >         >         In practice I would recommend using utf8 everywhere and
> >         >         getting rid of the whole "filesystem encoding" mess in
> >         >         the first place.
> >         >
> >         >
> >         > Who is interested in working on a new draft of the spec that
> >         > solves this and the other problems that have emerged?
> >
> >
> >         This is not a "problem" that should be "solved". It was very
> >         deliberately added to the spec in order to allow all files to be
> >         trashed. How would you trash a file named some non-utf8 string if
> >         only utf8 is allowed in the format?
> >
> >         Filenames on Linux are zero-terminated arrays of bytes. If you
> >         treat them like anything else you will just fail in some corner
> >         cases.
> >
> >
> > For me filenames are a list of unicode characters. The way those
> > filenames are represented as an array of bytes is a different issue.
> > As far as I know, on the filesystem it is possible to create a filename
> > with the zero character ('\0') or the newline ('\n') in it.
>
> I don't know what operating system you are running, but I'm running
> UNIX, and the unix APIs define filenames as a list of bytes, terminated
> by a zero byte, where only byte 47 ("/" in ascii) is treated specially.
> What you believe filenames are does not really affect things. On the
> filesystem a file has a name consisting of an array of bytes, and only
> if you can give this exact array of bytes can you open the file. There
> is no implicit encoding or character set; there are only bytes.
>
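
The rule described above can be sketched in a few lines of Python. This is only an illustration of the "array of bytes" view, not something taken from the spec: the function name `is_valid_component` is hypothetical, and it ignores length limits and the reserved names "." and "..".

```python
SLASH = 47  # "/" in ASCII, the only byte treated specially inside a path
NUL = 0     # the zero byte terminates the name, so it cannot appear inside it


def is_valid_component(name: bytes) -> bool:
    """Return True if `name` could be a single UNIX filename.

    No encoding or character set is consulted: any non-empty sequence of
    bytes is accepted as long as it contains neither 0 nor 47.
    """
    return len(name) > 0 and NUL not in name and SLASH not in name


if __name__ == "__main__":
    for candidate in (b"report.txt", b"\xff\xfe\n", b"a/b", b"a\x00b"):
        print(candidate, is_valid_component(candidate))
```

Note that a name containing a newline passes the check, while one containing a zero byte does not.
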
> A filename may be a string of bytes that are not valid in any existing
> encoding of any existing character set. You still need to be able to
> represent this in the trashinfo file. And even if the filename *is* in
> some valid encoding you don't know which one it is. The various locale
> settings can give you a hint about what it may be, and should be used
> when you *create* a new filename from a known unicode string. However,
> what's on the disk is what's on the disk and has no strict connection to
> unicode.
>
> >         Of course, we should all move towards all filenames being in
> >         UTF8, avoid creating non-UTF8 filenames, etc.
> >
> >
> > This sounds strange to me: UTF-8 is an encoding, not a character
> > set.
> > Maybe there is a small misunderstanding about utf8, unicode and
> > encodings.
>
> I'm not confused about this.
>
> > It seems to me that you are using the term utf-8 as a character set.
> >
> >
> > I see two different aspects:
> >  1) which character set should the trash system be able to handle?
>
> The trash system is not about handling any character set at all. It is
> about storing the identifier used on the operating system to access the
> file. Anything else and there are valid files that the trash system
> wouldn't handle.
>
> >  2) how should the trash system handle it?
> >
> > I think that the trash system should be able to manage filenames and
> > paths expressed in unicode.
> > One way to encode unicode characters is UTF-8, but there are also
> > UTF-16 and others.
>
> So, say you have a filename that is basically a random set of bytes, not
> valid utf8, not valid utf16. It's e.g. valid latin-1, because all strings
> are, but if you view it in latin-1 it's full of unprintable characters
> and gobbledygook. This is a valid filename in UNIX, and you must give
> exactly this array of bytes in order to access the file (to open it,
> rename, delete, etc). How do you propose to "encode this in UTF-8"?
>
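
To make that case concrete, here is a small Python sketch. The sample bytes and the helper name `decodes_as` are made up for illustration; the only claim is about how Python's standard codecs behave.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` is a valid byte sequence in `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A hypothetical filename: no zero byte, no "/", so UNIX accepts it.
raw = b"\xff\x80\x81"

if __name__ == "__main__":
    for enc in ("utf-8", "utf-16", "latin-1"):
        print(enc, decodes_as(raw, enc))
    # latin-1 maps every byte to a code point, so decoding always succeeds
    # and round-trips, but the result is not meaningful text.
    text = raw.decode("latin-1")
    print(text.isprintable(), text.encode("latin-1") == raw)
```

Decoding as latin-1 succeeds only because latin-1 assigns a character to every byte value; it says nothing about what the name was meant to be.
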
> > I don't see any problem with filesystems whose filenames aren't
> > encoded in utf8.
> > All the pre-unicode character sets are part of unicode and all
> > characters of unicode can be represented in utf8.
> >
> >         That is a different issue, and should
> >         not make us limit our specifications to only work on a subset
> >         of the valid filenames.
> >
> >
> > That's true but currently I see the following problems:
> >  - the subset of valid filenames doesn't contain filenames with '\n'
> > or '\0' in them
>
> Oh, the string itself in the file is URI-style encoded, so it can
> contain any sort of bytes. (In fact, it can even contain encoded \0s in
> it I guess, but that is unlikely to work well, as you can't e.g. pass
> such a filename to the kernel since it thinks supplied filenames stop at
> the first zero.)
>
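
A minimal sketch of URI-style escaping of raw filename bytes, using Python's standard `urllib.parse`. The wrapper names `escape_name` and `unescape_name` are hypothetical, and the choice to leave "/" unescaped is an assumption of this example, not something stated in this thread.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_name(raw: bytes) -> str:
    """Percent-encode raw path bytes into a plain ASCII string.

    quote() accepts bytes directly, so no character set has to be guessed.
    """
    return quote(raw, safe="/")


def unescape_name(escaped: str) -> bytes:
    """Reverse escape_name(), returning the exact original bytes."""
    return unquote_to_bytes(escaped)


if __name__ == "__main__":
    raw = b"/home/u/\xff\xfe\nx"
    escaped = escape_name(raw)
    print(escaped)
    print(unescape_name(escaped) == raw)
```

An escaped zero byte (`%00`) also round-trips through these two functions, but the resulting bytes could not be handed to the kernel as a filename.
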
> >  - it isn't clear (probably only to me) which encoding should be used
> > for reading .trashinfo files.
>
> In general utf8 is the character set of the whole file, but the name
> key, when unescaped, is to be treated as an array of bytes, representing
> the actual filename, which may be of any or no encoding.
>
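
Reading a .trashinfo file along those lines might look like the sketch below. It assumes the key is named `Path` and the group header is `[Trash Info]`, as in my reading of the Trash specification; the function name `read_trashed_path` and the sample file contents are invented for the example.

```python
from urllib.parse import unquote_to_bytes


def read_trashed_path(trashinfo: bytes) -> bytes:
    """Return the original path stored in a .trashinfo file, as raw bytes.

    The file as a whole is decoded as UTF-8, but the unescaped value of
    the Path key is kept as bytes and never decoded.
    """
    text = trashinfo.decode("utf-8")
    for line in text.split("\n"):
        if line.startswith("Path="):
            return unquote_to_bytes(line[len("Path="):])
    raise ValueError("no Path key found")


if __name__ == "__main__":
    # Hypothetical file: the name ends in a latin-1 encoded e-acute (0xE9),
    # which is not valid UTF-8 on its own.
    sample = (
        b"[Trash Info]\n"
        b"Path=/home/u/caf%E9.txt\n"
        b"DeletionDate=2009-08-23T05:16:58\n"
    )
    print(read_trashed_path(sample))
```

The returned bytes can be passed unchanged to byte-oriented file APIs such as `os.rename`.
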
> >  - the use of character sets like latin1 for encoding .trashinfo file
> > contents could lead to a loss of information
>
> Since the filename is uri encoded we can store whatever bytes we want in
> the filename. However, once that is decoded we can't expect it to be in
> any encoding or character set, because that would as you say lead to a
> loss of information.
>

-- 
Andrea Francia
http://andreafrancia.blogspot.com/