I reviewed the messages from Larsson and PCMag and I think I was wrong about what constitutes an identifier for a file in a filesystem. Thanks to you both for helping me to review my opinions. <div><br></div><div>I think there is also something that I still don't get; I'll write about my doubts after I have reviewed some more documentation.<br> <div><div><br></div><div><div><div><div class="gmail_quote">2009/8/22 Alexander Larsson <span dir="ltr"><<a href="mailto:alexl@redhat.com">alexl@redhat.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div><div></div><div class="h5">On Fri, 2009-08-21 at 21:23 +0200, Andrea Francia wrote:<br> ><br> ><br> > 2009/8/21 Alexander Larsson <<a href="mailto:alexl@redhat.com">alexl@redhat.com</a>><br> > On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:<br> > > 2009/8/5 David Faure <<a href="mailto:faure@kde.org">faure@kde.org</a>><br> > > In practice I would recommend using utf8 everywhere<br> > and<br> > > getting rid<br> > > of the whole "filesystem encoding" mess in the first<br> > place.<br> > ><br> > ><br> > > Who is interested in working on a new draft (a draft) of the spec<br> > which<br> > > solves this and the other problems that have emerged?<br> ><br> ><br> > This is not a "problem" that should be "solved". It was very<br> > deliberately added to the spec in order to allow all files to<br> > be<br> > trashed. How would you trash a file named some non-utf8 string<br> > if only<br> > utf8 is allowed in the format?<br> ><br> > Filenames on Linux are zero terminated arrays of bytes. If you<br> > treat them<br> > like anything else you will just fail in some corner cases.<br> ><br> ><br> > For me filenames are a list of unicode characters. 
The way those<br> > filenames are represented using an array of bytes is a different issue.<br> > As far as I know, on the filesystem it is possible to create a filename with the<br> > zero character ('\0') or the newline ('\n') in it.<br> <br> </div></div>I don't know what operating system you are running, but I'm running<br> UNIX, and the unix APIs define filenames as a list of bytes, terminated<br> by a zero byte, where only byte 47 ("/" in ASCII) is treated specially.<br> What you believe filenames are does not really affect things. On the<br> filesystem a file has a name consisting of an array of bytes, and only<br> if you can give this exact array of bytes can you open the file. There<br> is no implicit encoding or character set; there are only bytes.<br> <br> A filename may be a string of bytes that are not valid in any existing<br> encoding of any existing character set. You still need to be able to<br> represent this in the trashinfo file. And even if the filename *is* in<br> some valid encoding you don't know which one it is. The various locale<br> settings can give you a hint about what it may be, and should be used<br> when you *create* a new filename from a known unicode string. However,<br> what's on the disk is what's on the disk and has no strict connection to<br> unicode.<br> <div class="im"><br> > Of course, we should all move towards all filenames being in<br> > UTF8, avoid<br> > creating non-UTF8 filenames, etc.<br> ><br> ><br> > This sounds strange to me, UTF-8 is about encoding not about character<br> > set.<br> > Maybe there is a little misunderstanding about utf8, unicode and<br> > encoding systems.<br> <br> </div>I'm not confused about this.<br> <div class="im"><br> > It seems to me that you are using the term utf-8 as a character set.<br> ><br> ><br> > I see two different aspects:<br> > 1) which character set should the trash system be able to handle?<br> <br> </div>The trash system is not about handling any character set at all. 
It is<br> about storing the identifier used on the operating system to access the<br> file. Anything else and there are valid files that the trash system<br> wouldn't handle.<br> <div class="im"><br> > 2) how should the trash system handle it?<br> ><br> > I think that the trash system should be able to manage filenames and<br> > paths expressed in unicode.<br> > One way to encode unicode characters is UTF-8, but there are also UTF-16,<br> > and others.<br> <br> </div>So, say you have a filename that is basically a random set of bytes, not<br> valid utf8, not valid utf16. It's e.g. valid latin-1, because all strings<br> are, but if you view it in latin-1 it's full of unprintable characters<br> and gobbledygook. This is a valid filename in UNIX, and you must give<br> exactly this array of bytes in order to access the file (to open it,<br> rename, delete, etc). How do you propose to "encode this in UTF-8"?<br> <div class="im"><br> > I don't see any problem with filesystems whose filenames aren't encoded<br> > in utf8.<br> > All the pre-unicode character sets are part of unicode and all<br> > characters of unicode can be represented in utf8.<br> ><br> > That is a different issue, and should<br> > not make us limit our specifications to only work on a subset<br> > of the<br> > valid filenames.<br> ><br> ><br> > That's true but currently I see the following problems:<br> > - the subset of valid filenames doesn't contain filenames with '\n'<br> > or '\0' in them<br> <br> </div>Oh, the string itself in the file is URI-style encoded, so it can<br> contain any sort of bytes. (In fact, it can even contain encoded \0s in<br> it I guess, but that is unlikely to work well, as you can't e.g. 
pass<br> such a filename to the kernel since it thinks supplied filenames stop at<br> the first zero.)<br> <div class="im"><br> > - it isn't clear (probably only to me) which encoding should be used<br> > for reading .trashinfo files.<br> <br> </div>In general utf8 is the character set of the whole file, but the name<br> key, when unescaped, is to be treated as an array of bytes, representing<br> the actual filename, which may be of any or no encoding.<br> <div class="im"><br> > - the use of character sets like latin1 for encoding .trashinfo files'<br> > contents could lead to a loss of information<br> <br> </div>Since the filename is uri encoded we can store whatever bytes we want in<br> the filename. However, once that is decoded we can't expect it to be in<br> any encoding or character set, because that would, as you say, lead to a<br> loss of information.<br> <br> <br> <br> </blockquote></div><br><br clear="all"><br>-- <br>Andrea Francia<br><a href="http://andreafrancia.blogspot.com/">http://andreafrancia.blogspot.com/</a><br> </div></div></div></div></div>
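<div><br></div><div>To make the URI-style encoding discussed in the quoted message concrete, here is a minimal Python sketch. It percent-encodes the raw bytes of a filename into plain ASCII and decodes them back unchanged. The helper names and the sample path are illustrative assumptions only; they are not taken from the trash spec or from any implementation.</div>

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_trash_path(raw_path: bytes) -> str:
    """Percent-encode raw filename bytes (hypothetical helper name)."""
    # "/" is kept literal; every other byte outside the unreserved
    # ASCII set becomes %XX, so the result is one line of plain ASCII.
    return quote_from_bytes(raw_path, safe="/")


def decode_trash_path(value: str) -> bytes:
    """Recover the exact bytes to hand back to the operating system."""
    return unquote_to_bytes(value)


# A made-up name: a latin-1 byte (0xE9), a newline, and 0xFF.
raw = b"/home/user/caf\xe9\nnotes\xff.txt"

# The name is not valid UTF-8, so it cannot be stored as Unicode text.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

encoded = encode_trash_path(raw)
print(encoded)

# The bytes survive the round trip untouched.
assert decode_trash_path(encoded) == raw
```

<div>The round trip never interprets the bytes as characters, which is why a name that is not valid UTF-8, or that contains a newline, comes back intact.</div>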