I reviewed the messages from Larsson and PCMag, and I think I was wrong about what constitutes an identifier for a file in a filesystem. Thanks to you both for helping me review my opinions. <div><br></div><div>There are still some things I don't understand; I'll write about my doubts after I have reviewed some more documentation.<br>
<div><div><br></div><div><div><div><div class="gmail_quote">2009/8/22 Alexander Larsson <span dir="ltr"><<a href="mailto:alexl@redhat.com">alexl@redhat.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div><div></div><div class="h5">On Fri, 2009-08-21 at 21:23 +0200, Andrea Francia wrote:<br>
><br>
><br>
> 2009/8/21 Alexander Larsson <<a href="mailto:alexl@redhat.com">alexl@redhat.com</a>><br>
> On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:<br>
> > 2009/8/5 David Faure <<a href="mailto:faure@kde.org">faure@kde.org</a>><br>
> > In practice I would recommend using utf8 everywhere<br>
> and<br>
> > getting rid<br>
> > of the whole "filesystem encoding" mess in the first<br>
> place.<br>
> ><br>
> ><br>
> > Who is interested in working on a new draft (a draft) of the spec<br>
> which<br>
> > solves this and the other problems that have emerged?<br>
><br>
><br>
> This is not a "problem" that should be "solved". It was very<br>
> deliberately added to the spec in order to allow all files to<br>
> be<br>
> trashed. How would you trash a file named some non-utf8 string<br>
> if only<br>
> utf8 is allowed in the format?<br>
><br>
> Filenames on linux are zero terminated arrays of bytes. If you<br>
> treat it<br>
> like anything else you will just fail in some corner cases.<br>
><br>
><br>
> For me filenames are a list of unicode characters. The way those<br>
> filenames are represented using array of bytes is a different issue.<br>
> As far as I know, on the filesystem it is possible to create a filename with the<br>
> zero character '\0' or the newline ('\n') in it.<br>
<br>
</div></div>I don't know what operating system you are running, but I'm running<br>
UNIX, and the unix APIs define filenames as a list of bytes, terminated<br>
by a zero byte, where only byte 47 ("/" in ascii) is treated specially.<br>
What you believe filenames are does not really affect things. On the<br>
filesystem a file has a name consisting of an array of bytes, and only<br>
if you can give this exact array of bytes can you open the file. There<br>
is no implicit encoding or character set; there are only bytes.<br>
<br>
A filename may be a string of bytes that are not valid in any existing<br>
encoding of any existing character set. You still need to be able to<br>
represent this in the trashinfo file. And even if the filename *is* in<br>
some valid encoding you don't know which one it is. The various locale<br>
settings can give you a hint about what it may be, and should be used<br>
when you *create* a new filename from a known unicode string. However,<br>
what's on the disk is what's on the disk and has no strict connection to<br>
unicode.<br>
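<br>
A minimal sketch of the point above, assuming a Linux system whose filesystem accepts arbitrary bytes in names (some other systems reject names that are not valid UTF-8). The name RAW_NAME and both helper functions are hypothetical and only serve as an illustration.<br>
<pre>
```python
import os
import tempfile

# Hypothetical filename: these bytes are not valid UTF-8.
RAW_NAME = b"caf\xe9-\xff.txt"


def create_and_list(raw_name):
    """Create a file named by raw bytes and return the directory listing as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, raw_name)
        with open(path, "wb") as handle:
            handle.write(b"data")
        # Passing a bytes path makes os.listdir return the names as bytes.
        return os.listdir(tmp_bytes)


def round_trip(raw_name):
    """Decode to str and back using the filesystem encoding and error handler."""
    return os.fsencode(os.fsdecode(raw_name))


if __name__ == "__main__":
    print(create_and_list(RAW_NAME))
    print(round_trip(RAW_NAME) == RAW_NAME)
```
</pre>
The listing hands back exactly the bytes that were used to create the file, independent of any locale. On POSIX, os.fsdecode and os.fsencode are designed to round-trip such bytes through str, but the resulting str is not meaningful unicode text.<br>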
<div class="im"><br>
> Of course, we should all move towards all filenames being in<br>
> UTF8, avoid<br>
> creating non-UTF8 filenames, etc.<br>
><br>
><br>
> This sounds strange to me; UTF-8 is about encoding, not about character<br>
> set.<br>
> Maybe there is a little misunderstanding about utf8, unicode and<br>
> encoding systems.<br>
<br>
</div>I'm not confused about this.<br>
<div class="im"><br>
> It seems to me that you are using the term utf-8 as a character set.<br>
><br>
><br>
> I see two different aspects:<br>
> 1) which character set should the trash system be able to handle?<br>
<br>
</div>The trash system is not about handling any character set at all. It is<br>
about storing the identifier used on the operating system to access the<br>
file. Anything else and there are valid files that the trash system<br>
wouldn't handle.<br>
<div class="im"><br>
> 2) how should the trash system handle it?<br>
><br>
> I think that the trash system should be able to manage filenames and<br>
> paths expressed in unicode.<br>
> One way to encode unicode characters is UTF-8, but there are also UTF-16<br>
> and others.<br>
<br>
</div>So, say you have a filename that is basically a random set of bytes, not<br>
valid utf8, not valid utf16. It's e.g. valid latin-1, because all strings<br>
are, but if you view it in latin-1 it's full of unprintable characters<br>
and gobbledygook. This is a valid filename in UNIX, and you must give<br>
exactly this array of bytes in order to access the file (to open it,<br>
rename, delete, etc). How do you propose to "encode this in UTF-8"?<br>
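<br>
To make the question concrete, here is a small sketch with a hypothetical byte string RAW (not taken from any real file). It only illustrates that such bytes are rejected by a UTF-8 decoder while latin-1 accepts any byte sequence.<br>
<pre>
```python
# Hypothetical "random" filename bytes, chosen only for illustration.
RAW = b"\x80\xff\x01\x9f"


def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_view(data):
    """Decode as latin-1, which maps every byte value to one code point."""
    return data.decode("latin-1")


if __name__ == "__main__":
    print(is_valid_utf8(RAW))
    text = latin1_view(RAW)
    # The latin-1 view contains control characters, so it is not printable.
    print(text.isprintable())
    # Encoding the latin-1 view again yields the original bytes.
    print(text.encode("latin-1") == RAW)
```
</pre>
The latin-1 view is lossless, but it says nothing about what the bytes were meant to represent, so it cannot be used to pick a "correct" unicode form of the name.<br>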
<div class="im"><br>
> I don't see any problem with filesystems whose filenames aren't encoded<br>
> in utf8.<br>
> All the pre-unicode character sets are part of unicode and all<br>
> characters of unicode can be represented in utf8.<br>
><br>
> That is a different issue, and should<br>
> not make us limit our specifications to only work on a subset<br>
> of the<br>
> valid filenames.<br>
><br>
><br>
> That's true but currently I see the following problems:<br>
> - the subset of valid filenames doesn't contain filenames with '\n'<br>
> or '\0' in them<br>
<br>
</div>Oh, the string itself in the file is URI-style encoded, so it can<br>
contain any sort of bytes. (In fact, it can even contain encoded \0s in<br>
it, I guess, but that is unlikely to work well, as you can't e.g. pass<br>
such a filename to the kernel since it thinks supplied filenames stop at<br>
the first zero.)<br>
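<br>
A rough sketch of the zero-byte case, using Python's standard library as a stand-in for the C APIs. The helper os_accepts is hypothetical. Python itself refuses to pass a name with an embedded zero byte to the operating system, which mirrors the limitation described above.<br>
<pre>
```python
import os
from urllib.parse import unquote_to_bytes


def os_accepts(name):
    """Return True if the OS-level API accepts this byte string as a path."""
    try:
        os.lstat(name)
    except ValueError:
        # Python refuses to hand a name with an embedded NUL to the C API.
        return False
    except OSError:
        # The name was accepted; the file simply may not exist.
        return True
    return True


# A percent-encoded zero byte decodes without complaint...
decoded = unquote_to_bytes("a%00b")

if __name__ == "__main__":
    print(decoded)
    # ...but the decoded name cannot be used to address a file.
    print(os_accepts(decoded))
```
</pre>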
<div class="im"><br>
> - it isn't clear (probably only to me) which encoding should be used<br>
> for reading .trashinfo files.<br>
<br>
</div>In general utf8 is the character set of the whole file, but the name<br>
key, when unescaped, is to be treated as an array of bytes, representing<br>
the actual filename, which may be of any or no encoding.<br>
<div class="im"><br>
> - the use of a character set like latin1 for encoding .trashinfo file<br>
> contents could lead to a loss of information<br>
<br>
</div>Since the filename is uri encoded we can store whatever bytes we want in<br>
the filename. However, once that is decoded we can't expect it to be in<br>
any encoding or character set, because that would as you say lead to a<br>
loss of information.<br>
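<br>
As an illustration of URI-style escaping of raw bytes, here is a sketch using Python's urllib.parse. The helper names and the sample path are hypothetical, and the exact set of characters a given trash implementation leaves unescaped may differ from Python's defaults.<br>
<pre>
```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_name(raw_path):
    """Percent-encode raw filename bytes into a plain ASCII string."""
    # By default "/" is left unescaped; other unsafe bytes become %XX.
    return quote_from_bytes(raw_path)


def unescape_name(escaped):
    """Recover the exact original bytes from the percent-encoded string."""
    return unquote_to_bytes(escaped)


if __name__ == "__main__":
    # Hypothetical path: latin-1 byte, space, invalid UTF-8 byte, newline.
    raw = b"dir/caf\xe9 \xff\n.txt"
    escaped = escape_name(raw)
    print(escaped)
    print(unescape_name(escaped) == raw)
```
</pre>
The escaped form is pure ASCII, so it can sit in a utf8 text file without any conflict, while the unescaped result is again just bytes with no encoding attached.<br>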
<br>
<br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>Andrea Francia<br><a href="http://andreafrancia.blogspot.com/">http://andreafrancia.blogspot.com/</a><br>
</div></div></div></div></div>