(new) non-ASCII filenames break unit tests on Linux

Mon Dec 4 12:05:39 UTC 2023

On 12/4/23 12:10, Michael Stahl wrote:
> On 03/12/2023 12:59, Stephan Bergmann wrote:
>> For better or worse, the payload of LO "internal" file URLs is always 
>> considered to be a UTF-8 encoding of the actual system pathname.  It 
>> is *not* a byte-for-byte representation of the bytes that make up the 
>> Unix system pathname.
>>
>> What thus happens here is that the file UCP's TaskManager::getv -> 
>> osl::DirectoryItem::get -> osl_getDirectoryItem -> 
>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> 
>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> 
>> rtl_convertUnicodeToText tries to translate the Unicode chars of 
>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == 
>> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no 
>> representation of the Cyrillic and Greek letters.
> 
> in the "C" locale, every 8-bit value is valid, but only ASCII (<128) 
> values are meaningful; the intent is that the application does not 
> interpret file-names, but uses them as-is, and replacing characters with 
> '?' (as apparently happens here) looks wrong to me.
> 
> probably there isn't yet an RTL_TEXTENCODING_C that behaves like this.

That's not the issue here (the issue is that "ASCII has no 
representation of the Cyrillic and Greek letters"), and the existing 
RTL_TEXTENCODING_UTF8 would do what you seek for that conversion step 
from a Unicode file URL payload to a byte-sequence pathname.
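
As a rough illustration of that conversion step, here is a small Python analogy. It is not the rtl_convertUnicodeToText code path: the helper name to_pathname_bytes is made up for this sketch, and Python's "replace" error handler only stands in for the '?' substitution that apparently happens in LO. It encodes the filename from the failing test once with ASCII and once with UTF-8 as the target encoding.

```python
# Filename from the thread; everything else here is a hypothetical sketch.
NAME = "hybrid_writer_абв_αβγ.pdf"


def to_pathname_bytes(name: str, encoding: str) -> bytes:
    """Encode a Unicode filename into pathname bytes.

    Characters the target encoding cannot represent are replaced with '?',
    standing in for the lossy behaviour described in the thread.
    """
    return name.encode(encoding, errors="replace")


# ASCII target: the Cyrillic and Greek letters have no representation.
ascii_bytes = to_pathname_bytes(NAME, "ascii")

# UTF-8 target: every Unicode character maps to a byte sequence.
utf8_bytes = to_pathname_bytes(NAME, "utf-8")

print(ascii_bytes)                         # b'hybrid_writer_???_???.pdf'
print(utf8_bytes.decode("utf-8") == NAME)  # True
```

With the ASCII target the six non-ASCII letters are lost, so the resulting pathname no longer names the file on disk. With the UTF-8 target the bytes round-trip back to the original name.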