(new) non-ASCII filenames break unit tests on Linux

Michael Stahl mst at libreoffice.org
Fri Dec 8 10:30:47 UTC 2023


On 04/12/2023 13:05, Stephan Bergmann wrote:
> On 12/4/23 12:10, Michael Stahl wrote:
>> On 03/12/2023 12:59, Stephan Bergmann wrote:
>>> For better or worse, the payload of LO "internal" file URLs is always 
>>> considered to be a UTF-8 encoding of the actual system pathname.  It 
>>> is *not* a byte-for-byte representation of the bytes that make up the 
>>> Unix system pathname.
>>>
>>> What thus happens here is that the file UCP's TaskManager::getv -> 
>>> osl::DirectoryItem::get -> osl_getDirectoryItem -> 
>>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> 
>>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> 
>>> rtl_convertUnicodeToText tries to translate the Unicode chars of 
>>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == 
>>> RTL_TEXTENCODING_ASCII_US, but which doesn't work because ASCII has 
>>> no representation of the Cyrillic and Greek letters.
>>
>> in the "C" locale, every 8-bit value is valid, but only ASCII (<128) 
>> values are meaningful; the intent is that the application does not 
>> interpret file-names, but uses them as-is, and replacing characters 
>> with '?' (as apparently happens here) looks wrong to me.
>>
>> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
> 
> That's not the issue here (the issue is that "ASCII has no 
> representation of the Cyrillic and Greek letters"), and the existing 
> RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step 
> from a Unicode file URL payload to a byte sequence pathname.
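
A minimal sketch of the failing step described in the quoted text above, assuming a plain 7-bit ASCII target. The helper `toAsciiPathname` is hypothetical and only stands in for the conversion that rtl_convertUnicodeToText performs; it is not the actual LibreOffice code.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the Unicode -> ASCII conversion step.
// Returns the byte string if every UTF-16 code unit fits into 7-bit
// ASCII, and no value otherwise.
std::optional<std::string> toAsciiPathname(const std::u16string& payload)
{
    std::string bytes;
    for (char16_t unit : payload)
    {
        if (unit > 0x7F)
        {
            // ASCII has no representation for this character.
            return std::nullopt;
        }
        bytes.push_back(static_cast<char>(unit));
    }
    return bytes;
}
```

With this sketch, a name like "hybrid_writer.pdf" converts cleanly, while one containing Cyrillic or Greek letters yields no result at all.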

the pathname cannot be converted to or interpreted as RTL_TEXTENCODING_UTF8 
or anything else, because the meaning of non-ASCII characters in the "C" 
locale is unspecified.

... considering that LO uses UTF-16 strings for everything including 
file paths, perhaps the best thing would be to add a check for the "C" 
locale on startup, print an error and abort.
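
A rough sketch of what such a startup check might look like, using only the standard setlocale() call. The function names `isPlainCLocale` and `abortIfCLocale` are hypothetical, the error text is made up, and treating a failed setlocale() as "C" is an assumption of this sketch.

```cpp
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not LibreOffice code): true if a locale name as
// returned by setlocale() denotes the plain "C"/"POSIX" locale.
bool isPlainCLocale(const char* name)
{
    // A null result means setlocale() failed, in which case the process
    // stays in its default "C" locale.
    return name == nullptr
        || std::strcmp(name, "C") == 0
        || std::strcmp(name, "POSIX") == 0;
}

// Hypothetical startup check: adopt the environment's LC_CTYPE setting
// and exit with an error if that leaves the process in the "C" locale.
void abortIfCLocale()
{
    const char* name = std::setlocale(LC_CTYPE, "");
    if (isPlainCLocale(name))
    {
        std::fputs("error: the \"C\" locale is not supported\n", stderr);
        std::exit(EXIT_FAILURE);
    }
}
```

abortIfCLocale() would be called once, early in main, before any file path is handled. Names such as "C.UTF-8" pass this check, since they specify an encoding.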

