(new) non-ASCII filenames break unit tests on Linux

Mike Kaganski mikekaganski at hotmail.com
Fri Dec 8 10:33:11 UTC 2023


On 08.12.2023 13:30, Michael Stahl wrote:
> On 04/12/2023 13:05, Stephan Bergmann wrote:
>> On 12/4/23 12:10, Michael Stahl wrote:
>>> On 03/12/2023 12:59, Stephan Bergmann wrote:
>>>> For better or worse, the payload of LO "internal" file URLs is 
>>>> always considered to be a UTF-8 encoding of the actual system 
>>>> pathname.  It is *not* a byte-for-byte representation of the bytes 
>>>> that make up the Unix system pathname.
>>>>
>>>> What thus happens here is that the file UCP's TaskManager::getv -> 
>>>> osl::DirectoryItem::get -> osl_getDirectoryItem -> 
>>>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> 
>>>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> 
>>>> rtl_convertUnicodeToText tries to translate the Unicode chars of 
>>>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == 
>>>> RTL_TEXTENCODING_ASCII_US, but which doesn't work because ASCII has 
>>>> no representation of the Cyrillic and Greek letters.
>>>
>>> in the "C" locale, every 8-bit value is valid, but only ASCII (<128) 
>>> values are meaningful; the intent is that the application does not 
>>> interpret file-names, but uses them as-is, and replacing characters 
>>> with '?' (as apparently happens here) looks wrong to me.
>>>
>>> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
>>
>> That's not the issue here (the issue is that "ASCII has no 
>> representation of the Cyrillic and Greek letters"), and the existing 
>> RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step 
>> from a Unicode file URL payload to a byte sequence pathname.
> 
> it cannot be converted to or interpreted as RTL_TEXTENCODING_UTF8 or 
> anything else because the meaning of non-ASCII characters in "C" locale 
> is unspecified.
> 
> ... considering that LO uses UTF-16 strings for everything including 
> file paths, perhaps the best thing would be to add a check for the "C" 
> locale on startup, print an error and abort.

Note that the original issue discussed here is not about the "C" locale, 
where the problem would be expected, nor about any non-Unicode locale. 
As Rene said, the locale was UTF-8, and the system handled the files OK.
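
For illustration, here is a minimal Python sketch of the conversion step 
Stephan described. It is not LibreOffice code: Python's built-in codecs 
stand in for rtl_convertUnicodeToText, and the helper names 
to_pathname_bytes and round_trips are hypothetical. It only shows why the 
same file name is lost when the target encoding is ASCII but survives 
when it is UTF-8.

```python
# Illustrative only: Python codecs stand in for LibreOffice's
# rtl_convertUnicodeToText. The helper names below are hypothetical.

NAME = "hybrid_writer_абв_αβγ.pdf"


def to_pathname_bytes(name: str, encoding: str) -> bytes:
    """Convert a Unicode file name to pathname bytes in the given encoding.

    Characters the encoding cannot represent are replaced with '?',
    mimicking the lossy result mentioned in the thread.
    """
    return name.encode(encoding, errors="replace")


def round_trips(name: str, encoding: str) -> bool:
    """Return True if the name survives encoding and decoding unchanged."""
    return to_pathname_bytes(name, encoding).decode(encoding) == name


if __name__ == "__main__":
    for enc in ("ascii", "utf-8"):
        print(enc, to_pathname_bytes(NAME, enc), round_trips(NAME, enc))
```

With "ascii" the six Cyrillic and Greek letters each become '?', so the 
resulting pathname no longer names the file on disk; with "utf-8" the 
name round-trips unchanged.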

-- 
Best regards,
Mike Kaganski


