(new) non-ASCII filenames break unit tests on Linux
Mike Kaganski
mikekaganski at hotmail.com
Fri Dec 8 10:33:11 UTC 2023
On 08.12.2023 13:30, Michael Stahl wrote:
> On 04/12/2023 13:05, Stephan Bergmann wrote:
>> On 12/4/23 12:10, Michael Stahl wrote:
>>> On 03/12/2023 12:59, Stephan Bergmann wrote:
>>>> For better or worse, the payload of LO "internal" file URLs is
>>>> always considered to be a UTF-8 encoding of the actual system
>>>> pathname. It is *not* a byte-for-byte representation of the bytes
>>>> that make up the Unix system pathname.
>>>>
>>>> What thus happens here is that the file UCP's TaskManager::getv ->
>>>> osl::DirectoryItem::get -> osl_getDirectoryItem ->
>>>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl ->
>>>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert ->
>>>> rtl_convertUnicodeToText tries to translate the Unicode chars of
>>>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() ==
>>>> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has
>>>> no representation of the Cyrillic and Greek letters.
>>>
>>> in the "C" locale, every 8-bit value is valid, but only ASCII (<128)
>>> values are meaningful; the intent is that the application does not
>>> interpret file-names, but uses them as-is, and replacing characters
>>> with '?' (as apparently happens here) looks wrong to me.
>>>
>>> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
>>
>> That's not the issue here (the issue is that "ASCII has no
>> representation of the Cyrillic and Greek letters"), and the existing
>> RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step
>> from a Unicode file URL payload to a byte sequence pathname.
>
> it cannot be converted to or interpreted as RTL_TEXTENCODING_UTF8 or
> anything else because the meaning of non-ASCII characters in "C" locale
> is unspecified.
>
> ... considering that LO uses UTF-16 strings for everything including
> file paths, perhaps the best thing would be to add a check for the "C"
> locale on startup, print an error and abort.
Note that the original issue discussed here did not involve the "C"
locale, where the problem would be expected, nor any other non-Unicode
locale. As Rene said, the locale was UTF-8, and the system itself handled
the files OK.
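
For reference, the conversion failure Stephan describes above can be
reproduced outside LibreOffice. This is only a rough sketch: Python's
built-in codecs stand in for rtl_convertUnicodeToText, and the helper
to_pathname_bytes is a hypothetical name, not anything in the LO code base.
It shows that the file name has no ASCII encoding, that a lossy conversion
yields '?' placeholders, and that UTF-8 round-trips cleanly.

```python
# Illustration only: Python codecs stand in for rtl_convertUnicodeToText.
# to_pathname_bytes is a hypothetical helper, not LibreOffice API.
name = "hybrid_writer_абв_αβγ.pdf"


def to_pathname_bytes(text, encoding):
    """Return text encoded as bytes, or None if the encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Strict ASCII conversion fails: no Cyrillic or Greek letters in ASCII.
ascii_result = to_pathname_bytes(name, "ascii")

# UTF-8 can represent every character, so the conversion succeeds.
utf8_result = to_pathname_bytes(name, "utf-8")

# A lossy conversion substitutes '?' for each unrepresentable character.
lossy = name.encode("ascii", errors="replace")

print(ascii_result)
print(utf8_result)
print(lossy)
```

If the thread text encoding really were UTF-8, as it should be in a UTF-8
locale, the second case is the one that would apply, so the ASCII result
seen in the tests suggests the encoding was determined incorrectly.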
--
Best regards,
Mike Kaganski