(new) non-ASCII filenames break unit tests on Linux
Stephan Bergmann
stephan.bergmann at allotropia.de
Mon Dec 4 12:05:39 UTC 2023
On 12/4/23 12:10, Michael Stahl wrote:
> On 03/12/2023 12:59, Stephan Bergmann wrote:
>> For better or worse, the payload of LO "internal" file URLs is always
>> considered to be a UTF-8 encoding of the actual system pathname. It
>> is *not* a byte-for-byte representation of the bytes that make up the
>> Unix system pathname.
>>
>> What thus happens here is that the file UCP's TaskManager::getv ->
>> osl::DirectoryItem::get -> osl_getDirectoryItem ->
>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl ->
>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert ->
>> rtl_convertUnicodeToText tries to translate the Unicode chars of
>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() ==
>> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no
>> representation of the Cyrillic and Greek letters.
>
> in the "C" locale, every 8-bit value is valid, but only ASCII (<128)
> values are meaningful; the intent is that the application does not
> interpret file-names, but uses them as-is, and replacing characters with
> '?' (as apparently happens here) looks wrong to me.
>
> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
That's not the issue here. The issue is that "ASCII has no
representation of the Cyrillic and Greek letters", and the existing
RTL_TEXTENCODING_UTF8 would already do what you seek on that conversion
step, from the Unicode file URL payload to a byte-sequence pathname.
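
To make the two conversion steps concrete, here is a rough sketch in
Python rather than the actual sal/osl C++ code. It only mimics the idea:
the URL payload is percent-decoded as UTF-8 into Unicode, and the
Unicode string is then encoded with a "thread text encoding". The helper
names url_payload_to_pathname and try_convert are made up for
illustration, and Python's "ascii" and "utf-8" codecs merely stand in
for RTL_TEXTENCODING_ASCII_US and RTL_TEXTENCODING_UTF8; the real
converter's error handling may differ in detail.

```python
from urllib.parse import quote, unquote

NAME = "hybrid_writer_абв_αβγ.pdf"


def url_payload_to_pathname(payload: str, thread_encoding: str) -> bytes:
    """Hypothetical stand-in for the URL -> system pathname step.

    1. Percent-decode the payload, interpreting the bytes as UTF-8.
    2. Encode the resulting Unicode string with the thread encoding.
    Raises UnicodeEncodeError if a character has no representation.
    """
    unicode_name = unquote(payload, encoding="utf-8", errors="strict")
    return unicode_name.encode(thread_encoding)


def try_convert(payload: str, thread_encoding: str):
    """Return the pathname bytes, or None if the conversion fails."""
    try:
        return url_payload_to_pathname(payload, thread_encoding)
    except UnicodeEncodeError:
        return None


# The file URL payload is the percent-encoded UTF-8 form of the name.
payload = quote(NAME)

# ASCII has no representation of the Cyrillic and Greek letters.
ascii_result = try_convert(payload, "ascii")

# UTF-8 can represent every Unicode character, so this step succeeds.
utf8_result = try_convert(payload, "utf-8")

# A lossy converter that substitutes '?' destroys the original name.
lossy = NAME.encode("ascii", errors="replace")
```

With an ASCII thread encoding the conversion has nothing to map the
non-ASCII letters to, so it either fails or degrades to '?'
placeholders; with UTF-8 the pathname bytes round-trip back to the
original name.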