(new) non-ASCII filenames break unit tests on Linux
Michael Stahl
mst at libreoffice.org
Fri Dec 8 10:30:47 UTC 2023
On 04/12/2023 13:05, Stephan Bergmann wrote:
> On 12/4/23 12:10, Michael Stahl wrote:
>> On 03/12/2023 12:59, Stephan Bergmann wrote:
>>> For better or worse, the payload of LO "internal" file URLs is always
>>> considered to be a UTF-8 encoding of the actual system pathname. It
>>> is *not* a byte-for-byte representation of the bytes that make up the
>>> Unix system pathname.
>>>
>>> What thus happens here is that the file UCP's TaskManager::getv ->
>>> osl::DirectoryItem::get -> osl_getDirectoryItem ->
>>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl ->
>>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert ->
>>> rtl_convertUnicodeToText tries to translate the Unicode chars of
>>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() ==
>>> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has
>>> no representation of the Cyrillic and Greek letters.
>>
>> in the "C" locale, every 8-bit value is valid, but only ASCII (<128)
>> values are meaningful; the intent is that the application does not
>> interpret file-names, but uses them as-is, and replacing characters
>> with '?' (as apparently happens here) looks wrong to me.
>>
>> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
>
> That's not the issue here (the issue is that "ASCII has no
> representation of the Cyrillic and Greek letters"), and the existing
> RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step
> from a Unicode file URL payload to a byte sequence pathname.

the pathname cannot be converted to or interpreted as
RTL_TEXTENCODING_UTF8 or anything else because the meaning of non-ASCII
characters in the "C" locale is unspecified.
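
to make the two conversions under discussion concrete, here is a minimal standalone sketch. it is not LibreOffice code and does not use rtl_convertUnicodeToText; toAsciiLossy and toUtf8 are hypothetical helpers written only for illustration. the first mimics a Unicode-to-ASCII step that substitutes '?' for unrepresentable characters, the second encodes the same code points as UTF-8 bytes (surrogates and out-of-range values are not validated here).

```cpp
#include <string>

// Hypothetical helper (not LibreOffice API): convert Unicode code points to
// 7-bit ASCII, substituting '?' for every character ASCII cannot represent.
std::string toAsciiLossy(const std::u32string& text)
{
    std::string out;
    for (char32_t c : text)
        out += (c < 0x80) ? static_cast<char>(c) : '?';
    return out;
}

// Hypothetical helper (not LibreOffice API): encode Unicode code points as
// UTF-8 bytes. No validation of surrogates or values above U+10FFFF.
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t c : text)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else if (c < 0x800)
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

with the file name from the failing test, toAsciiLossy loses the Cyrillic and Greek letters, so the resulting pathname no longer names the file on disk; toUtf8 keeps them, but only by assuming an encoding that the "C" locale itself does not specify.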
... considering that LO uses UTF-16 strings for everything including
file paths, perhaps the best thing would be to add a check for the "C"
locale on startup, print an error and abort.
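
a rough sketch of what such a startup check could look like, using only the standard C library. the function names isCLocaleName and abortIfCLocale are made up for this example and nothing like them exists in LO as far as this mail is concerned; where such a check would actually be called from is left open.

```cpp
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true if the given locale name denotes the plain
// "C" (also known as "POSIX") locale. A null name is not treated as "C" here.
bool isCLocaleName(const char* name)
{
    return name != nullptr
        && (std::strcmp(name, "C") == 0 || std::strcmp(name, "POSIX") == 0);
}

// Hypothetical startup check: adopt the locale from the environment and
// refuse to run if the result is the "C" locale.
void abortIfCLocale()
{
    // setlocale returns a null pointer if the environment's locale cannot be
    // honored; the program then stays in the default "C" locale.
    const char* name = std::setlocale(LC_CTYPE, "");
    if (name == nullptr || isCLocaleName(name))
    {
        std::fputs("error: the \"C\" locale does not specify an encoding for "
                   "non-ASCII file names; please set a UTF-8 locale\n",
                   stderr);
        std::exit(EXIT_FAILURE);
    }
}
```

note that this deliberately compares the name exactly, so a locale such as "C.UTF-8" (where available) would pass, since it does specify an encoding; whether that is the desired behaviour is an assumption of this sketch.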