(new) non-ASCII filenames break unit tests on Linux

Michael Stahl mst at libreoffice.org
Mon Dec 4 11:10:12 UTC 2023


On 03/12/2023 12:59, Stephan Bergmann wrote:
> On 12/2/23 16:38, Mike Kaganski wrote:
>> On 02.12.2023 17:46, Rene Engelhard wrote:
>>> In any case this is bad. My filesystem (I think from 2020 or so) 
>>> apparently shows it (ls -l does) but I wouldn't be sure for other, 
>>> old ones (like Debian's build machines). The locale this fails under 
>>> definitely is UTF-8 though.
> 
> Before 
> <https://git.libreoffice.org/core/+/fbf025b4903bfcb93c3d4bbf1ebbf860cf11618d%5E%21> "Make testHybridPDFFile Windows-only, and filenames in repo ASCII-only", I can reproduce the failure on Linux when not using a UTF-8 locale but explicitly specifying, e.g., an ASCII locale (and thus an osl_getThreadTextEncoding value of RTL_TEXTENCODING_ASCII_US) with `LC_CTYPE=C make -O CppunitTest_filter_textfilterdetect CPPUNIT_TEST_NAME=testHybridPDFFile::TestBody`.
> 
>> But if someone has an idea why LibreOffice fails handling files that 
>> exist on system, with names representable in system encoding, it would 
>> be nice.
> 
> For better or worse, the payload of LO "internal" file URLs is always 
> considered to be a UTF-8 encoding of the actual system pathname.  It is 
> *not* a byte-for-byte representation of the bytes that make up the Unix 
> system pathname.
> 
> What thus happens here is that the file UCP's TaskManager::getv -> 
> osl::DirectoryItem::get -> osl_getDirectoryItem -> 
> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> 
> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> 
> rtl_convertUnicodeToText tries to translate the Unicode chars of 
> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == 
> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no 
> representation of the Cyrillic and Greek letters.

In the "C" locale, every 8-bit value is valid, but only the ASCII values 
(< 128) have a defined meaning. The intent is that the application does 
not interpret file names but uses them as-is, so replacing characters 
with '?' (as apparently happens here) looks wrong to me.
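
To illustrate why such a substitution breaks the lookup, here is a minimal sketch in plain C++. It does not use the sal/rtl converters; toAsciiLossy is a hypothetical helper, and the '?' substitution is assumed from the behaviour described above, not taken from the LibreOffice sources.

```cpp
#include <string>

// Hypothetical helper, not LibreOffice code: converts Unicode code points
// to 7-bit ASCII and substitutes '?' for everything outside that range.
std::string toAsciiLossy(const std::u32string& name)
{
    std::string out;
    for (char32_t c : name)
    {
        // Code points below 0x80 are ASCII and map to themselves.
        out.push_back(c < 0x80 ? static_cast<char>(c) : '?');
    }
    return out;
}
```

With a conversion like this, a name containing Cyrillic or Greek letters turns into a different name made of '?' characters, so the following lookup on disk asks for a file that does not exist.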

Probably there isn't yet an RTL_TEXTENCODING_C that behaves like this.
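
One possible shape for a byte-transparent mapping, purely as an assumption and not something that exists in sal, is the escaping scheme Python uses for file names (PEP 383): each byte >= 0x80 is mapped to a lone surrogate in U+DC80..U+DCFF and mapped back unchanged. The function names below are hypothetical.

```cpp
#include <string>

// Hypothetical sketch: pathname bytes -> UTF-16 code units.
// ASCII bytes map to themselves; bytes 0x80..0xFF map to U+DC80..U+DCFF.
std::u16string bytesToUnits(const std::string& path)
{
    std::u16string out;
    for (unsigned char b : path)
        out.push_back(b < 0x80 ? static_cast<char16_t>(b)
                               : static_cast<char16_t>(0xDC00 + b));
    return out;
}

// Hypothetical inverse: UTF-16 code units -> pathname bytes.
// Anything that is neither ASCII nor an escaped byte becomes '?'.
std::string unitsToBytes(const std::u16string& s)
{
    std::string out;
    for (char16_t u : s)
    {
        if (u < 0x80)
            out.push_back(static_cast<char>(u));
        else if (u >= 0xDC80 && u <= 0xDCFF)
            out.push_back(static_cast<char>(static_cast<unsigned char>(u - 0xDC00)));
        else
            out.push_back('?');
    }
    return out;
}
```

This round-trips arbitrary pathname bytes without interpreting them. It would not by itself produce the right bytes for a URL that already contains real Cyrillic or Greek code points; those would still need a decision about which encoding to emit.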


