Character encoding not being detected when using Link to external source in calc

Tue Dec 29 17:15:35 PST 2015

Hi guys,

In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217 -
Persian text in a webpage encoded as UTF-8 is being corrupted.

If I take the webpage and save it to an HTML file encoded as UTF-8, then there
are no problems and the Persian text comes through fine. However, when
connecting to the webserver directly the text is corrupted, even though the
HTTP header correctly gives the charset as UTF-8.

I did a test using Charles Proxy with its SSL interception feature turned
on and pointed Safari to
https://bugs.documentfoundation.org/attachment.cgi?id=119818

The following headers are gathered:

HTTP/1.1 200 OK
Server: nginx/1.2.1
Date: Sat, 26 Dec 2015 01:41:30 GMT
Content-Type: text/html; name="text.html"; charset=UTF-8
Content-Length: 982
Connection: keep-alive
X-xss-protection: 1; mode=block
Content-disposition: inline; filename="text.html"
X-content-type-options: nosniff
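
For reference, here is a minimal sketch of pulling the charset out of a Content-Type value like the one above. It uses Python's standard library purely for illustration and is not how LibreOffice or Neon does it; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() parses the header parameters and lowercases the result
    return msg.get_content_charset()


# Header value copied from the captured response
header = 'text/html; name="text.html"; charset=UTF-8'
print(charset_from_content_type(header))
print(charset_from_content_type("text/html"))
```

If the headers never reach the parser, a lookup like this has nothing to work with and returns nothing, which would match the behaviour I'm seeing.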

Some warnings are spat out that editeng's eehtml can't detect the
encoding. I initially thought it was looking for a BOM, which makes no
sense for a webpage, but that's wrong. Instead, for some reason the headers
don't seem to be processed and the HTML parser is falling back to
ISO-8859-1 rather than UTF-8 as the character encoding.
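
To illustrate the kind of corruption this fallback produces, here is a small Python sketch (illustrative only; the sample word is my own and not taken from the bug attachment). It encodes a short Persian word as UTF-8 and then decodes the bytes as ISO-8859-1.

```python
# A short Persian word, written with escapes so the source stays ASCII
original = "\u0633\u0644\u0627\u0645"

# What the server sends: the text encoded as UTF-8 (two bytes per letter here)
wire_bytes = original.encode("utf-8")

# What a parser produces if it wrongly assumes ISO-8859-1:
# every byte becomes its own character, so the text is garbled
misdecoded = wire_bytes.decode("iso-8859-1")

# What it produces when the declared charset is honoured
correct = wire_bytes.decode("utf-8")

print(len(original), len(wire_bytes), len(misdecoded))
print(correct == original, misdecoded == original)
```

Decoding as ISO-8859-1 never raises an error because every byte value maps to some character, so the failure is silent.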

We seem to use Neon to make the GET request to the webserver. A few
observations:

1. We detect a server OK response as an error
2. (Probably more to the point) The function being used indicates that a
PROPFIND verb is sent, but I believe a GET is actually used, as is normal,
and the headers aren't being stored. This means that when the parser looks
for the headers to find the encoding it's not finding anything, resulting
in a fallback to ISO-8859-1.

One easy thing (it doesn't solve the root issue): wouldn't it be a
better idea to fall back to UTF-8 rather than ISO-8859-1, given that UTF-8
can represent every character ISO-8859-1 can?
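
As a rough sketch of that fallback idea, assuming no charset information is available at all: try a strict UTF-8 decode first and only drop back to ISO-8859-1 if the bytes are not valid UTF-8. This is a common heuristic written in Python for illustration, not what editeng currently does, and decode_with_fallback is a hypothetical name.

```python
def decode_with_fallback(body):
    """Decode bytes as strict UTF-8, falling back to ISO-8859-1 on failure.

    Returns a (text, encoding_used) tuple.
    """
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail
        return body.decode("iso-8859-1"), "iso-8859-1"


# UTF-8 encoded Persian text decodes cleanly on the first attempt
persian = "\u0633\u0644\u0627\u0645".encode("utf-8")
print(decode_with_fallback(persian)[1])

# A lone 0xE9 byte is not valid UTF-8, so the fallback is used
legacy = b"caf\xe9"
print(decode_with_fallback(legacy))
```

Pure ASCII input decodes identically either way, so only pages with bytes above 0x7F are affected by the order of the attempts.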

Any pointers on how to get to the bottom of this would be appreciated; I'm
honestly not up on WebDAV or Neon.

Chris Sherlock