Character encoding not being detected when using Link to external source in calc

Mark Hung marklh9 at gmail.com
Sun Jan 3 21:23:12 PST 2016


Hi Chris,

I've recently been working on SvParser and HTMLParser, so a couple of pointers:

BOM detection is done in SvParser::GetNextChar().
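
Roughly, GetNextChar() peeks at the first bytes of the stream. An
illustrative sketch of the idea (simplified, not the actual SvParser
code):

#include <cstddef>

// Simplified BOM sniffing; the real logic in SvParser::GetNextChar()
// also tracks state across calls and differs in detail.
enum class BomKind { None, Utf8, Utf16BE, Utf16LE };

BomKind detectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BomKind::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BomKind::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BomKind::Utf16LE;
    return BomKind::None; // no BOM: caller must rely on headers/default
}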

From a quick look at eehtml, EditHTMLParser::EditHTMLParser seems to be the relevant place.

Best regards.


2016-01-04 12:02 GMT+08:00 Chris Sherlock <chris.sherlock79 at gmail.com>:

> Hey guys,
>
> Probably nobody saw this because of the time of year (Happy New Year,
> incidentally!!!).
>
> Just a quick ping to the list to see if anyone can give me some pointers.
>
> Chris
>
> On 30 Dec 2015, at 12:15 PM, Chris Sherlock <chris.sherlock79 at gmail.com>
> wrote:
>
> Hi guys,
>
> In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217
> - Persian text in a webpage encoded as UTF-8 is being corrupted.
>
> If I save the webpage to a local HTML file encoded as UTF-8, there are
> no problems and the Persian text comes through fine. However, when
> connecting to the webserver directly the text is corrupted, even though
> the HTTP header correctly gives the content type as UTF-8.
>
> I did a test using Charles Proxy with its SSL interception feature turned
> on and pointed Safari to
> https://bugs.documentfoundation.org/attachment.cgi?id=119818
>
> The following response headers were captured:
>
> HTTP/1.1 200 OK
> Server: nginx/1.2.1
> Date: Sat, 26 Dec 2015 01:41:30 GMT
> Content-Type: text/html; name="text.html"; charset=UTF-8
> Content-Length: 982
> Connection: keep-alive
> X-xss-protection: 1; mode=block
> Content-disposition: inline; filename="text.html"
> X-content-type-options: nosniff
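>
> The charset is right there in the Content-Type value, so once the
> header actually reaches the parser, pulling it out is simple. A rough
> sketch of the kind of parsing involved (a hypothetical helper, not
> code from our tree):
>
> #include <algorithm>
> #include <cctype>
> #include <cstddef>
> #include <string>
>
> // Hypothetical: extract the charset parameter from a Content-Type
> // value like: text/html; name="text.html"; charset=UTF-8
> std::string extractCharset(std::string value)
> {
>     std::transform(value.begin(), value.end(), value.begin(),
>                    [](unsigned char c) { return std::tolower(c); });
>     std::size_t pos = value.find("charset=");
>     if (pos == std::string::npos)
>         return ""; // no charset parameter present
>     pos += 8; // skip over "charset="
>     std::string cs = value.substr(pos, value.find(';', pos) - pos);
>     // strip quotes and whitespace, if any
>     cs.erase(std::remove_if(cs.begin(), cs.end(),
>                  [](unsigned char c)
>                  { return c == '"' || std::isspace(c); }),
>              cs.end());
>     return cs; // yields "utf-8" for the header above
> }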
>
>
> Some warnings are spat out that editeng's eehtml can't detect the
> encoding. I initially thought it was looking for a BOM - which makes no
> sense for a webpage - but that's wrong. Instead, for some reason the
> headers don't seem to be processed, and the HTML parser falls back to
> ISO-8859-1 rather than UTF-8 as the character encoding.
>
> We seem to use Neon to make the GET request to the webserver. A few
> observations:
>
> 1. We detect a server OK response as an error.
> 2. (Probably more to the point) I believed PROPFIND was being used, but
> in fact, even though the function involved suggests a PROPFIND verb, a
> normal GET is issued - the response headers just aren't being stored.
> This means that when the parser looks for the headers to find the
> encoding, it finds nothing and falls back to ISO-8859-1 (see the sketch
> below).
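>
> For what it's worth, Neon itself does expose the response headers once
> a request has begun; this minimal sketch (assuming plain HTTP and a
> made-up host/path, with none of our webdav UCP wrapping) shows where
> the value is available:
>
> #include <ne_request.h>
> #include <ne_session.h>
> #include <ne_socket.h>
> #include <cstdio>
>
> int main()
> {
>     ne_sock_init();
>     // made-up host and path, purely for illustration
>     ne_session* sess = ne_session_create("http", "example.org", 80);
>     ne_request* req = ne_request_create(sess, "GET", "/text.html");
>
>     if (ne_begin_request(req) == NE_OK)
>     {
>         // Headers are readable at this point; if the wrapper never
>         // stores this value, the parser downstream has nothing to
>         // consult.
>         const char* ctype = ne_get_response_header(req, "Content-Type");
>         std::printf("Content-Type: %s\n", ctype ? ctype : "(missing)");
>
>         char buf[4096];
>         while (ne_read_response_block(req, buf, sizeof(buf)) > 0)
>             ; // drain the body
>         ne_end_request(req);
>     }
>
>     ne_request_destroy(req);
>     ne_session_destroy(sess);
>     ne_sock_exit();
>     return 0;
> }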
>
> One easy thing (it doesn't solve the root issue): wouldn't it be better
> to fall back to UTF-8 rather than ISO-8859-1? Plain ASCII decodes
> identically under both, but any non-ASCII UTF-8 content is mangled by
> an ISO-8859-1 fallback.
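>
> To make the mangling concrete, a worked example with one Persian
> letter, FEH (U+0641), which encodes in UTF-8 as the two bytes 0xD9
> 0x81:
>
> #include <cstdio>
>
> int main()
> {
>     // The two UTF-8 bytes of U+0641; read as ISO-8859-1, each byte
>     // maps 1:1 to a code point in U+0000..U+00FF.
>     const unsigned char utf8_feh[] = { 0xD9, 0x81 };
>     for (unsigned char b : utf8_feh)
>         std::printf("byte 0x%02X -> ISO-8859-1 reads it as U+%04X\n",
>                     b, b);
>     return 0;
> }
>
> So one letter becomes U+00D9 ('Ù') plus the C1 control U+0081 -
> exactly the sort of mojibake the bug shows.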
>
> Any pointers on how to get to the bottom of this would be appreciated -
> I'm honestly not up on WebDAV or Neon.
>
> Chris Sherlock
>
>
>


-- 
Mark Hung