Character encoding not being detected when using Link to external source in Calc

Mark Hung marklh9 at gmail.com
Thu Jan 28 08:26:09 PST 2016


http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav-neon/ContentProperties.cxx#454

-    else if ( rName == "Content-Type" )
+    else if ( rName.equalsIgnoreAsciiCaseAscii("Content-Type"))
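
For context: HTTP header field names are case-insensitive, so the exact
comparison rName == "Content-Type" silently misses a server that sends,
say, "content-type". A minimal standalone sketch of the intended behaviour
(using std::string rather than the OUString API the real code uses, just
to show the semantics):

// Illustration only: the real code would call
// rName.equalsIgnoreAsciiCaseAscii("Content-Type") on an OUString.
#include <cctype>
#include <iostream>
#include <string>

// ASCII-only, case-insensitive comparison of two strings.
static bool equalsIgnoreAsciiCase(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::string::size_type i = 0; i < a.size(); ++i)
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    return true;
}

int main()
{
    // Header names as different servers might spell them.
    const std::string names[] =
        { "Content-Type", "content-type", "CONTENT-TYPE" };
    for (const auto& rName : names)
        std::cout << rName
                  << "  exact: " << (rName == "Content-Type")
                  << "  ignore-case: "
                  << equalsIgnoreAsciiCase(rName, "Content-Type") << "\n";
}

Only the exact comparison succeeds for the first spelling; the
case-insensitive one accepts all three.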



2016-01-28 9:16 GMT+08:00 Chris Sherlock <chris.sherlock79 at gmail.com>:

> Hi guys,
>
> I’m afraid I’m still a bit stuck on this. Any other ideas about what might
> be causing the problem?
>
> Chris
>
> On 6 Jan 2016, at 4:27 AM, Chris Sherlock <chris.sherlock79 at gmail.com>
> wrote:
>
> Thanks Mark, appreciate these code pointers!
>
> (I’m cc’ing in the mailing list so others can comment)
>
> Chris
>
> On 4 Jan 2016, at 8:21 PM, Mark Hung <marklh9 at gmail.com> wrote:
>
>
> I meant there is a chance for SvParser::GetNextChar() to switch encoding,
> but yes it is less relevant.
>
> Grepping for content-type under ucb, there is some suspicious code:
>
> http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav-neon/ContentProperties.cxx#454
>
> http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav/ContentProperties.cxx#471
>
> These seem inconsistent with:
>
> http://opengrok.libreoffice.org/xref/core/sc/source/filter/html/htmlpars.cxx#264
>
>
> 2016-01-04 16:17 GMT+08:00 Chris Sherlock <chris.sherlock79 at gmail.com>:
>
>> Hi Mark,
>>
>> BOM detection is irrelevant here. The HTTP header states that it should
>> be UTF-8, but this is not being honoured.
>>
>> There is something further down the stack that isn’t recording the HTTP
>> headers.
>>
>> Chris
>>
>> On 4 Jan 2016, at 4:23 PM, Mark Hung <marklh9 at gmail.com> wrote:
>>
>> Hi Chris,
>>
>> As I've recently been working on SvParser and HTMLParser:
>>
>> There is BOM detection in SvParser::GetNextChar().
>>
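
(Not the actual SvParser code, just a generic sketch of the kind of BOM
sniffing GetNextChar() can do, to show why it yields nothing for a
BOM-less UTF-8 page served over HTTP:)

// Generic BOM detection sketch: pick an encoding only if a byte order
// mark is actually present at the start of the stream.
#include <cstddef>

enum class Enc { Unknown, Utf8, Utf16LE, Utf16BE };

static Enc detectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Enc::Utf8;      // UTF-8 BOM
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Enc::Utf16LE;   // UTF-16 little-endian BOM
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Enc::Utf16BE;   // UTF-16 big-endian BOM
    return Enc::Unknown;       // no BOM: the caller has to fall back to the
                               // HTTP header, a <meta> tag, or a default
}

Most web pages carry no BOM, so this path returns Unknown and the decision
has to come from the HTTP header (or a <meta> tag).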
>> From a quick look at eehtml, EditHTMLParser::EditHTMLParser seems relevant.
>>
>> Best regards.
>>
>>
>> 2016-01-04 12:02 GMT+08:00 Chris Sherlock <chris.sherlock79 at gmail.com>:
>>
>>> Hey guys,
>>>
>>> Probably nobody saw this because of the time of year (Happy New Year,
>>> incidentally!!!).
>>>
>>> Just a quick ping to the list to see if anyone can give me some
>>> pointers.
>>>
>>> Chris
>>>
>>> On 30 Dec 2015, at 12:15 PM, Chris Sherlock <chris.sherlock79 at gmail.com>
>>> wrote:
>>>
>>> Hi guys,
>>>
>>> In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217
>>> - Persian text in a webpage encoded as UTF-8 is being corrupted.
>>>
>>> If I take the webpage and save it to an HTML file encoded as UTF-8, then
>>> there are no problems and the Persian text comes through fine. However,
>>> when connecting to the web server directly the text is corrupted, even
>>> though the HTTP header correctly gives the content type as UTF-8.
>>>
>>> I did a test using Charles Proxy with its SSL interception feature
>>> turned on and pointed Safari to
>>> https://bugs.documentfoundation.org/attachment.cgi?id=119818
>>>
>>> The following headers are gathered:
>>>
>>> HTTP/1.1 200 OK
>>> Server: nginx/1.2.1
>>> Date: Sat, 26 Dec 2015 01:41:30 GMT
>>> Content-Type: text/html; name="text.html"; charset=UTF-8
>>> Content-Length: 982
>>> Connection: keep-alive
>>> X-xss-protection: 1; mode=block
>>> Content-disposition: inline; filename="text.html"
>>> X-content-type-options: nosniff
>>>
>>>
>>> Some warnings are spat out that editeng's eehtml can't detect the
>>> encoding. I initially thought it was looking for a BOM, which makes no
>>> sense for a webpage, but that's wrong. Instead, for some reason the headers
>>> don't seem to be processed and the HTML parser is falling back to
>>> ISO-8859-1 rather than UTF-8 as the character encoding.
>>>
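
(Extracting the charset from the header above is the easy part. Here is a
rough sketch, not the actual editeng code, of what that would take; the
missing headers, rather than the parsing, look like the real problem:)

// Rough sketch: pull the charset parameter out of a Content-Type value.
#include <cctype>
#include <iostream>
#include <string>

static std::string extractCharset(const std::string& value)
{
    // Lower-case a copy so the search is case-insensitive.
    std::string lower = value;
    for (char& c : lower)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return "";                        // no charset parameter at all

    pos += key.size();
    std::string::size_type end = value.find(';', pos);
    std::string charset = value.substr(
        pos, end == std::string::npos ? std::string::npos : end - pos);

    // Strip surrounding quotes and spaces.
    while (!charset.empty() && (charset.front() == '"' || charset.front() == ' '))
        charset.erase(charset.begin());
    while (!charset.empty() && (charset.back() == '"' || charset.back() == ' '))
        charset.pop_back();
    return charset;
}

int main()
{
    // Prints "UTF-8" for the header captured above.
    std::cout << extractCharset(
        "text/html; name=\"text.html\"; charset=UTF-8") << "\n";
}
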
>>> We seem to use Neon to make the GET request to the web server. A few
>>> observations:
>>>
>>> 1. We detect a server OK response as an error.
>>> 2. (Probably more to the point) I believe PROPFIND is being used, but even
>>> though the function involved suggests a PROPFIND verb, a GET is actually
>>> issued as normal, yet the response headers aren't being stored. This means
>>> that when the parser looks for the headers to find the encoding it finds
>>> nothing, and falls back to ISO-8859-1. (A rough Neon sketch for checking
>>> this directly follows below.)
>>>
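
To check observation 2 independently of our ucp code, something along these
lines should show whether the Content-Type header really is available on a
plain GET. This is a sketch against the plain neon C API written from memory,
so treat the exact calls as assumptions:

/* Sketch: fetch the bug attachment with neon and print the Content-Type
 * response header (or note that it is missing). */
#include <stdio.h>
#include <ne_session.h>
#include <ne_request.h>
#include <ne_ssl.h>
#include <ne_utils.h>

int main(void)
{
    ne_session *sess = ne_session_create("https",
                                         "bugs.documentfoundation.org", 443);
    ne_ssl_trust_default_ca(sess);   /* trust the system CA store for https */

    ne_request *req = ne_request_create(sess, "GET",
                                        "/attachment.cgi?id=119818");

    if (ne_request_dispatch(req) == NE_OK) {
        /* Returns the raw header value, or NULL if the server did not send
         * it (assuming the neon >= 0.25 response-header API). */
        const char *ctype = ne_get_response_header(req, "Content-Type");
        printf("Content-Type: %s\n", ctype ? ctype : "(missing)");
    } else {
        printf("request failed: %s\n", ne_get_error(sess));
    }

    ne_request_destroy(req);
    ne_session_destroy(sess);
    return 0;
}

If the header shows up there but not in what the parser sees, the loss is
somewhere in our ucb/webdav layer rather than in Neon itself.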
>>> One easy thing (it doesn't solve the root issue): wouldn't it be better to
>>> fall back to UTF-8 rather than ISO-8859-1, given that the ASCII range the
>>> two share decodes identically under UTF-8 anyway?
>>>
>>> Any pointers on how to get to the bottom of this would be appreciated;
>>> I'm honestly not up on WebDAV or Neon.
>>>
>>> Chris Sherlock
>>>
>>>
>>>
>>> _______________________________________________
>>> LibreOffice mailing list
>>> LibreOffice at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/libreoffice
>>>
>>>
>>
>>
>> --
>> Mark Hung
>>
>>
>>
>
>
> --
> Mark Hung
>
>
>
>
> _______________________________________________
> LibreOffice mailing list
> LibreOffice at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/libreoffice
>
>


-- 
Mark Hung