HTML import filter very (too) basic

Sat Jan 12 19:56:10 UTC 2019

Thank you Kohei for the information!

Noel mentioned that much of Writer’s current implementation covers the conversion of HTML into Writer’s DOM. How complete is that step? It sounds like an iterative approach to me: once the DOM tree has been replicated, the CSS needs to be cascaded accordingly—and that step seems to be rather incomplete?
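
To make the picture I have in mind a bit more concrete, here is a minimal Python sketch of those two steps: a replicated DOM tree first, then a cascade pass over it. This is only an illustration and not Writer or orcus code. The names Node, INHERITED, specificity, matches and cascade are all hypothetical, and the sketch assumes simple selectors only (tag, .class, #id), ignoring combinators, !important, and style origins.

```python
from dataclasses import dataclass, field

# Assumption: a small set of properties that children inherit from parents.
INHERITED = {"color", "font-family", "font-size"}

@dataclass
class Node:
    """Hypothetical DOM node; `style` is filled in by cascade()."""
    tag: str
    id: str = ""
    classes: tuple = ()
    children: list = field(default_factory=list)
    style: dict = field(default_factory=dict)

def specificity(selector):
    """Rank a simple selector: #id beats .class beats tag."""
    if selector.startswith("#"):
        return 2
    if selector.startswith("."):
        return 1
    return 0

def matches(selector, node):
    """Check whether a simple selector applies to the node."""
    if selector.startswith("#"):
        return node.id == selector[1:]
    if selector.startswith("."):
        return selector[1:] in node.classes
    return node.tag == selector

def cascade(node, rules, parent_style=None):
    """Compute node.style from inherited values plus matching rules."""
    # Start from the inheritable part of the parent's computed style.
    style = {k: v for k, v in (parent_style or {}).items() if k in INHERITED}
    # Apply matching rules ordered by (specificity, source order); later wins.
    matching = [(specificity(sel), i, decls)
                for i, (sel, decls) in enumerate(rules) if matches(sel, node)]
    for _, _, decls in sorted(matching, key=lambda m: m[:2]):
        style.update(decls)
    node.style = style
    for child in node.children:
        cascade(child, rules, style)

# Step 1: the replicated DOM tree (body > p#intro > em.note).
em = Node("em", classes=("note",))
p = Node("p", id="intro", children=[em])
body = Node("body", children=[p])

# Step 2: cascade a small rule list over the tree.
rules = [
    ("body", {"color": "black", "margin": "8px"}),
    ("p", {"color": "blue"}),
    ("#intro", {"color": "red"}),
    (".note", {"font-style": "italic"}),
]
cascade(body, rules)
print(em.style)  # {'color': 'red', 'font-style': 'italic'}
```

In this sketch the em element ends up red only because color is inherited from the paragraph, where #intro outranks p. If the cascade step is incomplete, that is exactly the kind of result I would expect to come out wrong even when the tree itself is converted correctly.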

Again, I haven’t had any time to noodle through the Writer code so I am just trying to put together a picture based on this discussion…

Many greetings,
Jens

> On Jan 13, 2019, at 03:46, Kohei Yoshida <libreoffice at kohei.us> wrote:
> 
>> I believe that Kohei started doing some parsing work over in the orcus
>> library at
>>   https://gitlab.com/orcus/orcus
>> and we use some of that (e.g. very very basic CSS parsing) somewhere in our
>> code.
> 
> Just to clarify on this a bit.  The orcus library provides a C++ template-based CSS parser which supports a pretty wide variety of the current CSS feature set.  It's not 100% feature complete, but it does handle more than just a basic set of CSS structures, to say the least.
> 
> We currently use that orcus CSS parser to handle some very basic cell formatting imports in Calc, but that can be extended if needed.
> 
> Now, on the Writer side it's a different story.  There is *some* code sharing between Writer and Calc wrt HTML parsing, but the CSS parsing code is not shared between the two.  AFAICR Writer has its own CSS parser that does not use the orcus CSS parser, and nobody is maintaining that code right now.
> 
>> But for normal HTML we still use our own parser.
> 
> Yup.
> 
>> And the parsing is only a very small part anyhow, most of the work is in
>> converting the HTML model to our own document model.
> 
> Yes, and that part is handled independently between Writer and Calc.
> 
> Kohei
> 
> --
> Kohei Yoshida, LibreOffice Calc volunteer hacker

--
Jens Tröger
http://savage.light-speed.de/