HTML import filter very (too) basic

Sat Jan 12 17:46:26 UTC 2019

> On January 9, 2019 at 5:22 AM Noel Grandin <noelgrandin at gmail.com> wrote:
> 
> 
> On Wed, 9 Jan 2019 at 10:25, Jens Tröger <jens.troeger at light-speed.de>
> wrote:
> 
> >
> > > On Jan 9, 2019, at 16:06, Noel Grandin <noelgrandin at gmail.com> wrote:
> > >
> > > Nobody owns it, and you're welcome to file a bug, but there are already
> > a ton of HTML import bugs, our support is really very basic.
> >
> > Well I’ve noticed in the past that bugs regarding the HTML filter received
> > very little attention, unfortunately. If not the existing import filter,
> > are there efforts to implement alternatives?
> >
> >
> HTML is a fairly massive beast, so there are __always__ going to be bugs in
> our import filter.
> 
> I believe that Kohei started doing some parsing work over in the orcus
> library at
>    https://gitlab.com/orcus/orcus
> and we use some of that (e.g. very very basic CSS parsing) somewhere in our
> code.

Just to clarify on this a bit.  The orcus library provides a C++ template-based CSS parser that supports a fairly wide range of the current CSS feature set.  It's not 100% feature complete, but it handles considerably more than a basic set of CSS structures.

We currently use that orcus CSS parser only for some very basic cell formatting imports in Calc, but that can be extended if needed.
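
For readers unfamiliar with the term, the sketch below illustrates what a template-based, handler-driven parser looks like in general: the parser is a class template over a handler type and reports what it finds through callbacks on that handler.  This is NOT the orcus API.  Every name in it (`tiny_css_parser`, `rule_collector`, `collect_rules`, `trim`) is hypothetical, it assumes C++17 for `std::string_view`, and it covers only flat `selector { name: value; }` rules, which is far less than a real CSS parser has to handle.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration only; none of these names come from orcus.

// Strip leading and trailing ASCII whitespace from a view.
inline std::string_view trim(std::string_view s)
{
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string_view::npos)
        return {};
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// The parser is a template over its handler type, so the callbacks are
// resolved at compile time and need no virtual dispatch.
template<typename Handler>
class tiny_css_parser
{
    std::string_view m_src;
    Handler& m_handler;

public:
    tiny_css_parser(std::string_view src, Handler& handler) :
        m_src(src), m_handler(handler) {}

    // Handles only "selector { name: value; ... }" rules.  No comments,
    // at-rules or nested blocks; an unterminated trailing rule is ignored.
    void parse()
    {
        std::size_t pos = 0;
        while (true)
        {
            const auto open = m_src.find('{', pos);
            if (open == std::string_view::npos)
                break;
            const auto close = m_src.find('}', open);
            if (close == std::string_view::npos)
                break;

            m_handler.selector(trim(m_src.substr(pos, open - pos)));
            parse_declarations(m_src.substr(open + 1, close - open - 1));
            m_handler.end_block();
            pos = close + 1;
        }
    }

private:
    // Split the block body on ';' and each declaration on its first ':'.
    void parse_declarations(std::string_view body)
    {
        while (!body.empty())
        {
            const auto semi = body.find(';');
            const auto decl = body.substr(0, semi); // npos means "the rest"
            const auto colon = decl.find(':');
            if (colon != std::string_view::npos)
                m_handler.property(
                    trim(decl.substr(0, colon)), trim(decl.substr(colon + 1)));
            if (semi == std::string_view::npos)
                break;
            body.remove_prefix(semi + 1);
        }
    }
};

// Example handler: records every callback as one line of text.
struct rule_collector
{
    std::vector<std::string> lines;

    void selector(std::string_view s)
    {
        lines.push_back("selector " + std::string(s));
    }

    void property(std::string_view name, std::string_view value)
    {
        lines.push_back(std::string(name) + "=" + std::string(value));
    }

    void end_block() { lines.push_back("end"); }
};

// Convenience wrapper: parse a CSS string and return the recorded lines.
inline std::vector<std::string> collect_rules(std::string_view css)
{
    rule_collector handler;
    tiny_css_parser<rule_collector> parser(css, handler);
    parser.parse();
    return handler.lines;
}
```

With this shape, a consumer such as a spreadsheet import filter would supply its own handler that maps the reported properties onto cell formatting, instead of collecting strings.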

Now, on the Writer side it's a different story.  There is *some* code sharing between Writer and Calc with respect to HTML parsing, but the CSS parsing code is not shared between the two.  AFAICR Writer has its own CSS parser that does not use the orcus one, and nobody is maintaining that code right now.

> But for normal HTML we still use our own parser.

Yup.

> And the parsing is only a very small part anyhow, most of the work is in
> converting the HTML model to our own document model.

Yes, and that part is handled independently between Writer and Calc.

Kohei

--
Kohei Yoshida, LibreOffice Calc volunteer hacker