<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64815#c55">Comment # 55</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64815">bug 64815</a>
from <span class="vcard"><a class="email" href="mailto:aperez@igalia.com" title="Adrian Perez de Castro <aperez@igalia.com>"> <span class="fn">Adrian Perez de Castro</span></a>
</span></b>
<pre>(In reply to <a href="show_bug.cgi?id=64815#c50">comment #50</a>)
<span class="quote">> Comment on <span class=""><a href="attachment.cgi?id=86678" name="attach_86678" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">attachment 86678</a> <a href="attachment.cgi?id=86678&action=edit" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64815&attachment=86678'>[review]</a> [review]
> [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
>
> Review of <span class=""><a href="attachment.cgi?id=86678" name="attach_86678" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">attachment 86678</a> <a href="attachment.cgi?id=86678&action=edit" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64815&attachment=86678'>[review]</a> [review]:
> -----------------------------------------------------------------
>
> Looks good in general, my main concern is that some code is duplicated with
> TextOutputDev, so I prefer if we could somehoe add support for marked
> content to TextOutputDev instead of inventing yet another output dev. All
> those abbreviations makes the code a bit more difficult to read for me.</span >
New version of the patch ditches the “MC” abbreviation and it appears
expanded to “MarkedContent”.
<span class="quote">> ::: poppler/MCOutputDev.cc
> @@ +34,5 @@
> > +};
> > +
> > +
> > +MCOutputDev::MCOutputDev(int mcid):
> > + p(new Priv(mcid))
>
> Why do we need a heap allocated private struct?</span >
We don't, I first wrote this before being aware that the Poppler low-level
API is not considered public (as in “public stable API”).
<span class="quote">> @@ +66,5 @@
> > +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> > +{
> > + int id = -1;
> > + if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> > + p->capturing = true;
>
> capturing sounds a bit confusing to me, I would use something like
> inMarkedContent</span >
Changed.
<span class="quote">> @@ +149,5 @@
> > + if (p->lastFlags != flags) {
> > + if (p->capturing)
> > + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > + p->lastFlags = flags;
> > + }
>
> It seems you are not adding any color "op". Either remove the color type or
> add it when it changes in ::drawChar (or beginWord)</span >
Ooops. New version of the patch adds the missing code.
<span class="quote">> @@ +150,5 @@
> > + if (p->capturing)
> > + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > + p->lastFlags = flags;
> > + }
> > +}
>
> I wonder if we can integrate all this with TextOutputDev somehow. There's
> some code duplication and this is about extracting text after all. Maybe we
> could add a MarkedContentText object, similar to TextPage or ActualText
> objects, and use it from the TextOutputDev. We could add a new constructor
> that only creates a MarkedContentText and receives the marked content id,
> and then you can get the text and attributes.
>
> ::: poppler/MCOutputDev.h
> @@ +17,5 @@
> > +class GfxState;
> > +class GooString;
> > +class Dict;
> > +
> > +struct MCOp {
>
> What's op? operator?</span >
Operation.
<span class="quote">> @@ +33,5 @@
> > + return ((Guint) (r * 255) & 0xFF) << 16
> > + | ((Guint) (g * 255) & 0xFF) << 8
> > + | ((Guint) (b * 255) & 0xFF);
> > + }
> > + };
>
> Why not using GfxColor? or even GfxRGB if only RGB is supported?</span >
GfxRGB seems a resonable compromise. Next version will use it.
<span class="quote">> @@ +67,5 @@
> > + }
> > + MCOp(): type(FontName), value(NULL) {}
> > + MCOp(Unicode u): type(Unichar), unichar(u) {}
> > + MCOp(const char *s): type(FontName), value(strdup(s)) {}
> > + MCOp(Type t, Guint f = 0): type(t), flags(f) {}
>
> Why do you need to pass also the Type for the flags constructor?</span >
Type “Unicode” is also a “Guint”, so there would be two overloaded
functions with one argument of the same type. The compiler cannot
resolve the ambiguity and generates an error. Adding the extra
“Type” argument is a workaround for that.
<span class="quote">> ::: poppler/StructElement.h
> @@ +233,5 @@
> > + //
> > + // The text will be appended to the passed GooString. If NULL is passed,
> > + // a new string is returned, and the ownership passed to the caller.
> > + //
> > + GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;
>
> appendText maybe?</span >
After thinking twice about this, it looks like it is wrong to have a
single function that can work in two ways. The reason for those two
behaviours is that client code will usually call “getText()”, without
passing arguments, so internally the function will create a GooString
and then iterate over the children recursively, passing that one string
to avoid allocating one GooString per child element.
IMHO the best is to split this in two functions:
public:
GooString* getText(GBool recursive = gTrue);
private:
void appendSubTreeText(GooString *string);
The new “appendSubTreeText()” function would be a private helper
function used when “getText()” is called with “recusive = gTrue”.</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>