<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64815#c50">Comment # 50</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64815">bug 64815</a>
from <span class="vcard"><a class="email" href="mailto:carlosgc@gnome.org" title="Carlos Garcia Campos <carlosgc@gnome.org>"> <span class="fn">Carlos Garcia Campos</span></a>
</span></b>
<pre>Comment on <span class=""><a href="attachment.cgi?id=86678" name="attach_86678" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">attachment 86678</a> <a href="attachment.cgi?id=86678&action=edit" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64815&attachment=86678'>[review]</a>
[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
Review of <span class=""><a href="attachment.cgi?id=86678" name="attach_86678" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">attachment 86678</a> <a href="attachment.cgi?id=86678&action=edit" title="[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64815&attachment=86678'>[review]</a>:
-----------------------------------------------------------------
Looks good in general, my main concern is that some code is duplicated with
TextOutputDev, so I prefer if we could somehoe add support for marked content
to TextOutputDev instead of inventing yet another output dev. All those
abbreviations makes the code a bit more difficult to read for me.
::: poppler/MCOutputDev.cc
@@ +34,5 @@
<span class="quote">> +};
> +
> +
> +MCOutputDev::MCOutputDev(int mcid):
> + p(new Priv(mcid))</span >
Why do we need a heap allocated private struct?
@@ +66,5 @@
<span class="quote">> +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> +{
> + int id = -1;
> + if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> + p->capturing = true;</span >
capturing sounds a bit confusing to me, I would use something like
inMarkedContent
@@ +149,5 @@
<span class="quote">> + if (p->lastFlags != flags) {
> + if (p->capturing)
> + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> + p->lastFlags = flags;
> + }</span >
It seems you are not adding any color "op". Either remove the color type or add
it when it changes in ::drawChar (or beginWord)
@@ +150,5 @@
<span class="quote">> + if (p->capturing)
> + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> + p->lastFlags = flags;
> + }
> +}</span >
I wonder if we can integrate all this with TextOutputDev somehow. There's some
code duplication and this is about extracting text after all. Maybe we could
add a MarkedContentText object, similar to TextPage or ActualText objects, and
use it from the TextOutputDev. We could add a new constructor that only creates
a MarkedContentText and receives the marked content id, and then you can get
the text and attributes.
::: poppler/MCOutputDev.h
@@ +17,5 @@
<span class="quote">> +class GfxState;
> +class GooString;
> +class Dict;
> +
> +struct MCOp {</span >
What's op? operator?
@@ +33,5 @@
<span class="quote">> + return ((Guint) (r * 255) & 0xFF) << 16
> + | ((Guint) (g * 255) & 0xFF) << 8
> + | ((Guint) (b * 255) & 0xFF);
> + }
> + };</span >
Why not using GfxColor? or even GfxRGB if only RGB is supported?
@@ +67,5 @@
<span class="quote">> + }
> + MCOp(): type(FontName), value(NULL) {}
> + MCOp(Unicode u): type(Unichar), unichar(u) {}
> + MCOp(const char *s): type(FontName), value(strdup(s)) {}
> + MCOp(Type t, Guint f = 0): type(t), flags(f) {}</span >
Why do you need to pass also the Type for the flags constructor?
::: poppler/StructElement.h
@@ +233,5 @@
<span class="quote">> + //
> + // The text will be appended to the passed GooString. If NULL is passed,
> + // a new string is returned, and the ownership passed to the caller.
> + //
> + GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;</span >
appendText maybe?</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>