<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Expose the structure tree and attributes in poppler-glib"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64821#c23">Comment # 23</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [TAGGEDPDF] Expose the structure tree and attributes in poppler-glib"
href="https://bugs.freedesktop.org/show_bug.cgi?id=64821">bug 64821</a>
from <span class="vcard"><a class="email" href="mailto:carlosgc@gnome.org" title="Carlos Garcia Campos <carlosgc@gnome.org>"> <span class="fn">Carlos Garcia Campos</span></a>
</span></b>
<pre>Comment on <span class=""><a href="attachment.cgi?id=86681" name="attach_86681" title="[PATCH v8 08/15] glib: Expose the document structure tree">attachment 86681</a> <a href="attachment.cgi?id=86681&action=edit" title="[PATCH v8 08/15] glib: Expose the document structure tree">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64821&attachment=86681'>[review]</a>
[PATCH v8 08/15] glib: Expose the document structure tree
Review of <span class=""><a href="attachment.cgi?id=86681" name="attach_86681" title="[PATCH v8 08/15] glib: Expose the document structure tree">attachment 86681</a> <a href="attachment.cgi?id=86681&action=edit" title="[PATCH v8 08/15] glib: Expose the document structure tree">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=64821&attachment=86681'>[review]</a>:
-----------------------------------------------------------------
The API looks good to me in general, I think we should not cache the text in
the StructElement, and documentation should be improved a bit. Thanks.
::: glib/poppler-structure-element.cc
@@ +44,5 @@
<span class="quote">> + * and poppler_structure_get_child() to enumerate the top level elements.
> + */
> +
> +static PopplerStructureElement *
> + _poppler_structure_element_new (PopplerDocument *document, StructElement *element);</span >
If this is only used here, simply move the code before it's used so that we
don't need the prototype. I think it makes sense to add all the iter
implementation after the StructElement impl.
@@ +70,5 @@
<span class="quote">> + *
> + * Creates a new #PopplerStructureElementIter as a copy of @iter. The
> + * returned value must be freed with poppler_structure_element_iter_free().
> + *
> + * Return value: a new #PopplerStructureElementIter</span >
(transfer full)
@@ +71,5 @@
<span class="quote">> + * Creates a new #PopplerStructureElementIter as a copy of @iter. The
> + * returned value must be freed with poppler_structure_element_iter_free().
> + *
> + * Return value: a new #PopplerStructureElementIter
> + */</span >
You should add Since: 0.26 to all doc comments.
@@ +105,5 @@
<span class="quote">> +/**
> + * poppler_structure_element_iter_new:
> + * @poppler_document: a #PopplerDocument.
> + *
> + * Returns the root #PopplerStructureElementIter for @document, or %NULL. The</span >
@document -> @poppler_document or better use just document in parameter name.
@@ +155,5 @@
<span class="quote">> +
> + iter = g_slice_new0 (PopplerStructureElementIter);
> + iter->document = (PopplerDocument *) g_object_ref (poppler_document);
> + iter->is_root = TRUE;
> + iter->root = root;</span >
Do not line up the =
@@ +190,5 @@
<span class="quote">> + * @iter: a #PopplerStructureElementIter
> + *
> + * Returns the #PopplerStructureElementIter associated with @iter.
> + *
> + * Return value: (transfer full): a new #PopplerStructureElementIter</span >
#PopplerStructureElementIter -> #PopplerStructureElement
@@ +210,5 @@
<span class="quote">> +/**
> + * poppler_structure_element_iter_get_child:
> + * @parent: a #PopplerStructureElementIter
> + *
> + * Return value: a new #PopplerStructureElementIter</span >
you should document this can return %NULL if the iter doesn't have any child
@@ +344,5 @@
<span class="quote">> +{
> + GObjectClass parent_class;
> +};
> +
> +G_DEFINE_TYPE (PopplerStructureElement, poppler_structure_element, G_TYPE_OBJECT);</span >
Unneeded trailing semicolon
@@ +355,5 @@
<span class="quote">> +
> + g_assert (POPPLER_IS_DOCUMENT (document));
> + g_assert (element);
> +
> + poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);</span >
What's the second NULL?
@@ +357,5 @@
<span class="quote">> + g_assert (element);
> +
> + poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);
> + poppler_structure_element->document = (PopplerDocument *) g_object_ref (document);
> + poppler_structure_element->elem = element;</span >
Do not line up the =
@@ +401,5 @@
<span class="quote">> + *
> + * Return value: A #PopplerStructureElementKind value.
> + */
> +PopplerStructureElementKind
> +poppler_structure_element_get_kind (PopplerStructureElement *poppler_structure_element)</span >
Use structure_element or even just element for the parameter to make everything
a bit shorter
@@ +413,5 @@
<span class="quote">> +/**
> + * poppler_structure_element_get_page:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: Number of the page that contains the element, of</span >
of -> or
@@ +437,5 @@
<span class="quote">> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Checks whether an element is actual document content.
> + *
> + * Return value: Whether the element is content.</span >
%TRUE if @poppler_structure_element is document content, or %FALSE otherwise.
And same pattern for all other is_foo methods
@@ +485,5 @@
<span class="quote">> +
> +/**
> + * poppler_structure_element_get_id:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *</span >
Maybe we could document here what the element identifier is for and why it's
optional?
@@ +487,5 @@
<span class="quote">> + * poppler_structure_element_get_id:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: (transfer none): The identifier of the element (if
> + * defined), or %NULL.</span >
Do not use the parentheses for if defined.
@@ +489,5 @@
<span class="quote">> + *
> + * Return value: (transfer none): The identifier of the element (if
> + * defined), or %NULL.
> + */
> +const gchar*</span >
gchar* -> gchar *
@@ +496,5 @@
<span class="quote">> + g_return_val_if_fail (POPPLER_IS_STRUCTURE_ELEMENT (poppler_structure_element), NULL);
> + g_assert (poppler_structure_element->elem);
> +
> + if (!poppler_structure_element->id && poppler_structure_element->elem->getID ())
> + poppler_structure_element->id = _poppler_goo_string_to_utf8 (poppler_structure_element->elem->getID ());</span >
I'm not sure about caching the values in the object. The PopplerStructElement
is mostly a read-only object, so users will typically convert them into a
different object and will have to duplicate the contents in any case. Or even
if the user wraps the PopplerStructElement in another object, she can still
cache the values in the wrapper instead. So, I would return the new allocated
string like we currently do in other apis like PopplerAnnot.
@@ +509,5 @@
<span class="quote">> + * Return value: (transfer none): The title of the element (if defined),
> + * or %NULL.
> + */
> +const gchar*
> +poppler_structure_element_get_title (PopplerStructureElement *poppler_structure_element)</span >
Same comments for this an all other methods returning a gchar *
@@ +525,5 @@
<span class="quote">> + * popppler_structure_element_get_abbreviation:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Acronyms and abbreviations contained in elements of type
> + * #POPPLER_STRUCTURE_ELEMENT_SPAN may have an associated expanded</span >
#POPPLER_STRUCTURE_ELEMENT_SPAN -> %POPPLER_STRUCTURE_ELEMENT_SPAN
@@ +529,5 @@
<span class="quote">> + * #POPPLER_STRUCTURE_ELEMENT_SPAN may have an associated expanded
> + * text form, which can be retrieved using this function.
> + *
> + * Return value: (transfer none): Text of the expanded abbreviation, if the
> + * element text is an abbreviation or acronym.</span >
or %NULL otherwise
@@ +551,5 @@
<span class="quote">> + * poppler_structure_element_get_language:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: (transfer none): language and country code, in two-letter
> + * ISO format, e.g. <code>en_US</code>, or %NULL if not defined.</span >
In general I prefer explanations in the body and keep the return tag simpler
@@ +633,5 @@
<span class="quote">> +/**
> + * poppler_structure_element_get_text:
> + * @poppler_structure_element: A #PopplerStructureElement
> + * @recursive: If %TRUE, the text of child elements is gathered recursively
> + * in logical order and returned as part of the result.</span >
Do we really want to expose this? I mean, why would someone pass FALSE here?
What about include_child or something like that? I'm not a fan of boolean
parameters in public api. Another possibilidy is making this function private
with the bool param, and expose two different methods
poppler_structure_element_get_text and poppler_structure_element_get_text_full
or using better names (get_text_including_child, get_text_recursive, ...)
@@ +635,5 @@
<span class="quote">> + * @poppler_structure_element: A #PopplerStructureElement
> + * @recursive: If %TRUE, the text of child elements is gathered recursively
> + * in logical order and returned as part of the result.
> + *
> + * Obtains the text enclosed by an element, or the subtree under an element.</span >
I think this is a bit confusing, when recursive is TRUE you get the text
enclosed by an element *and* the subtree, no? not the *or* the subtree. I would
say something like. When @recursive is %TRUE, the text of the child element is
gathered recursively in logical order and included as part of the result.
Keeping the parameter documentation simpler with just something like: whether
to include the text of the children, or something similar.
@@ +662,5 @@
<span class="quote">> + if (s)
> + poppler_structure_element->text = _poppler_goo_string_to_utf8 (s);
> + delete s;
> + }
> + return poppler_structure_element->text;</span >
Looking at the implementation and the fact that you are caching the values
separately (now I see that _r stands for recursive) I really think this should
be split in two different methods.
::: glib/poppler-structure-element.h
@@ +29,5 @@
<span class="quote">> +#define POPPLER_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_CAST ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT, PopplerStructureElement))
> +#define POPPLER_IS_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_TYPE ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT))
> +
> +/**
> + * PopplerStructureElementKind:</span >
Use Type instead of Kind for consistency with all other enums exposed in the
api.
@@ +30,5 @@
<span class="quote">> +#define POPPLER_IS_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_TYPE ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT))
> +
> +/**
> + * PopplerStructureElementKind:
> + */</span >
You should document every enum value here and add the Since tag
::: glib/reference/poppler-docs.sgml
@@ +23,5 @@
<span class="quote">> <xi:include href="xml/poppler-layer.xml"/>
> <xi:include href="xml/poppler-media.xml"/>
> <xi:include href="xml/poppler-movie.xml"/>
> + <xi:include href="xml/poppler-structure.xml"/>
> + <xi:include href="xml/poppler-structure-element.xml"/></span >
These are incorrectly indented.
::: glib/reference/poppler-sections.txt
@@ +536,5 @@
<span class="quote">>
> <SECTION>
> +<FILE>poppler-structure-element</FILE>
> +<TITLE>PopplerStructureElement</TITLE>
> +PopplerTextSpan</span >
Please, only add symbols exposed by this patch.
@@ +538,5 @@
<span class="quote">> +<FILE>poppler-structure-element</FILE>
> +<TITLE>PopplerStructureElement</TITLE>
> +PopplerTextSpan
> +PopplerStructureElement
> +PopplerStructureElementIter</span >
You could add the iter api in a anew subsection</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>