[Poppler-bugs] [Bug 64821] [TAGGEDPDF] Expose the structure tree and attributes in poppler-glib

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Oct 15 01:53:38 PDT 2013


https://bugs.freedesktop.org/show_bug.cgi?id=64821

--- Comment #23 from Carlos Garcia Campos <carlosgc at gnome.org> ---
Comment on attachment 86681
  --> https://bugs.freedesktop.org/attachment.cgi?id=86681
[PATCH v8 08/15] glib: Expose the document structure tree

Review of attachment 86681:
-----------------------------------------------------------------

The API looks good to me in general, I think we should not cache the text in
the StructElement, and documentation should be improved a bit. Thanks.

::: glib/poppler-structure-element.cc
@@ +44,5 @@
> + * and poppler_structure_get_child() to enumerate the top level elements.
> + */
> +
> +static PopplerStructureElement *
> +  _poppler_structure_element_new (PopplerDocument *document, StructElement *element);

If this is only used here, simply move the code before it's used so that we
don't need the prototype. I think it makes sense to add all the iter
implementation after the StructElement impl.

@@ +70,5 @@
> + *
> + * Creates a new #PopplerStructureElementIter as a copy of @iter. The
> + * returned value must be freed with poppler_structure_element_iter_free().
> + *
> + * Return value: a new #PopplerStructureElementIter

(transfer full)

@@ +71,5 @@
> + * Creates a new #PopplerStructureElementIter as a copy of @iter. The
> + * returned value must be freed with poppler_structure_element_iter_free().
> + *
> + * Return value: a new #PopplerStructureElementIter
> + */

You should add Since: 0.26 to all doc comments.

@@ +105,5 @@
> +/**
> + * poppler_structure_element_iter_new:
> + * @poppler_document: a #PopplerDocument.
> + *
> + * Returns the root #PopplerStructureElementIter for @document, or %NULL. The

@document -> @poppler_document or better use just document in parameter name.

@@ +155,5 @@
> +
> +  iter = g_slice_new0 (PopplerStructureElementIter);
> +  iter->document = (PopplerDocument *) g_object_ref (poppler_document);
> +  iter->is_root  = TRUE;
> +  iter->root     = root;

Do not line up the =

@@ +190,5 @@
> + * @iter: a #PopplerStructureElementIter
> + *
> + * Returns the #PopplerStructureElementIter associated with @iter.
> + *
> + * Return value: (transfer full): a new #PopplerStructureElementIter

#PopplerStructureElementIter -> #PopplerStructureElement

@@ +210,5 @@
> +/**
> + * poppler_structure_element_iter_get_child:
> + * @parent: a #PopplerStructureElementIter
> + *
> + * Return value: a new #PopplerStructureElementIter

you should document this can return %NULL if the iter doesn't have any child

@@ +344,5 @@
> +{
> +  GObjectClass parent_class;
> +};
> +
> +G_DEFINE_TYPE (PopplerStructureElement, poppler_structure_element, G_TYPE_OBJECT);

Unneeded trailing semicolon

@@ +355,5 @@
> +
> +  g_assert (POPPLER_IS_DOCUMENT (document));
> +  g_assert (element);
> +
> +  poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);

What's the second NULL?

@@ +357,5 @@
> +  g_assert (element);
> +
> +  poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);
> +  poppler_structure_element->document  = (PopplerDocument *) g_object_ref (document);
> +  poppler_structure_element->elem      = element;

Do not line up the =

@@ +401,5 @@
> + *
> + * Return value: A #PopplerStructureElementKind value.
> + */
> +PopplerStructureElementKind
> +poppler_structure_element_get_kind (PopplerStructureElement *poppler_structure_element)

Use structure_element or even just element for the parameter to make everything
a bit shorter

@@ +413,5 @@
> +/**
> + * poppler_structure_element_get_page:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: Number of the page that contains the element, of

of -> or

@@ +437,5 @@
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Checks whether an element is actual document content.
> + *
> + * Return value: Whether the element is content.

%TRUE if @poppler_structure_element is document content, or %FALSE otherwise.
And same pattern for all other is_foo methods

@@ +485,5 @@
> +
> +/**
> + * poppler_structure_element_get_id:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *

Maybe we could document here what the element identifier is for and why it's
optional?

@@ +487,5 @@
> + * poppler_structure_element_get_id:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: (transfer none): The identifier of the element (if
> + *    defined), or %NULL.

Do not use the parentheses for if defined.

@@ +489,5 @@
> + *
> + * Return value: (transfer none): The identifier of the element (if
> + *    defined), or %NULL.
> + */
> +const gchar*

gchar* -> gchar *

@@ +496,5 @@
> +  g_return_val_if_fail (POPPLER_IS_STRUCTURE_ELEMENT (poppler_structure_element), NULL);
> +  g_assert (poppler_structure_element->elem);
> +
> +  if (!poppler_structure_element->id && poppler_structure_element->elem->getID ())
> +    poppler_structure_element->id = _poppler_goo_string_to_utf8 (poppler_structure_element->elem->getID ());

I'm not sure about caching the values in the object. The PopplerStructElement
is mostly a read-only object, so users will typically convert them into a
different object and will have to duplicate the contents in any case. Or even
if the user wraps the PopplerStructElement in another object, she can still
cache the values in the wrapper instead. So, I would return the new allocated
string like we currently do in other apis like PopplerAnnot.

@@ +509,5 @@
> + * Return value: (transfer none): The title of the element (if defined),
> + *    or %NULL.
> + */
> +const gchar*
> +poppler_structure_element_get_title (PopplerStructureElement *poppler_structure_element)

Same comments for this an all other methods returning a gchar *

@@ +525,5 @@
> + * popppler_structure_element_get_abbreviation:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Acronyms and abbreviations contained in elements of type
> + * #POPPLER_STRUCTURE_ELEMENT_SPAN may have an associated expanded

#POPPLER_STRUCTURE_ELEMENT_SPAN -> %POPPLER_STRUCTURE_ELEMENT_SPAN

@@ +529,5 @@
> + * #POPPLER_STRUCTURE_ELEMENT_SPAN may have an associated expanded
> + * text form, which can be retrieved using this function.
> + *
> + * Return value: (transfer none): Text of the expanded abbreviation, if the
> + *    element text is an abbreviation or acronym.

or %NULL otherwise

@@ +551,5 @@
> + * poppler_structure_element_get_language:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: (transfer none): language and country code, in two-letter
> + *    ISO format, e.g. <code>en_US</code>, or %NULL if not defined.

In general I prefer explanations in the body and keep the return tag simpler

@@ +633,5 @@
> +/**
> + * poppler_structure_element_get_text:
> + * @poppler_structure_element: A #PopplerStructureElement
> + * @recursive: If %TRUE, the text of child elements is gathered recursively
> + *   in logical order and returned as part of the result.

Do we really want to expose this? I mean, why would someone pass FALSE here?
What about include_child or something like that? I'm not a fan of boolean
parameters in public api. Another possibilidy is making this function private
with the bool param, and expose two different methods
poppler_structure_element_get_text and poppler_structure_element_get_text_full
or using better names (get_text_including_child, get_text_recursive, ...)

@@ +635,5 @@
> + * @poppler_structure_element: A #PopplerStructureElement
> + * @recursive: If %TRUE, the text of child elements is gathered recursively
> + *   in logical order and returned as part of the result.
> + *
> + * Obtains the text enclosed by an element, or the subtree under an element.

I think this is a bit confusing, when recursive is TRUE you get the text
enclosed by an element *and* the subtree, no? not the *or* the subtree. I would
say something like. When @recursive is %TRUE, the text of the child element is
gathered recursively in logical order and included as part of the result.
Keeping the parameter documentation simpler with just something like: whether
to include the text of the children, or something similar.

@@ +662,5 @@
> +      if (s)
> +        poppler_structure_element->text = _poppler_goo_string_to_utf8 (s);
> +      delete s;
> +    }
> +  return poppler_structure_element->text;

Looking at the implementation and the fact that you are caching the values
separately (now I see that _r stands for recursive) I really think this should
be split in two different methods.

::: glib/poppler-structure-element.h
@@ +29,5 @@
> +#define POPPLER_STRUCTURE_ELEMENT(obj)    (G_TYPE_CHECK_INSTANCE_CAST ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT, PopplerStructureElement))
> +#define POPPLER_IS_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_TYPE ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT))
> +
> +/**
> + * PopplerStructureElementKind:

Use Type instead of Kind for consistency with all other enums exposed in the
api.

@@ +30,5 @@
> +#define POPPLER_IS_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_TYPE ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT))
> +
> +/**
> + * PopplerStructureElementKind:
> + */

You should document every enum value here and add the Since tag

::: glib/reference/poppler-docs.sgml
@@ +23,5 @@
>      <xi:include href="xml/poppler-layer.xml"/>
>      <xi:include href="xml/poppler-media.xml"/>
>      <xi:include href="xml/poppler-movie.xml"/>
> +		<xi:include href="xml/poppler-structure.xml"/>
> +		<xi:include href="xml/poppler-structure-element.xml"/>

These are incorrectly indented.

::: glib/reference/poppler-sections.txt
@@ +536,5 @@
>  
>  <SECTION>
> +<FILE>poppler-structure-element</FILE>
> +<TITLE>PopplerStructureElement</TITLE>
> +PopplerTextSpan

Please, only add symbols exposed by this patch.

@@ +538,5 @@
> +<FILE>poppler-structure-element</FILE>
> +<TITLE>PopplerStructureElement</TITLE>
> +PopplerTextSpan
> +PopplerStructureElement
> +PopplerStructureElementIter

You could add the iter api in a anew subsection

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/poppler-bugs/attachments/20131015/98183eea/attachment-0001.html>


More information about the Poppler-bugs mailing list