[Poppler-bugs] [Bug 64821] [TAGGEDPDF] Expose the structure tree and attributes in poppler-glib

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Jun 16 04:44:19 PDT 2013


https://bugs.freedesktop.org/show_bug.cgi?id=64821

--- Comment #8 from Carlos Garcia Campos <carlosgc at gnome.org> ---
Comment on attachment 80732
  --> https://bugs.freedesktop.org/attachment.cgi?id=80732
[PATCH v4 5/6] Tagged-PDF: Expose the structure tree in poppler-glib

Review of attachment 80732:
-----------------------------------------------------------------

We are very later in the release cycle, and I have a lot of doubts about the
API, so I think it'll be better to wait for the next cycle to add this API. We
can try to merge the logical structure and tagged pdf support in the core for
this cycle though.

For contents extracted from the document like the logical structure we usually
use an iterator based API, see fonts, index and layer for example. I wonder if
we should do the same for document structure, I'm not sure.

I would split this patch too.

::: glib/poppler-document.cc
@@ +1220,5 @@
>    return retval;
>  }
>  
> +/**
> + * poppler_document_get_structure:

What about using poppler_document_logical_structure? It's a bit longer but
makes clear this returns information about the logical structure of the PDF
document.

@@ +1224,5 @@
> + * poppler_document_get_structure:
> + * @document: A #PopplerDocument
> + *
> + * Returns the #PopplerStructure of the document. This object is owned by
> + * the called.

You should explain a briefly here what the document logical structure is

@@ +1226,5 @@
> + *
> + * Returns the #PopplerStructure of the document. This object is owned by
> + * the called.
> + *
> + * Return value: (transfer full): The #PopplerStructure of the document.

a #PopplerStructure object representing the logical structure of the document,
or %NULL

@@ +1236,5 @@
> +
> +  g_return_val_if_fail (POPPLER_IS_DOCUMENT (document), NULL);
> +
> +  tree_root = document->doc->getStructTreeRoot ();
> +  if (!tree_root) return NULL;

Use two lines for this please.

::: glib/poppler-private.h
@@ +112,5 @@
> +  StructElement *elem;
> +  gchar *id;
> +  gchar *title;
> +  gchar *text;
> +  gchar *text_r;

what does r stand for? please don't use abbreviations

::: glib/poppler-structure-element.cc
@@ +599,5 @@
> +}
> +
> +
> +/**
> + * SECTION:poppler-structure-element

Move this to the beginning.

@@ +617,5 @@
> +{
> +  GObjectClass parent_class;
> +};
> +
> +G_DEFINE_TYPE (PopplerStructureElement, poppler_structure_element, G_TYPE_OBJECT);

Trailing ; is not needed.

@@ +628,5 @@
> +
> +  g_assert (structure);
> +  g_assert (element);
> +
> +  poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);

What's the second NULL? I you are not providing constructor properties, a
single NULL is enough.

@@ +631,5 @@
> +
> +  poppler_structure_element = (PopplerStructureElement *) g_object_new (POPPLER_TYPE_STRUCTURE_ELEMENT, NULL, NULL);
> +  poppler_structure_element->text      = NULL;
> +  poppler_structure_element->text_r    = NULL;
> +  poppler_structure_element->children  = NULL;

GObjects are 0 filled when allocated, you can avoid all NULL/0 initializations.

@@ +636,5 @@
> +  poppler_structure_element->structure = structure;
> +  poppler_structure_element->elem      = element;
> +
> +  if (element->getNumElements ())
> +    poppler_structure_element->children = (PopplerStructureElement**) g_new0 (PopplerStructureElement*,

GPtrArray could be more convenient to use. You can use g_ptr_array_new_full
(element->getNumElements (), g_object_unref);

@@ +698,5 @@
> +/**
> + * poppler_structure_element_get_page:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Return value: Number of the page that contains the element, of

of -> or

@@ +702,5 @@
> + * Return value: Number of the page that contains the element, of
> + *    <code>-1</code> if not defined.
> + */
> +gint
> +poppler_structure_element_get_page (PopplerStructureElement *poppler_structure_element)

Use just element or structure_element to make it a bit shorter.

@@ +720,5 @@
> +/**
> + * poppler_structure_element_is_content:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Checks whether an element is actual document content.

You should explain here what document comments means here.

@@ +737,5 @@
> +/**
> + * poppler_structure_element_is_inline:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Checks whether an element is an inline element.

Ditto, what's an inline element?

@@ +754,5 @@
> +/**
> + * poppler_structure_element_is_block:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Checks whether an element is a block element.

Ditto.

@@ +771,5 @@
> +/**
> + * poppler_structure_element_get_n_children:
> + * @poppler_structure_element: A #PopplerStructureElement
> + *
> + * Gets the number of children of @structure_element.

argument is @poppler_structure_element not @structure_element. I'm fine with
the shorter name, but use it consistently.

@@ +791,5 @@
> + * @index: Index of the children element to obtain.
> + *
> + * Return value: (transfer none): A #PopplerStructureElement.
> + */
> +PopplerStructureElement*

PopplerStructureElement *

@@ +796,5 @@
> +poppler_structure_element_get_child (PopplerStructureElement *poppler_structure_element,
> +                                     guint                    index)
> +{
> +  g_return_val_if_fail (POPPLER_IS_STRUCTURE_ELEMENT (poppler_structure_element), NULL);
> +  g_assert (poppler_structure_element->elem);

This should not be needed, it's already checked when the object is created.

@@ +797,5 @@
> +                                     guint                    index)
> +{
> +  g_return_val_if_fail (POPPLER_IS_STRUCTURE_ELEMENT (poppler_structure_element), NULL);
> +  g_assert (poppler_structure_element->elem);
> +  g_assert (poppler_structure_element->elem->getNumElements () >= 0);

Use g_return_val_if_fail

@@ +816,5 @@
> + *
> + * Return value: (transfer none): The identifier of the element (if
> + *    defined), or %NULL.
> + */
> +const gchar*

const gchar *

@@ +1136,5 @@
> + * @attribute: A #PopperStructureAttribute value.
> + * @value (out): A #GValue in which to return the value of the attribute.
> + * @inherit: Whether to look up for inheritable attribute values in the
> + *    ancestors of the element, if the attribute is not defined in the
> + *    element.

Is this something the API user should decided? or are some attributes that are
inherited by definition in the spec?

@@ +1276,5 @@
> + */
> +GVariant*
> +poppler_structure_element_get_attribute (PopplerStructureElement  *poppler_structure_element,
> +                                         PopplerStructureAttribute attribute,
> +                                         gboolean                  inherit)

This kind of API where a single method does everything is not convenient to use
in the end, the use always have to deal with a GVariant. I prefer have
different methods, get_attribute_int, get_attribute_bool, get_attribute_color,
etc.

@@ +1450,5 @@
> + *
> + * Return value: Whether the element is a reference to another object.
> + */
> +gboolean
> +poppler_structure_element_is_reference (PopplerStructureElement *poppler_structure_element)

Is this really needed? Object reference are something internal to the PDF
structure that I'm not sure we want to expose in the API. Maybe we can merge
the content and reference type and have a get_content_type() or something
similar where content type can be link, annot, form field, contents (not sure
contents is the best name to expose marked content, though), etc.

@@ +1495,5 @@
> +  return reftype;
> +}
> +
> +/**
> + * poppler_structure_element_get_reference_link:

If I understand this correctly, the struct element corresponds to a link in the
document, right? Shouldn't it be poppler_structure_element_get_referenced_link,
since the links is referenced by the struct element?

@@ +1502,5 @@
> + * Return value: (transfer full): The #PopplerAnnot pointed by the object
> + *    reference, or %NULL of the element is not a reference pointing to
> + *    a #PopplerLink.
> + */
> +PopplerLinkMapping*

Why do we return a mapping? Struct tree elements are supposed to be the
document structure independently from where they are physically in the
document, no? To get the link mapping we have API for that. There's, however, a
bounding box attribute defined for tagged PDFs for illustration elements and
tables, but I guess it should be get as an attribute of the struct element.

@@ +1584,5 @@
> + * Return value: (transfer full): A #PopplerLinkMapping, or %NULL if the
> + *    link cannot be found.
> + */
> +PopplerLinkMapping*
> +poppler_structure_element_find_link (PopplerStructureElement *poppler_structure_element)

I think this is something that can be done with
poppler_structure_element_get_child  and
poppler_structure_element_get_reference_link. I would avoid this kind of
convenient methods for now, if we see this is commonly duplicated by API users
we can always add it later.

::: glib/poppler-structure-element.h
@@ +29,5 @@
> +#define POPPLER_STRUCTURE_ELEMENT(obj)    (G_TYPE_CHECK_INSTANCE_CAST ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT, PopplerStructureElement))
> +#define POPPLER_IS_STRUCTURE_ELEMENT(obj) (G_TYPE_CHECK_INSTANCE_TYPE ((obj), POPPLER_TYPE_STRUCTURE_ELEMENT))
> +
> +/**
> + * PopplerStructureElementKind:

Use Type instead of Kind for consistency with other enums like
PopplerActionType, PopplerDestType, PopplerAnnotType, etc.

::: glib/poppler-structure.cc
@@ +131,5 @@
> +
> +
> +
> +/**
> + * SECTION:poppler-structure

Move this section description to the beginning, after the includes.

@@ +164,5 @@
> +{
> +  GObjectClass parent_class;
> +};
> +
> +G_DEFINE_TYPE (PopplerStructure, poppler_structure, G_TYPE_OBJECT);

The trailing ; is not needed.

@@ +172,5 @@
> +poppler_structure_init (PopplerStructure *poppler_structure)
> +{
> +}
> +
> +

Leve only one empty line between functions

@@ +200,5 @@
> +  gobject_class->finalize = poppler_structure_finalize;
> +}
> +
> +
> +PopplerStructure*

PopplerStructure *

@@ +207,5 @@
> +{
> +  PopplerStructure *poppler_structure;
> +
> +  g_return_val_if_fail (POPPLER_IS_DOCUMENT (poppler_document), NULL);
> +  g_assert (root);

why g_return_val_if_fail for poppler_document but g_assert for root? Use
g_assert in both cases, since this is not public API method.

@@ +212,5 @@
> +
> +  poppler_structure = (PopplerStructure*) g_object_new (POPPLER_TYPE_STRUCTURE, NULL, NULL);
> +
> +  poppler_structure->document = (PopplerDocument*) g_object_ref (poppler_document);
> +  poppler_structure->root     = root;

Do not line up the =

@@ +213,5 @@
> +  poppler_structure = (PopplerStructure*) g_object_new (POPPLER_TYPE_STRUCTURE, NULL, NULL);
> +
> +  poppler_structure->document = (PopplerDocument*) g_object_ref (poppler_document);
> +  poppler_structure->root     = root;
> +  poppler_structure->children = NULL;

GObjects are alwats 0 filled when allocated, this is already NULL.

@@ +217,5 @@
> +  poppler_structure->children = NULL;
> +
> +  if (root->getNumElements ())
> +    poppler_structure->children = (PopplerStructureElement**) g_new0 (PopplerStructureElement*,
> +                                                                      root->getNumElements ());

A GPtrArray could be more convenient to use.

@@ +250,5 @@
> +poppler_structure_get_child (PopplerStructure *poppler_structure,
> +                             guint             index)
> +{
> +  g_return_val_if_fail (POPPLER_IS_STRUCTURE (poppler_structure), NULL);
> +  g_assert (poppler_structure->root);

I don't think this is needed, PopplerStructure object is created by poppler
internally with a StructTreeRoot and there's already an assert in
_poppler_structure_new to make usre it's created with a valid pointer.

@@ +251,5 @@
> +                             guint             index)
> +{
> +  g_return_val_if_fail (POPPLER_IS_STRUCTURE (poppler_structure), NULL);
> +  g_assert (poppler_structure->root);
> +  g_assert (poppler_structure->root->getNumElements () >= 0);

Use g_return_val_if_fail here too

@@ +258,5 @@
> +  if (!poppler_structure->children[index])
> +    {
> +      poppler_structure->children[index] = _poppler_structure_element_new (poppler_structure,
> +                                                                           poppler_structure->root->getElement (index));
> +      g_object_ref_sink (poppler_structure->children[index]);

Why are you doing this? PopplerStructureElement is a GObject not a
GInitiallyUnowned, so it's created with a real reference. You are leaking this
element.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/poppler-bugs/attachments/20130616/12cb3dda/attachment-0001.html>


More information about the Poppler-bugs mailing list