[poppler] How to read textbox positions?

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Sat Jan 27 08:17:37 UTC 2018


Dear Albert,

Oops, I apologize that I overlooked your response to
improve the patch.

> You have one text_box where you have 17 char bboxes but the ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash because you get an out of bounds.

Indeed, it might be common scenarios for the
programmers assuming the input as ASCII data but
something different is given.

> So it would be nice to make 
> rectf     char_bbox(int i) const;
> not crash on an invalid i, i guess return a 0x0 rect
> and then probably add some documentation about it?

Checking the features of C++ std::vector, I think
we have 2 options:

a) use (as current patch) m_data->char_bbox[i] and
sanitize the i-index before using it (as you proposed),
and the bad values cause 0x0 rectf.

b) use m_data->char_bbox.at(i) and let stdlib raise
std::out_of_range error.

which is better for C++ programmers?

> FWIW this was a real problem in Okular and i fixed it with 
> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cpp?id=dcf0e78227959a52300d8f253c4b1058b3e81567
> 
> Am i making any sense here?

Thank you very much for the important suggestion and
sorry for delayed reponse!

Regards,
mpsuzuki


Albert Astals Cid wrote:
> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va escriure:
>> Hello mpsuzuki,
>>
>> I do not think you need the two indirections: Since you want to keep
>> that m_data indirection for ABI compatibility, just return
>> std::vector<text_box> as text_box basically "is" a unique_ptr to its
>> data with the additional getter methods. (You just need to make sure to
>> default the move constructor and assignment operator since no copying
>> implies no moving by default.)
>>
>> Attached is an update patch along those lines. (I also took the liberty
>> to replace some bare deletes by use of unique_ptr in the implementation
>> of page::text_list as well.)
> 
> One thing that i wanted to improve in the Qt5 frontend but never go to, maybe we can start here is the fact that the number of boxes is "a bit random", for example take https://bugs.kde.org/attachment.cgi?id=66189
> 
> You have one text_box where you have 17 char bboxes but the ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash because you get an out of bounds.
> 
> So it would be nice to make 
> rectf     char_bbox(int i) const;
> not crash on an invalid i, i guess return a 0x0 rect
> and then probably add some documentation about it?
> 
> FWIW this was a real problem in Okular and i fixed it with 
> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cpp?id=dcf0e78227959a52300d8f253c4b1058b3e81567
> 
> Am i making any sense here?
> 
> Cheers,
>   Albert
> 
>> Best regards, Adam.
>>
>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya:
>>> Dear Adam,
>>>
>>> Thank you for the technical information on the difference of
>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace
>>> the vector of the pointers to that of the unique_ptrs, and
>>> remove the next_xxx methods.
>>>
>>> I attached the revised patch (gzipped) to the master, and
>>> here I quote the revised part from my last revision, for
>>> convenience to take a look on it.
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h
>>> index 726068a..2cedd1e 100644
>>> --- a/cpp/poppler-private.h
>>> +++ b/cpp/poppler-private.h
>>> @@ -72,12 +72,11 @@ class text_box_data
>>>
>>>  {
>>>  
>>>  public:
>>>      text_box_data()
>>>
>>> -    : next_text_box(0), has_space_after(false)
>>> +    : has_space_after(false)
>>>
>>>      {
>>>      }
>>>      ustring text;
>>>      rectf bbox;
>>>
>>> -    text_box *next_text_box;
>>>
>>>      std::vector<rectf> char_bboxes;
>>>      bool has_space_after;
>>>  
>>>  };
>>>
>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h
>>> index 18af213..3013997 100644
>>> --- a/cpp/poppler-page.h
>>> +++ b/cpp/poppler-page.h
>>> @@ -22,6 +22,8 @@
>>>
>>>  #include "poppler-global.h"
>>>  #include "poppler-rectangle.h"
>>>
>>> +#include <memory>
>>> +
>>>
>>>  namespace poppler
>>>  {
>>>
>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box {
>>>
>>>  	~text_box();
>>>  	ustring   text() const;
>>>  	rectf     bbox() const;
>>>
>>> -	text_box *next_text_box() const;
>>> -	text_box *next_word() { return this->next_text_box(); };
>>>
>>>  	rectf     char_bbox(int i) const;
>>>  	bool      has_space_after() const;
>>>  	
>>>      private:
>>> -	text_box_data* m_data;
>>> +	std::unique_ptr<text_box_data> m_data;
>>>
>>>  };
>>>  
>>>  class document;
>>>
>>> @@ -79,7 +79,7 @@ public:
>>>      ustring text(const rectf &rect = rectf()) const;
>>>      ustring text(const rectf &rect, text_layout_enum layout_mode) const;
>>>
>>> -    std::vector<text_box*> text_list(rotation_enum rotation) const;
>>> +    std::vector<std::unique_ptr<text_box>> text_list(rotation_enum
>>> rotation) const;> 
>>>  private:
>>>      page(document_private *doc, int index);
>>>
>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp
>>> index 9461ab9..5da2e1d 100644
>>> --- a/cpp/poppler-page.cpp
>>> +++ b/cpp/poppler-page.cpp
>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, text_layout_enum
>>> layout_mode) const
>>>
>>>   */
>>>  
>>>  text_box::text_box(const ustring& text, const rectf &bbox)
>>>  {
>>>
>>> -    m_data = new text_box_data();
>>> +    m_data = std::unique_ptr<text_box_data>(new text_box_data());
>>>
>>>      m_data->text = text;
>>>      m_data->bbox = bbox;
>>>  
>>>  }
>>>  
>>>  text_box::~text_box()
>>>  {
>>>
>>> -    delete m_data;
>>>
>>>  }
>>>  
>>>  ustring text_box::text() const
>>>
>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const
>>>
>>>      return m_data->bbox;
>>>  
>>>  }
>>>
>>> -text_box* text_box::next_text_box() const
>>> -{
>>> -    return m_data->next_text_box;
>>> -}
>>> -
>>>
>>>  rectf text_box::char_bbox(int i) const
>>>  {
>>>  
>>>      return m_data->char_bboxes[i];
>>>
>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const
>>>
>>>      return m_data->has_space_after;
>>>  
>>>  }
>>>
>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const
>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum
>>> rotate) const> 
>>>  {
>>>  
>>>      TextOutputDev *output_dev;
>>>
>>> -    std::vector<text_box*>  output_list;
>>> +    std::vector<std::unique_ptr<text_box>>  output_list;
>>>
>>>      const int rotation_value = (int)rotate * 90;
>>>      
>>>      /* config values are same with Qt5 Page::TextList() */
>>>
>>> @@ -366,7 +360,7 @@ std::vector<text_box*> page::text_list(rotation_enum
>>> rotate) const
>>>
>>>  	double xMin, yMin, xMax, yMax;
>>>  	word->getBBox(&xMin, &yMin, &xMax, &yMax);
>>>
>>> -	text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin,
>>> yMax-yMin)); +	std::unique_ptr<text_box> tb =
>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, xMax-xMin,
>>> yMax-yMin)));
>>>
>>>  	tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue);
>>>  	
>>>  	tb->m_data->char_bboxes.reserve(word->getLength());
>>>
>>> @@ -375,10 +369,7 @@ std::vector<text_box*> page::text_list(rotation_enum
>>> rotate) const
>>>
>>>  	    tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-xMin,
>>>  	    yMax-yMin));
>>>  	
>>>  	}
>>>
>>> -	if (output_list.size() > 0)
>>> -	    output_list.back()->m_data->next_text_box = tb;
>>> -
>>> -	output_list.push_back(tb);
>>> +	output_list.push_back(std::move(tb));
>>>
>>>      }
>>>      delete word_list;
>>>      delete output_dev;
>>>
>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp
>>> index 09b180d..d26a0ef 100644
>>> --- a/cpp/tests/poppler-dump.cpp
>>> +++ b/cpp/tests/poppler-dump.cpp
>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p)
>>>
>>>          std::cout << std::endl;
>>>          return;
>>>      
>>>      }
>>>
>>> -    std::vector<poppler::text_box*> text_list =
>>> p->text_list(poppler::rotate_0); +   
>>> std::vector<std::unique_ptr<poppler::text_box>> text_list =
>>> p->text_list(poppler::rotate_0);
>>>
>>>      std::cout << "---" << std::endl;
>>>      for (size_t i = 0; i < text_list.size(); i ++) {
>>>
>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p)
>>>
>>>          std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" <<
>>>
>>> bbox.width() << " h=" << bbox.height() << " )";
>>>
>>>          std::cout << std::endl;
>>>
>>> -        delete text_list[i];
>>>
>>>      }
>>>      std::cout << "---" << std::endl;
>>>  
>>>  }
>>>
>>> Adam Reichold wrote:
>>>> Hello mpsuzuki,
>>>>
>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 only
>>>> since a relatively short time, so I guess there just was no time to use
>>>> it anywhere.
>>>>
>>>> In any case it is a pretty heavy-handed solution due to the atomic
>>>> memory accesses on the reference counter which will slow down even
>>>> single-threaded programs which do not need it. Hence I would suggest
>>>> starting with std::unique_ptr to get a move-only type.
>>>>
>>>> Best regards, Adam.
>>>>
>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya:
>>>>> Dear Albert, Adam,
>>>>>
>>>>> Thank you for the helpful review. I should have more
>>>>> training on C++... Please let me spend to improve the
>>>>> patch.
>>>>>
>>>>>> (Another easy option albeit with some
>>>>>> overhead to get automatic destruction would be to wrap the actual data
>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>> Yeah, from some posts in stackoverflow, I found about
>>>>> shared_ptr... But it seems that current poppler does
>>>>> not use shared_ptr at all. Is this a coding policy?
>>>>>
>>>>> Regards,
>>>>> mpsuzuki
>>>>>
>>>>> Adam Reichold wrote:
>>>>>> Hello
>>>>>>
>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid:
>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki toshiya
>>>>>>> va
>>>>>>>
>>>>>>> escriure:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've tried to implement the suggestion, I attached my current patch.
>>>>>>>>
>>>>>>>> As suggested, the most part is just copied from Qt frontend and
>>>>>>>> renamed,
>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing,
>>>>>>>> because the returned object is a pointer to TextBox. I wrote
>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which
>>>>>>>> calls next_text_box() internally.
>>>>>>>>
>>>>>>>> Another point I want to discuss is the design of the list give by
>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() returns
>>>>>>>> QList<TextBox*>. For similarity, current patch returns
>>>>>>>> std::vector<text_box*> for similarity to Qt frontend.
>>>>>>>>
>>>>>>>> But, if we return the vector of pointers, the client should destruct
>>>>>>>> the objects pointed by the vector, before destructing vector itself.
>>>>>>>> Using a vector of text_box (not the pointer but the object itself),
>>>>>>>> like std::vector<text_box>, could be better, because the destructor
>>>>>>>> of the vector would internally call the destructor for text_box
>>>>>>>> object.
>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have such).
>>>>>>>> If I'm misunderstanding about C++, please correct.
>>>>>>> First you need to fix the text_box class, you either need to make it
>>>>>>> uncopiable (and thus making your suggestion not possible) or you need
>>>>>>> to make the copy and assignment operators not break your class since
>>>>>>> at the moment the default operators will just copy the pointer and
>>>>>>> everything will break because you will end up with a dangling
>>>>>>> pointer.
>>>>>>>
>>>>>>> If you go for the second option you'll want the move constructor and
>>>>>>> move
>>>>>>> operator too.
>>>>>> Should the first option also be viable if text_box becomes a move-only
>>>>>> type, i.e. no copying but move construction and assignment? I think
>>>>>> std::vector can handle it. (Another easy option albeit with some
>>>>>> overhead to get automatic destruction would be to wrap the actual data
>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>>>
>>>>>> Best regards, Adam.
>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>   Albert
>>>>>>>> Regards,
>>>>>>>> mpsuzuki
>>>>>>>>
>>>>>>>> Albert Astals Cid wrote:
>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen Ooms
>>>>>>>>> va
>>>>>>>>>
>>>>>>>>> escriure:
>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf
>>>>>>>>>> document,
>>>>>>>>>> including the position of each text box? Currently we use
>>>>>>>>>> page->text()
>>>>>>>>>> with page::physical_layout which gives all text per page, but I
>>>>>>>>>> need
>>>>>>>>>> more detailed information about each text box per page.
>>>>>>>>> You want to code the variant of qt5 frontend
>>>>>>>>> Poppler::Page::textList() for
>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration (i.e.
>>>>>>>>> almost-copying) the code, do you have time for it?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>>   Albert
>>>>>>>>>> _______________________________________________
>>>>>>>>>> poppler mailing list
>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>> _______________________________________________
>>>>>>>>> poppler mailing list
>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>> _______________________________________________
>>>>>>> poppler mailing list
>>>>>>> poppler at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> 



More information about the poppler mailing list