[poppler] [RFC] Replace GooHash by std::unordered_map

Adam Reichold adam.reichold at t-online.de
Wed Feb 21 06:13:57 UTC 2018


Hello again,

Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va 
> escriure:
>> Hello again,
>>
>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam Reichold va
>>>>
>>>> escriure:
>>>>> Hello,
>>>>>
>>>>> the attached patch replaced Poppler's internal hash table implementation
>>>>> GooHash by std::unordered_map found in the C++ standard library since
>>>>> C++11. This continues Poppler's slow drift towards standard library
>>>>> containers and removes one of the home-grown data structures with the
>>>>> main goals of reducing code size and improving the long term
>>>>> maintainability of the code base.
>>>>
>>>> Do you have any benchmarks on "rendering" speed and code size?
>>>
>>> Sorry, not at hand. I will try to prepare them during the week.
>>
>> I did run Splash rendering benchmarks of this branch against master with
>> the result of rendering the circa 2400 documents of the TeXLive
>> documentation present on my machine being:
> 
> I'm wondering if those 2400 documents are diverse enough, which they may not 
> be given they are all coming from "the same place".

They seem pretty diverse w.r.t. content, some being text heavy and some
graphics rich. But I guess they are definitely not diverse technically
as all are prepared using TeX itself.

The main problem on my side is that I have failed to find my DVD copy of
the Poppler regtest document collection until now. :-\ In any case, I am
reasonably confident on the zero sum result since GooHash does not seem
to live in any performance critical code path.

>>
>> Cumulative run time:
>>         Result: 90.95 min ∓ 1.1 %
>>         Reference: 91.57 min ∓ 1.2 %
>>         Deviation: -0.0 %
>> Cumulative memory usage:
>>         Result: 37.2 MB ∓ 0.7 %
>>         Reference: 37.0 MB ∓ 0.7 %
>>         Deviation: +0.0 %
>>
>> (Where result is this patch and the reference is master.) (The
>> measurement was taken using the perftest script which I proposed here
>> some time ago for which I'll attach the patch again for reproduceability.)
> 
> Did you really attach this before? i really don't remember it :D

This was a very long-winded thread ending on 2016-01-01 centered around
the regtest framework. It went from custom driver using QTest's
benchmark functionality to custom backend in the regtest framework to
separate Python script. The script still has problems to reliably
measure really short things like running "pdftotext" for which a custom
C++ driver might be needed, but for long-running stuff like "pdftoppm",
the results seem reasonable.

> Anyhow it seems the change is mostly a zero-sum runtime wise. 

Indeed. (Although thinking really really long term, I think a Poppler
that is using std::vector, std::string and friends everywhere, could
loose a lot of distributed fat in the form of a lot of memory
indirections and benefit from the highly optimized standard library. But
it just is not a single patch thing to reap any of these benefits.)

> Which bring us to the code side. First i know it sucks but your spacing is 
> broken, please check the whole patch
> 
> -	nameToGID->removeInt(macGlyphNames[j]);
> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
> +          nameToGID[macGlyphNames[j]] = i;
> 
> i could fix it, but it's better (for me) if you do :D

Definitely for me to fix. The main problem is that the spacing in those
files was already broken and I am unsure whether I should fix the
surrounding spacing even if I am not touching it otherwise. Due to that,
there sometimes is no correct way in the current patch. If you do not
say otherwise, I will try to make at least the touched blocks of code
consistent.

> Now onto the code, 
> 
>   const auto emplaceRangeMap = [&](const char* encodingName, GBool unicodeOut, 
> UnicodeMapRange* ranges, int len) {
>     residentUnicodeMaps.emplace(
>       std::piecewise_construct,
>       std::forward_as_tuple(encodingName),
>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
>     );
>   };
> 
> It's the first time in my life i see std::piecewise_construct and 
> std::forward_as_tuple, yes that probably makes me a bad C++ developer, but 
> given there's lots of us around, it doesn't make me happy now i don't 
> understand what the code does.

The problem is that most internal Poppler types lack proper construction
and assignment operators, hence I need to work harder to construct those
UnicodeMap instances in-place inside the map (GooHash just stored a
pointer to it, so there was no problem.)

The whole piecewise_construct and forward_as_tuple dance just ensures,
that those parameters to emplace are properly grouped before being
unpacked to become the parameters of std::string and UnicodeMap
construction again. If UnicodeMap was movable, this would probably look like

residentUnicodeMaps.emplace(
  encodingName,
  UnicodeMap{encodingName, unicodeOut, ranges, len}
);

If you like, I can try to make Unicode a move-only type and simplify the
mentioned helper functions?

> Then there's the benefit/risk ratio. The code using GooHash is mostly 
> rocksolid and we haven't probably touched any line in it for a long time and 
> we have probably neither written new code that uses GooHash.
> 
> In that regard we risk breaking working code.
> 
> The benefit is not more speed nor less memory usage as your measurements show.
> 
> Mostly the benefit is "removing code from maintainership", which i agree is a 
> good thing, but as mentioned before it's some code "we mostly ignore", so it's 
> not that much of a good thing.

I very much agree with the risk assessment.

But I also think the code will ossify (or maybe already is?) due to
those custom data structures and the less than idiomatic C++ usage.
Hence I think, Poppler would not just loose code, but the remaining code
should become easier to maintain. (Of course, the piecewise_construct
fiasco shows that this has intermediate costs. But again, I think this
is just an incentive to provide types with the usual C++ semantics which
should make all code using those types better.)

Best regards, Adam.

> So i'm hesitant of what to do. It really sounds like a nice thing to not have 
> custom structures, but on the other hand it's not like they get much in the 
> way of coding.
> 
> I'd really appreciate other people's comments on this.
> 
> Cheers,
>   Albert
> 
>>
>> I'll also attach the detailed comparison, but the gist seems to be that
>> if there are significant changes, the run time is reduced but the memory
>> usage is increased in the majority of cases. But this does not seem to
>> show up in the cumulative results.
>>
>> Best regards, Adam.
>>
>> P.S.: One could try to improve the memory usage by tuning the load
>> factor or calling shrink_to_fit where appropriate. Would you like me to
>> try to do this?
>>
>> P.P.S.: One obvious area for improvement would be better
>> interoperability between GooString and std::string, i.e. not converting
>> them as C strings so that the length information does not need to be
>> recomputed. I will try to prepare this as a separate patch on top of
>> this one or should I include this here?
>>
>> Best regards, Adam.
>>
>>> Concerning code size, a release build of libpoppler.so goes from
>>>
>>>    text    data     bss     dec     hex filename
>>>
>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
>>>
>>> for the current master to
>>>
>>>    text    data     bss     dec     hex filename
>>>
>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
>>>
>>> with the patch applied, i.e. a 0.65% increase in binary size.
>>>
>>>
>>> Please note that in my original message, I was referring only to source
>>> code size, i.e.
>>>
>>> git diff --stat master...remove-goo-hash
>>>
>>>  18 files changed, 168 insertions(+), 803 deletions(-)
>>>  
>>>> Cheers,
>>>>
>>>>   Albert
>>>
>>> Best regards, Adam.
>>>
>>>>> Best regards, Adam.
>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>> _______________________________________________
>>> poppler mailing list
>>> poppler at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> 
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 525 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180221/e6bb5961/attachment.sig>


More information about the poppler mailing list