[poppler] [RFC] Replace GooHash by std::unordered_map

Adam Reichold adam.reichold at t-online.de
Sat Mar 3 19:51:55 UTC 2018


Hello Albert,

please note that in the previous patch, I made a mistake with the move
constructor (the ref count should not be moved/swapped since it is tied
to the instance, not its value). Attached is a fixed patch.

I found the issue during further performance tests which however only
continued to strengthen to zero sum result. In any case, I extend the
Python-based perftest script with a C++-based driver to be able to
reliably measure single-page runtimes for this, which I will also attach
if anyone is interested in this.

Best regards, Adam.

Am 03.03.2018 um 20:29 schrieb Albert Astals Cid:
> Ha! didn't realize you had already done all that, will read/answer the other 
> email.
> 
> Cheers,
>   Albert
> 
> El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid va 
> escriure:
>> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
>>
>> escriure:
>>> Hello again,
>>>
>>> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
>>>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
>>>>
>>>> escriure:
>>>>> Hello again,
>>>>>
>>>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
>>>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
>>>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam Reichold
>>>>>>> va
>>>>>>>
>>>>>>> escriure:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> the attached patch replaced Poppler's internal hash table
>>>>>>>> implementation
>>>>>>>> GooHash by std::unordered_map found in the C++ standard library
>>>>>>>> since
>>>>>>>> C++11. This continues Poppler's slow drift towards standard library
>>>>>>>> containers and removes one of the home-grown data structures with
>>>>>>>> the
>>>>>>>> main goals of reducing code size and improving the long term
>>>>>>>> maintainability of the code base.
>>>>>>>
>>>>>>> Do you have any benchmarks on "rendering" speed and code size?
>>>>>>
>>>>>> Sorry, not at hand. I will try to prepare them during the week.
>>>>>
>>>>> I did run Splash rendering benchmarks of this branch against master
>>>>> with
>>>>> the result of rendering the circa 2400 documents of the TeXLive
>>>>
>>>>> documentation present on my machine being:
>>>> I'm wondering if those 2400 documents are diverse enough, which they may
>>>> not be given they are all coming from "the same place".
>>>
>>> They seem pretty diverse w.r.t. content, some being text heavy and some
>>> graphics rich. But I guess they are definitely not diverse technically
>>> as all are prepared using TeX itself.
>>>
>>> The main problem on my side is that I have failed to find my DVD copy of
>>> the Poppler regtest document collection until now. :-\ In any case, I am
>>> reasonably confident on the zero sum result since GooHash does not seem
>>> to live in any performance critical code path.
>>>
>>>>> Cumulative run time:
>>>>>         Result: 90.95 min ∓ 1.1 %
>>>>>         Reference: 91.57 min ∓ 1.2 %
>>>>>         Deviation: -0.0 %
>>>>>
>>>>> Cumulative memory usage:
>>>>>         Result: 37.2 MB ∓ 0.7 %
>>>>>         Reference: 37.0 MB ∓ 0.7 %
>>>>>         Deviation: +0.0 %
>>>>>
>>>>> (Where result is this patch and the reference is master.) (The
>>>>> measurement was taken using the perftest script which I proposed here
>>>>> some time ago for which I'll attach the patch again for
>>>>> reproduceability.)
>>>>
>>>> Did you really attach this before? i really don't remember it :D
>>>
>>> This was a very long-winded thread ending on 2016-01-01 centered around
>>> the regtest framework. It went from custom driver using QTest's
>>> benchmark functionality to custom backend in the regtest framework to
>>> separate Python script. The script still has problems to reliably
>>> measure really short things like running "pdftotext" for which a custom
>>> C++ driver might be needed, but for long-running stuff like "pdftoppm",
>>> the results seem reasonable.
>>>
>>>> Anyhow it seems the change is mostly a zero-sum runtime wise.
>>>
>>> Indeed. (Although thinking really really long term, I think a Poppler
>>> that is using std::vector, std::string and friends everywhere, could
>>> loose a lot of distributed fat in the form of a lot of memory
>>> indirections and benefit from the highly optimized standard library. But
>>> it just is not a single patch thing to reap any of these benefits.)
>>>
>>>> Which bring us to the code side. First i know it sucks but your spacing
>>>> is
>>>> broken, please check the whole patch
>>>>
>>>> -	nameToGID->removeInt(macGlyphNames[j]);
>>>> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
>>>> +          nameToGID[macGlyphNames[j]] = i;
>>>>
>>>> i could fix it, but it's better (for me) if you do :D
>>>
>>> Definitely for me to fix. The main problem is that the spacing in those
>>> files was already broken and I am unsure whether I should fix the
>>> surrounding spacing even if I am not touching it otherwise. Due to that,
>>> there sometimes is no correct way in the current patch. If you do not
>>> say otherwise, I will try to make at least the touched blocks of code
>>> consistent.
>>
>> Are you sure the spacing is broken? I'd say it's just xpdf weird spacing
>> rules of using 2 soft spaces and then hard tabs at 8.
>>
>>>> Now onto the code,
>>>>
>>>>   const auto emplaceRangeMap = [&](const char* encodingName, GBool
>>>>   unicodeOut,>
>>>>
>>>> UnicodeMapRange* ranges, int len) {
>>>>
>>>>     residentUnicodeMaps.emplace(
>>>>     
>>>>       std::piecewise_construct,
>>>>       std::forward_as_tuple(encodingName),
>>>>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
>>>>     
>>>>     );
>>>>   
>>>>   };
>>>>
>>>> It's the first time in my life i see std::piecewise_construct and
>>>> std::forward_as_tuple, yes that probably makes me a bad C++ developer,
>>>> but
>>>> given there's lots of us around, it doesn't make me happy now i don't
>>>> understand what the code does.
>>>
>>> The problem is that most internal Poppler types lack proper construction
>>> and assignment operators, hence I need to work harder to construct those
>>> UnicodeMap instances in-place inside the map (GooHash just stored a
>>> pointer to it, so there was no problem.)
>>>
>>> The whole piecewise_construct and forward_as_tuple dance just ensures,
>>> that those parameters to emplace are properly grouped before being
>>> unpacked to become the parameters of std::string and UnicodeMap
>>> construction again. If UnicodeMap was movable, this would probably look
>>> like
>>>
>>> residentUnicodeMaps.emplace(
>>>
>>>   encodingName,
>>>   UnicodeMap{encodingName, unicodeOut, ranges, len}
>>>
>>> );
>>>
>>> If you like, I can try to make Unicode a move-only type and simplify the
>>> mentioned helper functions?
>>
>> you can give it a try, not sure how easy that's going to be.
>>
>>>> Then there's the benefit/risk ratio. The code using GooHash is mostly
>>>> rocksolid and we haven't probably touched any line in it for a long time
>>>> and we have probably neither written new code that uses GooHash.
>>>>
>>>> In that regard we risk breaking working code.
>>>>
>>>> The benefit is not more speed nor less memory usage as your measurements
>>>> show.
>>>>
>>>> Mostly the benefit is "removing code from maintainership", which i agree
>>>> is a good thing, but as mentioned before it's some code "we mostly
>>>> ignore", so it's not that much of a good thing.
>>>
>>> I very much agree with the risk assessment.
>>>
>>> But I also think the code will ossify (or maybe already is?) due to
>>> those custom data structures and the less than idiomatic C++ usage.
>>> Hence I think, Poppler would not just loose code, but the remaining code
>>> should become easier to maintain. (Of course, the piecewise_construct
>>> fiasco shows that this has intermediate costs. But again, I think this
>>> is just an incentive to provide types with the usual C++ semantics which
>>> should make all code using those types better.)
>>
>> Yeah, i guess you're right.
>>
>> Cheers,
>>   Albert
>>
>>> Best regards, Adam.
>>>
>>>> So i'm hesitant of what to do. It really sounds like a nice thing to not
>>>> have custom structures, but on the other hand it's not like they get
>>>> much
>>>> in the way of coding.
>>>>
>>>> I'd really appreciate other people's comments on this.
>>>>
>>>> Cheers,
>>>>
>>>>   Albert
>>>>>
>>>>> I'll also attach the detailed comparison, but the gist seems to be that
>>>>> if there are significant changes, the run time is reduced but the
>>>>> memory
>>>>> usage is increased in the majority of cases. But this does not seem to
>>>>> show up in the cumulative results.
>>>>>
>>>>> Best regards, Adam.
>>>>>
>>>>> P.S.: One could try to improve the memory usage by tuning the load
>>>>> factor or calling shrink_to_fit where appropriate. Would you like me to
>>>>> try to do this?
>>>>>
>>>>> P.P.S.: One obvious area for improvement would be better
>>>>> interoperability between GooString and std::string, i.e. not converting
>>>>> them as C strings so that the length information does not need to be
>>>>> recomputed. I will try to prepare this as a separate patch on top of
>>>>> this one or should I include this here?
>>>>>
>>>>> Best regards, Adam.
>>>>>
>>>>>> Concerning code size, a release build of libpoppler.so goes from
>>>>>>
>>>>>>    text    data     bss     dec     hex filename
>>>>>>
>>>>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
>>>>>>
>>>>>> for the current master to
>>>>>>
>>>>>>    text    data     bss     dec     hex filename
>>>>>>
>>>>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
>>>>>>
>>>>>> with the patch applied, i.e. a 0.65% increase in binary size.
>>>>>>
>>>>>>
>>>>>> Please note that in my original message, I was referring only to
>>>>>> source
>>>>>> code size, i.e.
>>>>>>
>>>>>> git diff --stat master...remove-goo-hash
>>>>>>
>>>>>>  18 files changed, 168 insertions(+), 803 deletions(-)
>>>>>>  
>>>>>>> Cheers,
>>>>>>>
>>>>>>>   Albert
>>>>>>
>>>>>> Best regards, Adam.
>>>>>>
>>>>>>>> Best regards, Adam.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> poppler mailing list
>>>>>>> poppler at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>
>>>>>> _______________________________________________
>>>>>> poppler mailing list
>>>>>> poppler at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> 
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add-perftest-v2.patch
Type: text/x-patch
Size: 19056 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180303/b88e7f4c/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: replace-goo-hash-v3.patch
Type: text/x-patch
Size: 59643 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180303/b88e7f4c/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 525 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180303/b88e7f4c/attachment-0001.sig>


More information about the poppler mailing list