[poppler] [RFC] Replace GooHash by std::unordered_map

Adam Reichold adam.reichold at t-online.de
Wed May 2 17:11:19 UTC 2018


Hello,

Am 01.05.2018 um 23:57 schrieb Albert Astals Cid:
> El diumenge, 4 de març de 2018, a les 22:59:20 CEST, Albert Astals Cid va 
> escriure:
>> El dissabte, 3 de març de 2018, a les 20:51:55 CET, Adam Reichold va 
> escriure:
>>> Hello Albert,
>>>
>>> please note that in the previous patch, I made a mistake with the move
>>> constructor (the ref count should not be moved/swapped since it is tied
>>> to the instance, not its value). Attached is a fixed patch.
>>>
>>> I found the issue during further performance tests which however only
>>> continued to strengthen to zero sum result. In any case, I extend the
>>> Python-based perftest script with a C++-based driver to be able to
>>> reliably measure single-page runtimes for this, which I will also attach
>>> if anyone is interested in this.
>>
>> Ok, so it seems there's a mild agreement (or at least no strong
>> disagreement) that this may be the way forward. I will first wait to get
>> poppler 0.63 out of the door (that is already late several months, between
>> freedesktop shutting down ssh access, them forgetting to tell us it was
>> open again, my disk dying and then me procastinating as usual) and once
>> it's out I'll come back and reevaluate this.
> 
> I was trying to apply the patch for running a quick regtest but running git am 
> over it gave me some warning/error I wasn't sure how to get out of correctly.
> 
> Could you try rebasing the patch on top of current master?
> 
> Cheers,
>   Albert

Attached is a version rebased against the current master.

Best regards,
Adam

>>
>> Cheers,
>>   Albert
>>
>>> Best regards, Adam.
>>>
>>> Am 03.03.2018 um 20:29 schrieb Albert Astals Cid:
>>>> Ha! didn't realize you had already done all that, will read/answer the
>>>> other email.
>>>>
>>>> Cheers,
>>>>
>>>>   Albert
>>>>
>>>> El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid va
>>>>
>>>> escriure:
>>>>> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
>>>>>
>>>>> escriure:
>>>>>> Hello again,
>>>>>>
>>>>>> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
>>>>>>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
>>>>>>>
>>>>>>> escriure:
>>>>>>>> Hello again,
>>>>>>>>
>>>>>>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
>>>>>>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
>>>>>>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam
>>>>>>>>>> Reichold
>>>>>>>>>> va
>>>>>>>>>>
>>>>>>>>>> escriure:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> the attached patch replaced Poppler's internal hash table
>>>>>>>>>>> implementation
>>>>>>>>>>> GooHash by std::unordered_map found in the C++ standard library
>>>>>>>>>>> since
>>>>>>>>>>> C++11. This continues Poppler's slow drift towards standard
>>>>>>>>>>> library
>>>>>>>>>>> containers and removes one of the home-grown data structures with
>>>>>>>>>>> the
>>>>>>>>>>> main goals of reducing code size and improving the long term
>>>>>>>>>>> maintainability of the code base.
>>>>>>>>>>
>>>>>>>>>> Do you have any benchmarks on "rendering" speed and code size?
>>>>>>>>>
>>>>>>>>> Sorry, not at hand. I will try to prepare them during the week.
>>>>>>>>
>>>>>>>> I did run Splash rendering benchmarks of this branch against master
>>>>>>>> with
>>>>>>>> the result of rendering the circa 2400 documents of the TeXLive
>>>>>>>
>>>>>>>> documentation present on my machine being:
>>>>>>> I'm wondering if those 2400 documents are diverse enough, which they
>>>>>>> may
>>>>>>> not be given they are all coming from "the same place".
>>>>>>
>>>>>> They seem pretty diverse w.r.t. content, some being text heavy and
>>>>>> some
>>>>>> graphics rich. But I guess they are definitely not diverse technically
>>>>>> as all are prepared using TeX itself.
>>>>>>
>>>>>> The main problem on my side is that I have failed to find my DVD copy
>>>>>> of
>>>>>> the Poppler regtest document collection until now. :-\ In any case, I
>>>>>> am
>>>>>> reasonably confident on the zero sum result since GooHash does not
>>>>>> seem
>>>>>> to live in any performance critical code path.
>>>>>>
>>>>>>>> Cumulative run time:
>>>>>>>>         Result: 90.95 min ∓ 1.1 %
>>>>>>>>         Reference: 91.57 min ∓ 1.2 %
>>>>>>>>         Deviation: -0.0 %
>>>>>>>>
>>>>>>>> Cumulative memory usage:
>>>>>>>>         Result: 37.2 MB ∓ 0.7 %
>>>>>>>>         Reference: 37.0 MB ∓ 0.7 %
>>>>>>>>         Deviation: +0.0 %
>>>>>>>>
>>>>>>>> (Where result is this patch and the reference is master.) (The
>>>>>>>> measurement was taken using the perftest script which I proposed
>>>>>>>> here
>>>>>>>> some time ago for which I'll attach the patch again for
>>>>>>>> reproduceability.)
>>>>>>>
>>>>>>> Did you really attach this before? i really don't remember it :D
>>>>>>
>>>>>> This was a very long-winded thread ending on 2016-01-01 centered
>>>>>> around
>>>>>> the regtest framework. It went from custom driver using QTest's
>>>>>> benchmark functionality to custom backend in the regtest framework to
>>>>>> separate Python script. The script still has problems to reliably
>>>>>> measure really short things like running "pdftotext" for which a
>>>>>> custom
>>>>>> C++ driver might be needed, but for long-running stuff like
>>>>>> "pdftoppm",
>>>>>> the results seem reasonable.
>>>>>>
>>>>>>> Anyhow it seems the change is mostly a zero-sum runtime wise.
>>>>>>
>>>>>> Indeed. (Although thinking really really long term, I think a Poppler
>>>>>> that is using std::vector, std::string and friends everywhere, could
>>>>>> loose a lot of distributed fat in the form of a lot of memory
>>>>>> indirections and benefit from the highly optimized standard library.
>>>>>> But
>>>>>> it just is not a single patch thing to reap any of these benefits.)
>>>>>>
>>>>>>> Which bring us to the code side. First i know it sucks but your
>>>>>>> spacing
>>>>>>> is
>>>>>>> broken, please check the whole patch
>>>>>>>
>>>>>>> -	nameToGID->removeInt(macGlyphNames[j]);
>>>>>>> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
>>>>>>> +          nameToGID[macGlyphNames[j]] = i;
>>>>>>>
>>>>>>> i could fix it, but it's better (for me) if you do :D
>>>>>>
>>>>>> Definitely for me to fix. The main problem is that the spacing in
>>>>>> those
>>>>>> files was already broken and I am unsure whether I should fix the
>>>>>> surrounding spacing even if I am not touching it otherwise. Due to
>>>>>> that,
>>>>>> there sometimes is no correct way in the current patch. If you do not
>>>>>> say otherwise, I will try to make at least the touched blocks of code
>>>>>> consistent.
>>>>>
>>>>> Are you sure the spacing is broken? I'd say it's just xpdf weird
>>>>> spacing
>>>>> rules of using 2 soft spaces and then hard tabs at 8.
>>>>>
>>>>>>> Now onto the code,
>>>>>>>
>>>>>>>   const auto emplaceRangeMap = [&](const char* encodingName, GBool
>>>>>>>   unicodeOut,>
>>>>>>>
>>>>>>> UnicodeMapRange* ranges, int len) {
>>>>>>>
>>>>>>>     residentUnicodeMaps.emplace(
>>>>>>>     
>>>>>>>       std::piecewise_construct,
>>>>>>>       std::forward_as_tuple(encodingName),
>>>>>>>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
>>>>>>>     
>>>>>>>     );
>>>>>>>   
>>>>>>>   };
>>>>>>>
>>>>>>> It's the first time in my life i see std::piecewise_construct and
>>>>>>> std::forward_as_tuple, yes that probably makes me a bad C++
>>>>>>> developer,
>>>>>>> but
>>>>>>> given there's lots of us around, it doesn't make me happy now i don't
>>>>>>> understand what the code does.
>>>>>>
>>>>>> The problem is that most internal Poppler types lack proper
>>>>>> construction
>>>>>> and assignment operators, hence I need to work harder to construct
>>>>>> those
>>>>>> UnicodeMap instances in-place inside the map (GooHash just stored a
>>>>>> pointer to it, so there was no problem.)
>>>>>>
>>>>>> The whole piecewise_construct and forward_as_tuple dance just ensures,
>>>>>> that those parameters to emplace are properly grouped before being
>>>>>> unpacked to become the parameters of std::string and UnicodeMap
>>>>>> construction again. If UnicodeMap was movable, this would probably
>>>>>> look
>>>>>> like
>>>>>>
>>>>>> residentUnicodeMaps.emplace(
>>>>>>
>>>>>>   encodingName,
>>>>>>   UnicodeMap{encodingName, unicodeOut, ranges, len}
>>>>>>
>>>>>> );
>>>>>>
>>>>>> If you like, I can try to make Unicode a move-only type and simplify
>>>>>> the
>>>>>> mentioned helper functions?
>>>>>
>>>>> you can give it a try, not sure how easy that's going to be.
>>>>>
>>>>>>> Then there's the benefit/risk ratio. The code using GooHash is mostly
>>>>>>> rocksolid and we haven't probably touched any line in it for a long
>>>>>>> time
>>>>>>> and we have probably neither written new code that uses GooHash.
>>>>>>>
>>>>>>> In that regard we risk breaking working code.
>>>>>>>
>>>>>>> The benefit is not more speed nor less memory usage as your
>>>>>>> measurements
>>>>>>> show.
>>>>>>>
>>>>>>> Mostly the benefit is "removing code from maintainership", which i
>>>>>>> agree
>>>>>>> is a good thing, but as mentioned before it's some code "we mostly
>>>>>>> ignore", so it's not that much of a good thing.
>>>>>>
>>>>>> I very much agree with the risk assessment.
>>>>>>
>>>>>> But I also think the code will ossify (or maybe already is?) due to
>>>>>> those custom data structures and the less than idiomatic C++ usage.
>>>>>> Hence I think, Poppler would not just loose code, but the remaining
>>>>>> code
>>>>>> should become easier to maintain. (Of course, the piecewise_construct
>>>>>> fiasco shows that this has intermediate costs. But again, I think this
>>>>>> is just an incentive to provide types with the usual C++ semantics
>>>>>> which
>>>>>> should make all code using those types better.)
>>>>>
>>>>> Yeah, i guess you're right.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>   Albert
>>>>>>
>>>>>> Best regards, Adam.
>>>>>>
>>>>>>> So i'm hesitant of what to do. It really sounds like a nice thing to
>>>>>>> not
>>>>>>> have custom structures, but on the other hand it's not like they get
>>>>>>> much
>>>>>>> in the way of coding.
>>>>>>>
>>>>>>> I'd really appreciate other people's comments on this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>   Albert
>>>>>>>>
>>>>>>>> I'll also attach the detailed comparison, but the gist seems to be
>>>>>>>> that
>>>>>>>> if there are significant changes, the run time is reduced but the
>>>>>>>> memory
>>>>>>>> usage is increased in the majority of cases. But this does not seem
>>>>>>>> to
>>>>>>>> show up in the cumulative results.
>>>>>>>>
>>>>>>>> Best regards, Adam.
>>>>>>>>
>>>>>>>> P.S.: One could try to improve the memory usage by tuning the load
>>>>>>>> factor or calling shrink_to_fit where appropriate. Would you like me
>>>>>>>> to
>>>>>>>> try to do this?
>>>>>>>>
>>>>>>>> P.P.S.: One obvious area for improvement would be better
>>>>>>>> interoperability between GooString and std::string, i.e. not
>>>>>>>> converting
>>>>>>>> them as C strings so that the length information does not need to be
>>>>>>>> recomputed. I will try to prepare this as a separate patch on top of
>>>>>>>> this one or should I include this here?
>>>>>>>>
>>>>>>>> Best regards, Adam.
>>>>>>>>
>>>>>>>>> Concerning code size, a release build of libpoppler.so goes from
>>>>>>>>>
>>>>>>>>>    text    data     bss     dec     hex filename
>>>>>>>>>
>>>>>>>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
>>>>>>>>>
>>>>>>>>> for the current master to
>>>>>>>>>
>>>>>>>>>    text    data     bss     dec     hex filename
>>>>>>>>>
>>>>>>>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
>>>>>>>>>
>>>>>>>>> with the patch applied, i.e. a 0.65% increase in binary size.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please note that in my original message, I was referring only to
>>>>>>>>> source
>>>>>>>>> code size, i.e.
>>>>>>>>>
>>>>>>>>> git diff --stat master...remove-goo-hash
>>>>>>>>>
>>>>>>>>>  18 files changed, 168 insertions(+), 803 deletions(-)
>>>>>>>>>  
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>>   Albert
>>>>>>>>>
>>>>>>>>> Best regards, Adam.
>>>>>>>>>
>>>>>>>>>>> Best regards, Adam.
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> poppler mailing list
>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> poppler mailing list
>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> poppler mailing list
>>>>>>> poppler at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>
>>>>> _______________________________________________
>>>>> poppler mailing list
>>>>> poppler at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> 
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: remove-goo-hash-rebased.diff
Type: text/x-patch
Size: 60005 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180502/5e9952e1/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180502/5e9952e1/attachment-0001.sig>


More information about the poppler mailing list