[poppler] [RFC] Replace GooHash by std::unordered_map
Adam Reichold
adam.reichold at t-online.de
Wed Feb 21 19:19:25 UTC 2018
Hello again,
attached is an updated version of the patch that:
* Improves the conversions between GooString and std::string.
* Fixes the spacing (And yes, the spacing wasn't broken. I was broken. I
just can't get my head around 2 spaces indentation with an 8 spaces tab.
Sorry for that.)
* Extends UnicodeMap to become a move-only type so that the
initialization can stay basically the same as when using GooHash storing
pointers to UnicodeMap.
Best regards, ADam.
Am 21.02.2018 um 07:13 schrieb Adam Reichold:
> Hello again,
>
> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
>> escriure:
>>> Hello again,
>>>
>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam Reichold va
>>>>>
>>>>> escriure:
>>>>>> Hello,
>>>>>>
>>>>>> the attached patch replaced Poppler's internal hash table implementation
>>>>>> GooHash by std::unordered_map found in the C++ standard library since
>>>>>> C++11. This continues Poppler's slow drift towards standard library
>>>>>> containers and removes one of the home-grown data structures with the
>>>>>> main goals of reducing code size and improving the long term
>>>>>> maintainability of the code base.
>>>>>
>>>>> Do you have any benchmarks on "rendering" speed and code size?
>>>>
>>>> Sorry, not at hand. I will try to prepare them during the week.
>>>
>>> I did run Splash rendering benchmarks of this branch against master with
>>> the result of rendering the circa 2400 documents of the TeXLive
>>> documentation present on my machine being:
>>
>> I'm wondering if those 2400 documents are diverse enough, which they may not
>> be given they are all coming from "the same place".
>
> They seem pretty diverse w.r.t. content, some being text heavy and some
> graphics rich. But I guess they are definitely not diverse technically
> as all are prepared using TeX itself.
>
> The main problem on my side is that I have failed to find my DVD copy of
> the Poppler regtest document collection until now. :-\ In any case, I am
> reasonably confident on the zero sum result since GooHash does not seem
> to live in any performance critical code path.
>
>>>
>>> Cumulative run time:
>>> Result: 90.95 min ∓ 1.1 %
>>> Reference: 91.57 min ∓ 1.2 %
>>> Deviation: -0.0 %
>>> Cumulative memory usage:
>>> Result: 37.2 MB ∓ 0.7 %
>>> Reference: 37.0 MB ∓ 0.7 %
>>> Deviation: +0.0 %
>>>
>>> (Where result is this patch and the reference is master.) (The
>>> measurement was taken using the perftest script which I proposed here
>>> some time ago for which I'll attach the patch again for reproduceability.)
>>
>> Did you really attach this before? i really don't remember it :D
>
> This was a very long-winded thread ending on 2016-01-01 centered around
> the regtest framework. It went from custom driver using QTest's
> benchmark functionality to custom backend in the regtest framework to
> separate Python script. The script still has problems to reliably
> measure really short things like running "pdftotext" for which a custom
> C++ driver might be needed, but for long-running stuff like "pdftoppm",
> the results seem reasonable.
>
>> Anyhow it seems the change is mostly a zero-sum runtime wise.
>
> Indeed. (Although thinking really really long term, I think a Poppler
> that is using std::vector, std::string and friends everywhere, could
> loose a lot of distributed fat in the form of a lot of memory
> indirections and benefit from the highly optimized standard library. But
> it just is not a single patch thing to reap any of these benefits.)
>
>> Which bring us to the code side. First i know it sucks but your spacing is
>> broken, please check the whole patch
>>
>> - nameToGID->removeInt(macGlyphNames[j]);
>> - nameToGID->add(new GooString(macGlyphNames[j]), i);
>> + nameToGID[macGlyphNames[j]] = i;
>>
>> i could fix it, but it's better (for me) if you do :D
>
> Definitely for me to fix. The main problem is that the spacing in those
> files was already broken and I am unsure whether I should fix the
> surrounding spacing even if I am not touching it otherwise. Due to that,
> there sometimes is no correct way in the current patch. If you do not
> say otherwise, I will try to make at least the touched blocks of code
> consistent.
>
>> Now onto the code,
>>
>> const auto emplaceRangeMap = [&](const char* encodingName, GBool unicodeOut,
>> UnicodeMapRange* ranges, int len) {
>> residentUnicodeMaps.emplace(
>> std::piecewise_construct,
>> std::forward_as_tuple(encodingName),
>> std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
>> );
>> };
>>
>> It's the first time in my life i see std::piecewise_construct and
>> std::forward_as_tuple, yes that probably makes me a bad C++ developer, but
>> given there's lots of us around, it doesn't make me happy now i don't
>> understand what the code does.
>
> The problem is that most internal Poppler types lack proper construction
> and assignment operators, hence I need to work harder to construct those
> UnicodeMap instances in-place inside the map (GooHash just stored a
> pointer to it, so there was no problem.)
>
> The whole piecewise_construct and forward_as_tuple dance just ensures,
> that those parameters to emplace are properly grouped before being
> unpacked to become the parameters of std::string and UnicodeMap
> construction again. If UnicodeMap was movable, this would probably look like
>
> residentUnicodeMaps.emplace(
> encodingName,
> UnicodeMap{encodingName, unicodeOut, ranges, len}
> );
>
> If you like, I can try to make Unicode a move-only type and simplify the
> mentioned helper functions?
>
>> Then there's the benefit/risk ratio. The code using GooHash is mostly
>> rocksolid and we haven't probably touched any line in it for a long time and
>> we have probably neither written new code that uses GooHash.
>>
>> In that regard we risk breaking working code.
>>
>> The benefit is not more speed nor less memory usage as your measurements show.
>>
>> Mostly the benefit is "removing code from maintainership", which i agree is a
>> good thing, but as mentioned before it's some code "we mostly ignore", so it's
>> not that much of a good thing.
>
> I very much agree with the risk assessment.
>
> But I also think the code will ossify (or maybe already is?) due to
> those custom data structures and the less than idiomatic C++ usage.
> Hence I think, Poppler would not just loose code, but the remaining code
> should become easier to maintain. (Of course, the piecewise_construct
> fiasco shows that this has intermediate costs. But again, I think this
> is just an incentive to provide types with the usual C++ semantics which
> should make all code using those types better.)
>
> Best regards, Adam.
>
>> So i'm hesitant of what to do. It really sounds like a nice thing to not have
>> custom structures, but on the other hand it's not like they get much in the
>> way of coding.
>>
>> I'd really appreciate other people's comments on this.
>>
>> Cheers,
>> Albert
>>
>>>
>>> I'll also attach the detailed comparison, but the gist seems to be that
>>> if there are significant changes, the run time is reduced but the memory
>>> usage is increased in the majority of cases. But this does not seem to
>>> show up in the cumulative results.
>>>
>>> Best regards, Adam.
>>>
>>> P.S.: One could try to improve the memory usage by tuning the load
>>> factor or calling shrink_to_fit where appropriate. Would you like me to
>>> try to do this?
>>>
>>> P.P.S.: One obvious area for improvement would be better
>>> interoperability between GooString and std::string, i.e. not converting
>>> them as C strings so that the length information does not need to be
>>> recomputed. I will try to prepare this as a separate patch on top of
>>> this one or should I include this here?
>>>
>>> Best regards, Adam.
>>>
>>>> Concerning code size, a release build of libpoppler.so goes from
>>>>
>>>> text data bss dec hex filename
>>>>
>>>> 2464034 288852 360 2753246 2a02de libpoppler.so.73.0.0
>>>>
>>>> for the current master to
>>>>
>>>> text data bss dec hex filename
>>>>
>>>> 2482129 288756 360 2771245 2a492d libpoppler.so.73.0.0
>>>>
>>>> with the patch applied, i.e. a 0.65% increase in binary size.
>>>>
>>>>
>>>> Please note that in my original message, I was referring only to source
>>>> code size, i.e.
>>>>
>>>> git diff --stat master...remove-goo-hash
>>>>
>>>> 18 files changed, 168 insertions(+), 803 deletions(-)
>>>>
>>>>> Cheers,
>>>>>
>>>>> Albert
>>>>
>>>> Best regards, Adam.
>>>>
>>>>>> Best regards, Adam.
>>>>>
>>>>> _______________________________________________
>>>>> poppler mailing list
>>>>> poppler at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>
>>
>>
>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>
>
>
>
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: remove-goo-hash-v2.patch
Type: text/x-patch
Size: 59752 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180221/edf5e360/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 525 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180221/edf5e360/attachment-0001.sig>
More information about the poppler
mailing list