[pulseaudio-discuss] RFC: pa_database to adopt hdf5?

Mon Jun 24 20:49:01 PDT 2013

On Mon, Jun 24, 2013 at 7:59 AM, Tanu Kaskinen
<tanu.kaskinen at linux.intel.com> wrote:
>
>
> According to [1], libhdf5 installed size is multiple megabytes, so any
> disk space savings would be negated by the necessity to install the new
> library.
>
> [1] http://packages.debian.org/jessie/libhdf5-7

You definitely have a point - it is a hefty C library, and it's quite
likely that it's not going to be a shared dependency in most settings.
tdb / gdbm are lighter weight, on the order of a tenth of a megabyte
each by comparison, but they have the mentioned shortcoming of storing
only opaque data.

>
> > Additionally, external tools like matlab/pytables can read this
> > structured data and make sense of it without the user having to do
> > much work beyond opening the object paths and one can use the
> > HDFViewer to quickly inspect all the inner contents of an hdf file.
> >
> > Also, I'll mention that HDF5 uses a global lock to stay threadsafe and
> > does allow multiple writes concurrently to the same file from the api,
> > but because of this it does not scale very well, a frequent point of
> > confusion.  IO scalability wrt settings however is not an issue.
> >
> > Adoption of such a library brings up some interesting points, however:
> >
> > -Would one want to isolate the developer between HDF5 api given its
> > flexibility and the coverage of the api one would have to re-pack to
> > expose many of HDFs features?
>
> I didn't comprehend this paragraph.

The data stored within HDF5 files is structured, and the metadata makes
the files self-describing.  This allows external programs such as the
HDF file viewer, matlab, and pytables to work with the files' contents
trivially, compared to trying to work with the opaquely stored blobs
the tuple stores currently provide.  Will this always be of use? I'd
figure no, but much like you've said with ini files, there are some
nice benefits to being able to peek inside these files and change
values.  OTOH hdf5 allows a richer description of the file format than
an ini would.  If that file is of nontrivial complexity, hdf5 will
also win out in size due to its binary-data-oriented nature.
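
To make "opaque" versus "self-describing" concrete, here is a minimal
sketch using only the Python standard library.  It is not HDF5 and not
the pa_database format; the header layout, the field names and the
helpers pack_described / unpack_described are all made up for
illustration.

```python
import json
import struct

coeffs = [0.5, 0.25, 0.125]  # illustrative values, exactly representable as float32

# Opaque: only the bytes are stored. A reader must already know the
# element type, the element count and the byte order.
opaque = struct.pack("<3f", *coeffs)

def pack_described(name, values, sample_rate):
    """Prefix the payload with a small JSON header describing it (hypothetical layout)."""
    header = json.dumps({
        "name": name,
        "dtype": "<f4",
        "count": len(values),
        "sample_rate": sample_rate,
    }).encode("utf-8")
    payload = struct.pack("<%df" % len(values), *values)
    # 4-byte little-endian header length, then the header, then the data.
    return struct.pack("<I", len(header)) + header + payload

def unpack_described(blob):
    """Read the header first, then use it to decode the payload."""
    (hlen,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + hlen].decode("utf-8"))
    values = struct.unpack_from("<%df" % header["count"], blob, 4 + hlen)
    return header, list(values)

header, values = unpack_described(pack_described("preset", coeffs, 44100))
print(header, values)
```

The second form costs a few dozen bytes of header, but a generic tool
can decode it without knowing anything about the program that wrote it.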

The other two points were, first, to address that it is threadsafe (a
frequent point of confusion for this particular library), and second,
to ask whether, given the api complexity of hdf5, one would want to
wrap it in something like the pa_database api.

>
>
> > -One could add a backend to the pulse database api, but this data
> > would then be stored as unstructured opaque data negating a number of
> > benefits to its adoption in the first place, but preserving old api.
> >
> >
> > An example for me would be compressing the relatively fat
> > equalizer files (gzip default did 6.8x time on my machine, I'd assume
> > better in HDF with things like the shuffle filter).
>
> I didn't comprehend this paragraph either. What does "6.8x time" refer
> to?
>
> How big are the equalizer files now?

The first part states that if hdf5 were simply wrapped in pa_database's
current api, the data would still be stored opaquely, which would make
hdf5's usage somewhat pointless.  The data stored within would not be
self-describing but still opaque binary blobs.  This would preserve
the API though.

The second part gives an example: since hdf allows (gzip) compression
and many other transforms on the data it stores, I compressed my
machine's equalizer files with gzip to see what that would gain:

47K     equalizer-presets.tdb   -> 8.0K    equalizer-presets.tdb.gz
(depends on the number of equalizer presets and the sample rate of the
equalizer/sink it's hooked up to)
82K     equalizer-state.tdb     -> 12K     equalizer-state.tdb.gz
(depends on the number of channels and the sample rate of the sink it's
hooked up to; 82/12 -> 6.8x compression)
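
As a rough sketch of why a shuffle filter might improve on plain gzip,
the following uses only the Python standard library on synthetic data.
The values are a made-up smooth curve, not the real equalizer files,
and shuffle / unshuffle are illustrative helpers rather than HDF5
calls, so the printed sizes say nothing definite about the .tdb files
above.

```python
import math
import struct
import zlib

def shuffle(raw: bytes, width: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[i::width] for i in range(width))

def unshuffle(data: bytes, width: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into elements."""
    n = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out)

# Synthetic stand-in for filter data: 4096 float32 values on a smooth curve.
values = [1.0 + 0.5 * math.sin(i / 200.0) for i in range(4096)]
raw = struct.pack("<%df" % len(values), *values)

plain = zlib.compress(raw, 6)                  # deflate, as gzip uses
shuffled = zlib.compress(shuffle(raw, 4), 6)   # byte shuffle, then deflate

print("raw bytes:        %d" % len(raw))
print("deflate:          %d" % len(plain))
print("shuffle+deflate:  %d" % len(shuffled))
```

The idea is that the sign/exponent bytes of neighbouring floats tend to
be similar, so grouping them gives deflate longer runs to work with.
Whether that pays off depends on the data.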

>
>
> > It also would
> > allow users to trade equalizer files from machine to machine safely
> > along with additional metadata to allow transparent conversion from
> > one sampling rate to another for stored presets - lack of tradable
> > presets is the most frequent criticism I've received from users.
>
> The pa_database files indeed aren't portable.
>
> You didn't manage to convince me about the usefulness of the HDF5
> format. We try to keep the mandatory dependencies to minimum, and
> pulling libhdf5 as a dependency doesn't seem to bring very big benefits.
> I'm not entirely happy with pa_database either, though - I would
> personally prefer to use the same "ini-style" format for both state and
> configuration files (the difference between those two categories is
> small and sometimes unclear anyway). The state files tend to be trivial
> in size, so the space inefficiency of text files isn't really a concern.
> The benefits of a text-based format would be transparency/hackability
> and portability. If the equalizer files really are huge (I don't see why
> they would, though), then we still might want to use some other format
> for those files.

Point taken that typical sizes are small.  The number of files could
be reduced to a single file, but even so the space savings are
questionable in light of the nearly 4M hdf5 library.  ini files would
not be a good fit for the current equalizer due to its exact storage of
the filter states.  By the way - and just saying - when I use plain
text storage, yaml seems to be a better experience.  Unlike the ini
file format, it has an actual specification it follows, and it is more
human friendly than json, which it is a superset of:
http://en.wikipedia.org/wiki/YAML#JSON .  One problem you run into
with any plain text storage format, though, is that when floats are in
use, stable round trip conversions of float->text->float are tricky,
especially if you would like input numbers like .1 to serialize back
out as .1 .  There are implementations that solve this though, notably
David M. Gay's floating point routines.  The binary data storage
methods do not have this problem, however.
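
A small sketch of the round trip issue, in Python.  CPython's repr()
for floats produces the shortest string that parses back to the same
double (its implementation is based on David Gay's code, as far as I
know); to_text / from_text are just illustrative names.

```python
import random

def to_text(x: float) -> str:
    # Shortest decimal string that still parses back to exactly x.
    return repr(x)

def from_text(s: str) -> float:
    return float(s)

# 0.1 comes back out as "0.1" (Python prints the leading zero).
assert to_text(0.1) == "0.1"

# Always printing 17 significant digits also round-trips, but is not
# what the user typed.
assert "%.17g" % 0.1 == "0.10000000000000001"

# Printing too few digits looks nice but silently changes the value.
x = 0.1 + 0.2
assert "%.15g" % x == "0.3"
assert from_text("%.15g" % x) != x
assert from_text(to_text(x)) == x

# Spot check: shortest-repr output round-trips for random doubles.
rng = random.Random(0)
for _ in range(10000):
    y = rng.uniform(-1e6, 1e6)
    assert from_text(to_text(y)) == y

print(to_text(0.1), to_text(x))
```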

-Jason