[pulseaudio-discuss] RFC: pa_database to adopt hdf5?

Jason Newton nevion at gmail.com
Mon Jun 17 22:32:11 PDT 2013


Hi,

I've been using HDF5 [ http://www.hdfgroup.org/HDF5/whatishdf5.html ]
a lot in my own applications recently, and I thought pulseaudio may
have overlooked this library for its own problems, given HDF's
scientific-field origins.  Specifically, the problem it could solve in
pulseaudio is storing structured data on disk.

HDF5 is very flexible, and most usage of it results in highly
structured data models that can be read and written pretty much
anywhere, on any architecture or platform, transparently to the
developer.  This covers everything from simple arrays to arbitrary
structures and even arbitrary-length bitfields.  It also allows
attributes to be stored on every object, and the disk format is
decoupled from the in-memory data model, so one can, for instance,
read selected subfields of a stored structure.  Additionally, it
offers per-dataset compression (pluggable; gzip is built in) and lets
you set the precision of the stored types so that disk space is
minimized, while the library handles the bit packing/unpacking
internally and the in-memory model stays simple primitives (e.g.
unsigned int) or POD structures.
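
To make that concrete, here is a minimal, untested sketch of writing
an array of POD structures as an HDF5 compound type with shuffle+gzip
compression; the struct, field names and file name are made up for
illustration, not anything pulseaudio stores today:

    #include <hdf5.h>

    /* Illustrative POD struct, mirrored field-by-field by an HDF5
     * compound type. */
    typedef struct {
        unsigned int channel;
        float        volume;
    } entry_t;

    static herr_t write_entries(const char *path, const entry_t *e, hsize_t n) {
        hid_t file  = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, &n, NULL);

        /* Describe the in-memory struct so HDF5 can (de)serialize it. */
        hid_t type  = H5Tcreate(H5T_COMPOUND, sizeof(entry_t));
        H5Tinsert(type, "channel", HOFFSET(entry_t, channel), H5T_NATIVE_UINT);
        H5Tinsert(type, "volume",  HOFFSET(entry_t, volume),  H5T_NATIVE_FLOAT);

        /* Per-dataset compression; filters require a chunked layout. */
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, &n);
        H5Pset_shuffle(dcpl);     /* byte-shuffle, then deflate */
        H5Pset_deflate(dcpl, 6);

        hid_t dset  = H5Dcreate2(file, "entries", type, space,
                                 H5P_DEFAULT, dcpl, H5P_DEFAULT);
        herr_t r    = H5Dwrite(dset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, e);

        H5Dclose(dset); H5Pclose(dcpl); H5Tclose(type);
        H5Sclose(space); H5Fclose(file);
        return r;
    }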

It could also allow consolidation of the existing db files in
~/.pulse, because an HDF5 file acts pretty much as a filesystem
itself, allowing hierarchical grouping of different objects: datasets,
groups, data types, and links (like symbolic links).  This would let
each module be given its own path inside a single hdf5 file (see the
sketch below), if that is a worthy simplification to make over the
existing spread of files.  While disk space is not a problem on my
machine, I would expect this to reduce disk requirements by at least a
few KiB.
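
Roughly, with group names made up purely for illustration, the
per-module layout could look like:

    /* One file, one group per module; each module then creates its
     * datasets under its own group. Names are hypothetical. */
    hid_t file = H5Fcreate("pulse.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t eq   = H5Gcreate2(file, "/module-equalizer-sink",
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dev  = H5Gcreate2(file, "/module-device-restore",
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    /* ... per-module datasets go here ... */
    H5Gclose(dev); H5Gclose(eq); H5Fclose(file);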

Additionally, external tools like matlab/pytables can read this
structured data and make sense of it without the user having to do
much work beyond opening the object paths, and one can use HDFView to
quickly inspect the entire contents of an hdf5 file.

Also, I'll mention that HDF5 uses a global lock to stay threadsafe:
the API does allow concurrent writes to the same file, but because of
the lock it does not scale very well, which is a frequent point of
confusion.  For settings-sized I/O, however, scalability is not an
issue.

Adoption of such a library brings up some interesting points, however:

-Would one want to insulate developers from the HDF5 API?  Given its
flexibility, a wrapper would have to re-expose a fair amount of the
API to preserve many of HDF5's features.

-One could instead add a backend to the existing pulse database API,
but the data would then be stored as unstructured opaque blobs,
negating a number of the benefits of adopting HDF5 in the first place,
while preserving the old API (see the sketch below).
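
For the second option, a rough, untested sketch of the write path of
such a backend might look like the following; it assumes the datum is
a plain (data, size) pair as in pulsecore/database.h, and omits error
handling and overwriting of existing keys:

    /* Store a key/value pair as an opaque byte dataset. This keeps
     * the old pa_database semantics but loses HDF5's structure. */
    static herr_t db_set(hid_t file, const char *key,
                         const void *data, size_t size) {
        hsize_t n   = size;
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t dset  = H5Dcreate2(file, key, H5T_NATIVE_UCHAR, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        herr_t r    = H5Dwrite(dset, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL,
                               H5P_DEFAULT, data);
        H5Dclose(dset); H5Sclose(space);
        return r;
    }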


An example for me would be compressing the relatively fat equalizer
files (gzip at the default level got about 6.8x on my machine; I'd
expect better in HDF5 with things like the shuffle filter).  It would
also allow users to trade equalizer files between machines safely,
along with additional metadata enabling transparent conversion of
stored presets from one sampling rate to another - the lack of
tradable presets is the most frequent criticism I've received from
users.
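
A stored preset could look something like this sketch; the group,
dataset and attribute names are hypothetical, and an open file handle
plus a coefs[nbands] float array are assumed:

    /* Coefficients compressed per-dataset, plus a sample-rate
     * attribute so a loader can resample a preset recorded at a
     * different rate. */
    hid_t grp   = H5Gcreate2(file, "presets",
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t n   = nbands;
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &n);
    H5Pset_shuffle(dcpl);          /* byte-shuffle helps floats deflate */
    H5Pset_deflate(dcpl, 9);
    hid_t dset  = H5Dcreate2(grp, "rock", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, coefs);

    /* Record the rate the preset was made at. */
    unsigned rate  = 44100;
    hid_t    ascal = H5Screate(H5S_SCALAR);
    hid_t    attr  = H5Acreate2(dset, "sample-rate", H5T_NATIVE_UINT, ascal,
                                H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_UINT, &rate);
    H5Aclose(attr); H5Sclose(ascal); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(space); H5Gclose(grp);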


-Jason

