[pulseaudio-discuss] RFC: pa_database to adopt hdf5?

Mon Jun 24 07:59:42 PDT 2013

On Mon, 2013-06-17 at 22:32 -0700, Jason Newton wrote:
> Hi,
> 
> I've been using HDF5 [ http://www.hdfgroup.org/HDF5/whatishdf5.html ]
> in my own applications a lot recently and I thought pulseaudio may have
> overlooked this library for its own problems given HDF's
> scientific-field origins.   Specifically the problem it could solve in
> pulseaudio is the issue of storing structured data to disk.
> 
> HDF5 is very flexible and most usage of it results in highly
> structured data models that can be read and written pretty much
> anywhere on any architecture or platform transparently to the
> developer.  This goes from simple arrays to arbitrary structures and
> even arbitrary length bitfields. It also allows the storing of
> attributes for every object and the disk format is decoupled from the
> data model, such that one can read select subfields of a stored
> structure, for instance. Additionally, it allows per-dataset
> compression (pluggable, a builtin is gzip) and the ability to set the
> precisions of the types such that disk space is minimized but
> packing/unpacking bits is handled internally by the library such that
> the in-memory model remains simple primitives (e.g. unsigned int) or
> POD structures.
> 
> It could also allow for consolidation of the existing db files in
> ~/.pulse because it acts as pretty much a filesystem itself, allowing
> different and hierarchical grouping of objects: datasets, groups, data
> types, links (like symbolic links).   This allows modules to be
> given their own path inside a single hdf5 file, if that is a worthy
> simplification to make over the existing spread of files.  While disk
> space is not a problem on my machine, I would expect this to reduce
> disk requirements by a few k at least.

According to [1], libhdf5 installed size is multiple megabytes, so any
disk space savings would be negated by the necessity to install the new
library.

[1] http://packages.debian.org/jessie/libhdf5-7

> Additionally, external tools like matlab/pytables can read this
> structured data and make sense of it without the user having to do
> much work beyond opening the object paths and one can use the
> HDFViewer to quickly inspect all the inner contents of an hdf file.
> 
> Also, I'll mention that HDF5 uses a global lock to stay threadsafe and
> does allow multiple writes concurrently to the same file from the api,
> but because of this it does not scale very well, a frequent point of
> confusion.  IO scalability wrt settings however is not an issue.
> 
> Adoption of such a library brings up some interesting points, however:
> 
> -Would one want to isolate the developer between HDF5 api given its
> flexibility and the coverage of the api one would have to re-pack to
> expose many of HDFs features?

I didn't comprehend this paragraph.

> -One could add a backend to the pulse database api, but this data
> would then be stored as unstructured opaque data negating a number of
> benefits to its adoption in the first place, but preserving old api.
> 
> 
> An example for me would be compressing the relatively fat
> equalizer files (gzip default did 6.8x time on my machine, I'd assume
> better in HDF with things like the shuffle filter).

I didn't comprehend this paragraph either. What does "6.8x time" refer
to?

How big are the equalizer files now?
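
If "6.8x" is meant as a compression ratio (that is only my guess), the
comparison could be reproduced roughly as in the sketch below. It is
Python, not HDF5 code: zlib stands in for the gzip filter, and the
hand-written shuffle() only imitates the idea of HDF5's shuffle filter
(grouping the first bytes of all values together, then the second bytes,
and so on). The coefficient data is made up, not read from a real
equalizer file.

```python
import math
import struct
import zlib


def shuffle(data: bytes, width: int) -> bytes:
    """Byte-shuffle: gather byte 0 of every element, then byte 1, etc."""
    return b"".join(data[i::width] for i in range(width))


def unshuffle(data: bytes, width: int) -> bytes:
    """Inverse of shuffle(); assumes len(data) is a multiple of width."""
    n = len(data) // width
    return bytes(data[j * n + i] for i in range(n) for j in range(width))


# Hypothetical stand-in for equalizer coefficients: 4096 smooth values
# stored as little-endian 32-bit floats.
coeffs = [math.sin(i / 50.0) for i in range(4096)]
raw = struct.pack("<%df" % len(coeffs), *coeffs)

plain = zlib.compress(raw, 6)
shuffled = zlib.compress(shuffle(raw, 4), 6)

ratio_plain = len(raw) / len(plain)
ratio_shuffled = len(raw) / len(shuffled)

print("raw bytes:           ", len(raw))
print("deflate ratio:        %.2fx" % ratio_plain)
print("shuffle+deflate ratio: %.2fx" % ratio_shuffled)
```

The ratios depend entirely on the data, so numbers from this sketch say
nothing about the real files; running the same measurement on an actual
equalizer file would answer the size question.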

> It also would
> allow users to trade equalizer files from machine to machine safely
> along with additional metadata to allow transparent conversion from
> one sampling rate to another for stored presets - lack of tradable
> presets is the most frequent criticism I've received from users.

The pa_database files indeed aren't portable.

You didn't manage to convince me about the usefulness of the HDF5
format. We try to keep the mandatory dependencies to a minimum, and
pulling libhdf5 as a dependency doesn't seem to bring very big benefits.
I'm not entirely happy with pa_database either, though - I would
personally prefer to use the same "ini-style" format for both state and
configuration files (the difference between those two categories is
small and sometimes unclear anyway). The state files tend to be trivial
in size, so the space inefficiency of text files isn't really a concern.
The benefits of a text-based format would be transparency/hackability
and portability. If the equalizer files really are huge (I don't see why
they would be, though), then we still might want to use some other format
for those files.
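
To make the "ini-style" idea concrete, here is a rough sketch of what a
text-based state file could look like and how it round-trips. It uses
Python's configparser purely for illustration; PulseAudio has its own
parser in C, and the section and key names below are hypothetical, not
an existing file layout.

```python
import configparser
import io

# Hypothetical state for one sink; section and key names are invented.
state = configparser.ConfigParser()
state["sink:alsa_output.example"] = {
    "volume": "0.75",
    "muted": "no",
}

# Serialize to text, as it would be written to a state file.
buf = io.StringIO()
state.write(buf)
text = buf.getvalue()
print(text)

# Read it back, as a daemon (or a user with a text editor) would see it.
loaded = configparser.ConfigParser()
loaded.read_string(text)
volume = loaded.getfloat("sink:alsa_output.example", "volume")
muted = loaded.getboolean("sink:alsa_output.example", "muted")
```

The file stays readable and editable by hand, and nothing in it depends
on the machine's byte order or word size.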

-- 
Tanu