Reducing the memory overhead of the mime system

Matthias Clasen mclasen at redhat.com
Mon Mar 28 23:11:48 EEST 2005


As things currently are, the xdgmime implementation of the shared mime
info spec uses ~50k of heap space and does a fair bit of text-file
parsing to populate it with data that rarely changes and is the same
for all clients. To make things worse, xdgmime is reused by copying its
source code, so gtk+ and gnome-vfs each have their own copy. As a
consequence, applications using both gtk+ and gnome-vfs (with the new
filechooser, this means basically all gtk+ applications) pay the 50k
price twice.

An easy and well-established way to avoid both the parsing and the
memory overhead is to use an mmappable cache file. Below is a proposal
for such a format, which closely follows the in-memory data structures
currently used by xdgmime. Patches to make update-mime-database generate
such a cache file and to make xdgmime use the cache files are in
bugzilla (
https://bugs.freedesktop.org/show_bug.cgi?id=2804
https://bugs.freedesktop.org/show_bug.cgi?id=2805
)
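
To illustrate what "mmappable" buys a client, here is a minimal sketch
of mapping such a cache file read-only with POSIX mmap(). The helper
name map_cache_file is hypothetical and is not taken from the patches;
error handling is reduced to returning NULL.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the whole cache file read-only.
 * Returns NULL on failure; on success *size_out holds the file size.
 * The caller releases the mapping with munmap(). */
static const unsigned char *map_cache_file(const char *path, size_t *size_out)
{
    struct stat st;
    void *addr = MAP_FAILED;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;

    /* An empty file cannot be mapped and cannot hold a header anyway. */
    if (fstat(fd, &st) == 0 && st.st_size > 0)
        addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);

    if (addr == MAP_FAILED)
        return NULL;

    *size_out = (size_t)st.st_size;
    return (const unsigned char *)addr;
}
```

Since the pages are read-only and file-backed, the kernel can share
them between all processes that map the same cache, and nothing needs
to be parsed or copied onto the heap.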

To make sense of the specification below, some acquaintance with the
xdgmime data structures will probably be required...


Matthias


Header:
2			CARD16		MAJOR_VERSION	1	
2			CARD16		MINOR_VERSION	0	
4			CARD32		ALIAS_LIST_OFFSET
4			CARD32		PARENT_LIST_OFFSET
4			CARD32		LITERAL_LIST_OFFSET
4			CARD32		SUFFIX_LIST_OFFSET
4			CARD32		GLOB_LIST_OFFSET
4			CARD32		MAGIC_LIST_OFFSET
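
As a rough illustration of the header layout, the following sketch
decodes the 28-byte header from a mapped buffer. The struct and
function names are hypothetical. Rejecting any major version other
than 1 and checking that each offset leaves room for a CARD32 count
are assumptions of this sketch, not requirements stated in the
proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian readers; they work byte by byte, so no alignment is assumed. */
static uint16_t get_card16(const unsigned char *cache, size_t offset)
{
    return (uint16_t)((cache[offset] << 8) | cache[offset + 1]);
}

static uint32_t get_card32(const unsigned char *cache, size_t offset)
{
    return ((uint32_t)cache[offset] << 24) |
           ((uint32_t)cache[offset + 1] << 16) |
           ((uint32_t)cache[offset + 2] << 8) |
            (uint32_t)cache[offset + 3];
}

#define MIME_CACHE_HEADER_SIZE 28 /* 2 + 2 + 6 * 4 bytes */

/* Hypothetical in-memory copy of the header, fields in file order. */
struct mime_cache_header {
    uint16_t major_version;
    uint16_t minor_version;
    uint32_t alias_list_offset;
    uint32_t parent_list_offset;
    uint32_t literal_list_offset;
    uint32_t suffix_list_offset;
    uint32_t glob_list_offset;
    uint32_t magic_list_offset;
};

/* Returns 0 on success, -1 if the buffer is too short, the major
 * version is not 1, or an offset points outside the buffer. */
static int mime_cache_read_header(const unsigned char *cache, size_t size,
                                  struct mime_cache_header *h)
{
    if (size < MIME_CACHE_HEADER_SIZE)
        return -1;

    h->major_version = get_card16(cache, 0);
    h->minor_version = get_card16(cache, 2);
    if (h->major_version != 1)
        return -1;

    h->alias_list_offset   = get_card32(cache, 4);
    h->parent_list_offset  = get_card32(cache, 8);
    h->literal_list_offset = get_card32(cache, 12);
    h->suffix_list_offset  = get_card32(cache, 16);
    h->glob_list_offset    = get_card32(cache, 20);
    h->magic_list_offset   = get_card32(cache, 24);

    /* Every list begins with a CARD32 count, so 4 bytes must fit. */
    const uint32_t offsets[] = {
        h->alias_list_offset, h->parent_list_offset, h->literal_list_offset,
        h->suffix_list_offset, h->glob_list_offset, h->magic_list_offset
    };
    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        if (offsets[i] > size || size - offsets[i] < 4)
            return -1;
    }
    return 0;
}
```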

AliasList:
4			CARD32		N_ALIASES
8*N_ALIASES		AliasListEntry

AliasListEntry:
4			CARD32		ALIAS_OFFSET
4			CARD32		MIME_TYPE_OFFSET
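
Because the alias list is sorted by alias (see the notes at the end),
a lookup can be a binary search directly on the mapped data. The
sketch below assumes the sort order is plain strcmp() byte order and
omits bounds checks on the offsets it reads; the function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_card32(const unsigned char *cache, size_t offset)
{
    return ((uint32_t)cache[offset] << 24) |
           ((uint32_t)cache[offset + 1] << 16) |
           ((uint32_t)cache[offset + 2] << 8) |
            (uint32_t)cache[offset + 3];
}

/* Returns the mime type that `alias` stands for, or NULL if the cache
 * has no such alias. The result points into the cache buffer. */
static const char *mime_cache_unalias(const unsigned char *cache,
                                      uint32_t alias_list_offset,
                                      const char *alias)
{
    uint32_t lo = 0;
    uint32_t hi = get_card32(cache, alias_list_offset); /* N_ALIASES */

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        /* Entries are 8 bytes each and follow the 4-byte count. */
        uint32_t entry = alias_list_offset + 4 + 8 * mid;
        const char *candidate =
            (const char *)cache + get_card32(cache, entry);
        int cmp = strcmp(candidate, alias);

        if (cmp == 0)
            return (const char *)cache + get_card32(cache, entry + 4);
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

The same search should carry over to the LiteralList, which has the
same entry layout and is sorted by the literal.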

ParentList:
4			CARD32		N_ENTRIES 
8*N_ENTRIES		ParentListEntry

ParentListEntry:
4			CARD32		MIME_TYPE_OFFSET
4			CARD32		PARENTS_OFFSET

Parents:
4			CARD32		N_PARENTS
4*N_PARENTS		CARD32		MIME_TYPE_OFFSET
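
The two-level parent structure can be walked as in the sketch below.
The notes do not say that the ParentList is sorted, so this sketch
assumes nothing about its order and scans it linearly. The function
name is hypothetical, and bounds checks are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_card32(const unsigned char *cache, size_t offset)
{
    return ((uint32_t)cache[offset] << 24) |
           ((uint32_t)cache[offset + 1] << 16) |
           ((uint32_t)cache[offset + 2] << 8) |
            (uint32_t)cache[offset + 3];
}

/* Stores up to max_parents parent type names of mime_type in parents[]
 * and returns how many were stored (0 if the type has no entry).
 * The stored pointers point into the cache buffer. */
static uint32_t mime_cache_list_parents(const unsigned char *cache,
                                        uint32_t parent_list_offset,
                                        const char *mime_type,
                                        const char **parents,
                                        uint32_t max_parents)
{
    uint32_t n_entries = get_card32(cache, parent_list_offset);

    for (uint32_t i = 0; i < n_entries; i++) {
        uint32_t entry = parent_list_offset + 4 + 8 * i;
        const char *name = (const char *)cache + get_card32(cache, entry);

        if (strcmp(name, mime_type) != 0)
            continue;

        /* PARENTS_OFFSET points at a Parents block: count, then offsets. */
        uint32_t parents_offset = get_card32(cache, entry + 4);
        uint32_t n_parents = get_card32(cache, parents_offset);
        uint32_t n = n_parents < max_parents ? n_parents : max_parents;

        for (uint32_t j = 0; j < n; j++) {
            parents[j] = (const char *)cache +
                         get_card32(cache, parents_offset + 4 + 4 * j);
        }
        return n;
    }
    return 0;
}
```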

LiteralList:
4			CARD32		N_LITERALS
8*N_LITERALS		LiteralEntry	

LiteralEntry:
4			CARD32		LITERAL_OFFSET
4			CARD32		MIME_TYPE_OFFSET

GlobList:
4			CARD32		N_GLOBS
8*N_GLOBS		GlobEntry	

GlobEntry:
4			CARD32		GLOB_OFFSET
4			CARD32		MIME_TYPE_OFFSET

SuffixTree:
4			CARD32		N_ROOTS
4	 		CARD32		FIRST_ROOT_OFFSET

SuffixTreeNode:
4			CARD32		CHARACTER
4			CARD32		MIME_TYPE_OFFSET
4			CARD32		N_CHILDREN			
4			CARD32		FIRST_CHILD_OFFSET
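
A lookup in the suffix tree might look like the sketch below. It rests
on several assumptions that the layout alone does not settle: sibling
nodes are stored contiguously as 16-byte records starting at
FIRST_ROOT_OFFSET or FIRST_CHILD_OFFSET, the tree is keyed by the
suffix read left to right (".", "c"), a MIME_TYPE_OFFSET of 0 means
"no type at this node", and only single-byte characters are handled.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t get_card32(const unsigned char *cache, size_t offset)
{
    return ((uint32_t)cache[offset] << 24) |
           ((uint32_t)cache[offset + 1] << 16) |
           ((uint32_t)cache[offset + 2] << 8) |
            (uint32_t)cache[offset + 3];
}

/* Match `suffix` exactly against the tree, starting at a sibling array
 * of n_nodes nodes. Siblings are sorted by character, so each level is
 * a binary search. */
static const char *suffix_tree_match(const unsigned char *cache,
                                     uint32_t n_nodes, uint32_t first_offset,
                                     const char *suffix)
{
    while (*suffix != '\0' && n_nodes > 0) {
        uint32_t ch = (unsigned char)*suffix;
        uint32_t lo = 0, hi = n_nodes, node = 0;
        int found = 0;

        while (lo < hi) {
            uint32_t mid = lo + (hi - lo) / 2;
            uint32_t off = first_offset + 16 * mid;
            uint32_t c = get_card32(cache, off); /* CHARACTER */

            if (c == ch) {
                node = off;
                found = 1;
                break;
            }
            if (c < ch)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (!found)
            return NULL;

        suffix++;
        if (*suffix == '\0') {
            /* Whole suffix consumed: the node may carry a mime type. */
            uint32_t mime = get_card32(cache, node + 4);
            return mime != 0 ? (const char *)cache + mime : NULL;
        }
        n_nodes = get_card32(cache, node + 8);       /* N_CHILDREN */
        first_offset = get_card32(cache, node + 12); /* FIRST_CHILD_OFFSET */
    }
    return NULL;
}

/* Try every tail of file_name, longest first, as a candidate suffix. */
static const char *mime_cache_lookup_suffix(const unsigned char *cache,
                                            uint32_t suffix_tree_offset,
                                            const char *file_name)
{
    uint32_t n_roots = get_card32(cache, suffix_tree_offset);
    uint32_t first_root = get_card32(cache, suffix_tree_offset + 4);

    for (const char *p = file_name; *p != '\0'; p++) {
        const char *mime = suffix_tree_match(cache, n_roots, first_root, p);
        if (mime != NULL)
            return mime;
    }
    return NULL;
}
```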

MagicList:
4			CARD32		N_MATCHES
4			CARD32		MAX_EXTENT
4			CARD32		FIRST_MATCH_OFFSET

Match:
4			CARD32		PRIORITY
4			CARD32		MIME_TYPE_OFFSET
4			CARD32		N_MATCHLETS
4			CARD32		FIRST_MATCHLET_OFFSET

Matchlet:
4			CARD32		RANGE_START
4			CARD32		RANGE_LENGTH
4			CARD32		WORD_SIZE
4			CARD32		VALUE_LENGTH
4			CARD32		VALUE
4			CARD32		MASK
4			CARD32		N_CHILDREN
4			CARD32		FIRST_CHILD_OFFSET


Notes:

* The list of aliases is sorted by alias, and the list of
  literal globs is sorted by the literal. The SuffixTreeNode
  siblings are sorted by character.

* All offsets are in bytes from the beginning of the file.

* Strings are zero-terminated

* All numbers are in network (big-endian) order. This is
  necessary because the data will be stored in arch-independent
  directories like /usr/share/mime or even in users'
  home directories.

* Cache files have to be written atomically - write to a
  temporary name, then move over the old file - so that
  clients that have the old cache file open and mmap'ed
  won't get corrupt data.
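
The atomic-write note could be implemented roughly as follows on a
POSIX system. This is only a sketch: the function name and the
".tmp.<pid>" naming scheme are made up for illustration, and the
fsync() before the rename is an extra precaution that the note does
not ask for.

```c
#define _POSIX_C_SOURCE 200809L

#include <stddef.h>
#include <stdio.h>
#include <unistd.h> /* fsync, getpid */

/* Write `size` bytes to a temporary file next to `path`, then rename
 * it over `path`. Returns 0 on success, -1 on failure. */
static int write_cache_atomically(const char *path, const void *data,
                                  size_t size)
{
    char tmp_path[4096];
    FILE *f;
    int ok;
    int n = snprintf(tmp_path, sizeof tmp_path, "%s.tmp.%ld",
                     path, (long)getpid());

    if (n < 0 || (size_t)n >= sizeof tmp_path)
        return -1;

    f = fopen(tmp_path, "wb");
    if (f == NULL)
        return -1;

    /* Push the data to disk before the new name becomes visible. */
    ok = fwrite(data, 1, size, f) == size &&
         fflush(f) == 0 &&
         fsync(fileno(f)) == 0;
    if (fclose(f) != 0)
        ok = 0;

    /* rename() replaces the old file in one step. */
    if (ok && rename(tmp_path, path) == 0)
        return 0;

    remove(tmp_path);
    return -1;
}
```

A client that already has the old file mapped keeps seeing the old
contents, because its mapping refers to the old file rather than to
the name; it only picks up the new cache when it opens the path again.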



