[clahey at ximian.com: docbook mime type detection], > (Bastien Nocera)

Wed Jul 27 15:49:44 EEST 2005

Bastien Nocera <hadess at hadess.net> wrote:
> Do you really need to go 200 bytes into the file? The further you need
> to go into the file, the more expensive it is. I would also add that my
Of course it is more expensive, but in practice matching 200 or 2048
bytes makes no real difference once you take into account the time
needed to stat the file and then read it.

Three months ago I measured the differences for various patterns. There
was a difference between filename matching and pattern matching (of
course), but I could not see a real difference between various match
lengths or ranges. Filesystem I/O was much more time-consuming than the
matching. I tested 30 different file types with a simple algorithm
implemented in Java on Sun/???, Linux/ext3, IRIX/xfs and XP/ntfs,
without big differences between the platforms. Intensive testing on
Linux/ext3, warm cache vs. cold cache, gave an average ratio of around
1:120 with a deviation of around 40%.

My conclusion was: matching is cheap, stat is evil, reading is evil.
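
A minimal sketch of how such a measurement could be set up in Java. This
is not the original test code: the class SniffTiming and its methods are
hypothetical names chosen for illustration, and a single timed run like
this is only a rough indication, not a careful benchmark. It times the
three steps separately (stat, reading the first N bytes, searching them
for a pattern), so the 200 vs. 2048 byte question can be checked
directly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class SniffTiming {

    // Reads at most maxBytes from the start of the file.
    static byte[] readHead(Path file, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < maxBytes) {
                int n = in.read(buf, filled, maxBytes - filled);
                if (n < 0) {
                    break; // end of file reached before maxBytes
                }
                filled += n;
            }
        }
        return Arrays.copyOf(buf, filled);
    }

    // Naive search; returns the first index of pattern in head, or -1.
    static int indexOf(byte[] head, byte[] pattern) {
        for (int start = 0; start + pattern.length <= head.length; start++) {
            int i = 0;
            while (i < pattern.length && head[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    // Returns { statNanos, readNanos, matchNanos } for one file.
    static long[] timeOnce(Path file, int maxBytes, byte[] pattern)
            throws IOException {
        long t0 = System.nanoTime();
        Files.readAttributes(file, BasicFileAttributes.class); // the "stat"
        long t1 = System.nanoTime();
        byte[] head = readHead(file, maxBytes);
        long t2 = System.nanoTime();
        indexOf(head, pattern);
        long t3 = System.nanoTime();
        return new long[] { t1 - t0, t2 - t1, t3 - t2 };
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        byte[] pattern = "DocBook".getBytes("US-ASCII");
        for (int maxBytes : new int[] { 200, 2048 }) {
            long[] t = timeOnce(file, maxBytes, pattern);
            System.out.println(maxBytes + " bytes: stat=" + t[0]
                    + "ns read=" + t[1] + "ns match=" + t[2] + "ns");
        }
    }
}
```

Run it with a file name as the only argument. The first call for a file
hits whatever state the cache is in, so warm and cold runs have to be
arranged outside the program.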

An algorithm that tests _every_ match to get _every_ possible mimetype
(ordered by priority) for a file takes around 3-8% more time than an
algorithm that stops after the first positive match (cold cache, around
10000 files, Linux/ext3 and XP/ntfs without big differences).
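
The two strategies compared above can be sketched as follows. This is an
illustration under assumptions, not the measured implementation: the
classes MagicRule and MagicMatcher, the rule layout (an ASCII pattern
that may start anywhere inside a byte range) and the priority values are
all hypothetical. Both methods work on the same bytes already read from
the file, which is why the extra cost of testing every rule stays small
compared to the I/O.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical rule: an ASCII pattern that may start at any offset in
// [rangeStart, rangeEnd] of the file head; higher priority wins.
final class MagicRule {
    final String mimeType;
    final int rangeStart;
    final int rangeEnd;
    final byte[] pattern;
    final int priority;

    MagicRule(String mimeType, int rangeStart, int rangeEnd,
              String pattern, int priority) {
        this.mimeType = mimeType;
        this.rangeStart = rangeStart;
        this.rangeEnd = rangeEnd;
        this.pattern = pattern.getBytes(StandardCharsets.US_ASCII);
        this.priority = priority;
    }

    boolean matches(byte[] head) {
        // Last start offset that still leaves room for the whole pattern.
        int last = Math.min(rangeEnd, head.length - pattern.length);
        for (int start = rangeStart; start <= last; start++) {
            int i = 0;
            while (i < pattern.length && head[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }
}

final class MagicMatcher {
    private final List<MagicRule> rules;

    MagicMatcher(List<MagicRule> rules) {
        this.rules = new ArrayList<>(rules);
        // Highest priority first, so the first hit is also the best hit.
        this.rules.sort(
            Comparator.comparingInt((MagicRule r) -> r.priority).reversed());
    }

    // Stops after the first positive match; null if nothing matches.
    String firstMatch(byte[] head) {
        for (MagicRule rule : rules) {
            if (rule.matches(head)) {
                return rule.mimeType;
            }
        }
        return null;
    }

    // Tests every rule; returns all matching types in priority order.
    List<String> allMatches(byte[] head) {
        List<String> result = new ArrayList<>();
        for (MagicRule rule : rules) {
            if (rule.matches(head)) {
                result.add(rule.mimeType);
            }
        }
        return result;
    }
}
```

For a DocBook file one could, for example, register an "<?xml" rule at
offset 0 with a low priority and a DocBook public identifier rule over
the range 0-200 with a higher one. firstMatch() then returns only the
DocBook type, while allMatches() returns both.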

-Timo-

--
----------------------------------------------------------------------------
Timo Stuelten -- Braunschweig -- Germany -- timo at stuelten.de