Improving type detection

Fri May 2 07:31:11 PDT 2014

Hi Maxim,

On Fri, 2014-05-02 at 12:41 +0300, Maxim Monastirsky wrote:
> On Thursday 01 May 2014 09:29:48 Kohei Yoshida wrote:
> > So, I looked over those changes, and I do like the changes. :-)
> Thanks Kohei!
> 
> > He was concerned about having to "detect" zip
> > storage over and over again which he rightly said was not great for
> > performance.
> 
> It makes me think of another point. There are some detectors that do exactly 
> the same detection procedure for all supported types. For example - oox, xml, 
> and now the new storage one. If such detector didn't detect anything useful 
> once, we can be sure that it won't detect anything also in the next runs. So 
> it doesn't make sense to run it again and again.

I agree.  I think it makes sense to record some facts such as

* this is (not) a zip storage.
* this is (not) a valid ooxml format.
* this is (not) a valid ODF format.
* this is (not) a valid BIFF storage.

etc., and I can imagine storing these pieces of information with the
MediaDescriptor instance to help the subsequent detectors to skip
redundant detection routines.  Actually, maybe we could just record the
detected storage type, such as

"DetectedStorage"

  + not detected -> detector should try to detect and store the result.
  + zip
  + gzip
  + biff
  + etc

"DetectedXMLType"

  + not detected -> detector should try to detect the XML type and store
    the result.
  + ODF
  + OOXML

so that we can store all this information using just one slot of the
MediaDescriptor rather than storing multiple boolean values.
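
To make the idea concrete, here is a minimal sketch of the "detect once,
then reuse" slot.  It is an illustration only: Descriptor is a plain
string map standing in for MediaDescriptor, and sniffStorage(),
detectedStorage(), the "unknown" value and the "ole2" value (the OLE2
compound file container commonly used for BIFF workbooks) are all
hypothetical names, not existing LibreOffice API.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for MediaDescriptor: a string-keyed property bag.
using Descriptor = std::map<std::string, std::string>;
using Bytes = std::vector<std::uint8_t>;

// Counts how often the (potentially expensive) sniffing really runs.
static int g_sniffCount = 0;

static bool startsWith(const Bytes& data, const Bytes& magic)
{
    return data.size() >= magic.size()
        && std::equal(magic.begin(), magic.end(), data.begin());
}

// Classifies the container by its leading magic bytes.
std::string sniffStorage(const Bytes& data)
{
    ++g_sniffCount;
    if (startsWith(data, {0x50, 0x4B, 0x03, 0x04}))   // "PK\3\4" local file header
        return "zip";
    if (startsWith(data, {0x1F, 0x8B}))               // gzip member header
        return "gzip";
    if (startsWith(data, {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}))
        return "ole2";                                // OLE2 compound file
    return "unknown";                                 // sniffed, nothing matched
}

// A missing "DetectedStorage" key means "not detected": sniff and store.
// Any later caller gets the stored answer without sniffing again.
const std::string& detectedStorage(Descriptor& desc, const Bytes& data)
{
    auto it = desc.find("DetectedStorage");
    if (it == desc.end())
        it = desc.emplace("DetectedStorage", sniffStorage(data)).first;
    return it->second;
}
```

A detector that only handles zip-based formats would call
detectedStorage() first and return early unless the result is "zip".
Note that "unknown" is stored as well, so a file that matched nothing is
not sniffed again either.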

Having said that, I don't think we have to go to the extent of saying
'hey, this is definitely not "XYZ format", don't bother trying to detect
it'.  The idea itself may make sense, but the way the detection services
are currently set up would make it a bit challenging to implement such
additional checks.  And since the number of file formats to detect
against is quite small (~120), simply iterating over all of them should
not cause a performance issue once we put the above mechanism in place
to avoid redundant checks.
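
As a rough sketch of why the full iteration stays cheap, the loop below
walks every detector in its existing order, while the container check
itself runs at most once per file.  DetectionContext, Detector,
storageKind() and detectType() are hypothetical names made up for this
illustration; the real detection loop in typedetection.cxx is organized
differently.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical shared state handed to every detector in turn.
struct DetectionContext
{
    std::string header;                          // first bytes of the file
    std::optional<std::string> detectedStorage;  // empty until first checked
    int containerChecks = 0;                     // how often the real check ran
};

// Runs the container check at most once per context and caches the answer.
const std::string& storageKind(DetectionContext& ctx)
{
    if (!ctx.detectedStorage)
    {
        ++ctx.containerChecks;
        const bool isZip = ctx.header.compare(0, 4, "PK\x03\x04") == 0;
        ctx.detectedStorage = isZip ? "zip" : "unknown";
    }
    return *ctx.detectedStorage;
}

struct Detector
{
    std::string typeName;
    std::function<bool(DetectionContext&)> matches;
};

// Walks the detector list in its given order and returns the first match,
// or an empty string if nothing matched.  The list is neither shortened
// nor reordered; detectors simply become cheap once the answer is cached.
std::string detectType(DetectionContext& ctx,
                       const std::vector<Detector>& detectors)
{
    for (const Detector& d : detectors)
        if (d.matches(ctx))
            return d.typeName;
    return {};
}
```

With 120 zip-based detectors and a file that is not a zip, every
detector is still visited, but containerChecks ends up at 1 rather than
120.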

> Maybe we can store a list of such detectors in some config file, and add a 
> corresponding check to the detection loop. This also would be a bit cleaner 
> solution for fdo#46310. What is the best place to store such list?

We already have a list of detectors, and they are sorted in order of
complexity for strategic reasons.
filter/source/config/cache/typedetection.cxx is the place where the list
is stored and maintained.  But as I said above, I'd like us to try the
above mechanism first and see if that improves the situation a bit.
I'm a bit cautious about either shortening or reordering this master
detector list, since I've seen such changes cause quite hard-to-debug
(and hard-to-fix) format detection bugs in the past.

Best,

Kohei