Re: [libextractor] return of getKeywords()


From: Christian Grothoff
Subject: Re: [libextractor] return of getKeywords()
Date: Fri, 30 Mar 2007 15:27:49 -0600
User-agent: KMail/1.9.5

On Friday 30 March 2007 14:56, Ryan Underwood wrote:
> > As you can see, the plugin first checks if this is a JPEG, and then if it
> > is, instantly adds the MIME type.  So even if the JPEG does not contain
> > any other metadata, you'll always get at least the MIME type.
>
> I was referring to the "return prev" and friends after preliminary
> sanity checks.  This is still I/O, but I see the point; the file is
> already opened, so most of the damage is done by that point, especially
> on a network filesystem which caches the whole file on open.

I am not sure that this is actually true in general (that NFS and other 
network file systems cache the entire file on mmap/open).  You may want to 
profile this (generate a huge, 2 GB file, mmap it, and see what happens).
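
A minimal sketch of such an experiment, assuming Linux with glibc (mincore() 
is not in POSIX) and a 64-bit off_t.  The function name and the temporary 
file template are made up for illustration.  It creates a sparse file in a 
directory of your choice, maps it, reads a single byte, and counts how many 
pages of the mapping are resident afterwards.  A sparse file may not behave 
exactly like a file with real data, so treat the result as a first hint only.

```c
#define _DEFAULT_SOURCE /* exposes mincore() and mkstemp() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a sparse file of `size` bytes in `dir`, map it read-only, read one
   byte, and return the number of resident pages in the mapping.
   Returns -1 on any error.  (Hypothetical helper, not part of libextractor.) */
long resident_pages_after_first_touch(const char *dir, off_t size)
{
  char path[4096];
  long page_size = sysconf(_SC_PAGESIZE);
  size_t length = (size_t) size;
  size_t npages;
  size_t i;
  unsigned char *vec;
  void *map;
  long resident = 0;
  int fd;

  assert(dir != NULL && size > 0);
  if (page_size <= 0)
    return -1;
  if (snprintf(path, sizeof(path), "%s/le-mmap-XXXXXX", dir)
      >= (int) sizeof(path))
    return -1;
  fd = mkstemp(path);
  if (fd < 0)
    return -1;
  unlink(path);                     /* file goes away once it is unmapped */
  if (ftruncate(fd, size) != 0) {   /* sparse: no data blocks are written */
    close(fd);
    return -1;
  }
  map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);                        /* the mapping keeps the file alive */
  if (map == MAP_FAILED)
    return -1;
  npages = (length + (size_t) page_size - 1) / (size_t) page_size;
  vec = malloc(npages);
  if (vec == NULL) {
    munmap(map, length);
    return -1;
  }
  {
    /* Fault in the first page only; the volatile sink forces the load. */
    volatile unsigned char sink = *(const unsigned char *) map;
    (void) sink;
  }
  if (mincore(map, length, vec) != 0) {
    resident = -1;
  } else {
    for (i = 0; i < npages; i++)
      if (vec[i] & 1)               /* bit 0 set: page is in memory */
        resident++;
  }
  free(vec);
  munmap(map, length);
  return resident;
}
```

Called with a directory on the network mount and a size of (off_t) 2 << 30, 
a result close to the total page count would suggest that the whole file was 
pulled in; a small number would suggest that it was not.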

> > Actually, you can already do this with the existing API.  All you need is
> > (manually) construct a mapping of mime-types/file-extensions to LE plugin
> > names (based on your assumptions of what mime-types/extensions could
> > possibly be handled by a particular plugin) and then just
> > use "EXTRACTOR_addLibrary(NULL, "pluginname")" to load just the right
> > plugin for each extension (also avoids the cost of loading useless
> > plugins!). Keep the resulting ExtractorLists in memory (and re-use for
> > all files of that type/extension).  So this can easily be done without
> > changing the API at all.
>
> This sounds good; are the LE plugin names static enough to rely upon in
> compiled code?

Yes.  We do not change those around -- after all, users can use them in 
configuration files (see for example the GNUnet FS EXTRACTOR configuration 
option where users specify which plugins they want to load).  Naturally, 
there may be new plugins from time to time, so ideally you may want to put 
this information not into the binary but into a configuration file (mapping 
extensions/mime-types to LE plugins).  Ship with a reasonable default, and 
most users will never have to worry about it.
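
As a rough illustration of such a configuration file, here is a sketch of a 
parser for lines of the form "jpg = libextractor_jpeg".  The file format, 
the struct, and the function names are assumptions made up for this example, 
and the plugin name is only a plausible value; check which plugins are 
actually installed on your system.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 64

/* One entry of the (hypothetical) mapping file. */
struct mapping {
  char key[MAX_NAME];     /* file extension or mime-type */
  char plugin[MAX_NAME];  /* LE plugin name to load for it */
};

/* Copy [begin, end) into dst without surrounding whitespace.
   Returns 0 on success, -1 if the result is empty or too long. */
static int copy_trimmed(char *dst, const char *begin, const char *end)
{
  size_t len;

  while (begin < end && isspace((unsigned char) *begin))
    begin++;
  while (end > begin && isspace((unsigned char) end[-1]))
    end--;
  len = (size_t) (end - begin);
  if (len == 0 || len >= MAX_NAME)
    return -1;
  memcpy(dst, begin, len);
  dst[len] = '\0';
  return 0;
}

/* Parse one line of the form "key = plugin".
   Returns 1 if an entry was parsed, 0 for blank lines and '#' comments,
   and -1 for malformed lines. */
int parse_mapping_line(const char *line, struct mapping *out)
{
  const char *eq;

  assert(line != NULL && out != NULL);
  while (isspace((unsigned char) *line))
    line++;
  if (*line == '\0' || *line == '#')
    return 0;
  eq = strchr(line, '=');
  if (eq == NULL)
    return -1;
  if (copy_trimmed(out->key, line, eq) != 0)
    return -1;
  if (copy_trimmed(out->plugin, eq + 1, eq + 1 + strlen(eq + 1)) != 0)
    return -1;
  return 1;
}

/* Return the plugin name configured for `key`, or NULL if there is none. */
const char *lookup_plugin(const struct mapping *map, size_t count,
                          const char *key)
{
  size_t i;

  for (i = 0; i < count; i++)
    if (strcmp(map[i].key, key) == 0)
      return map[i].plugin;
  return NULL;
}
```

The name returned by lookup_plugin() is what the application would then pass 
to EXTRACTOR_addLibrary(NULL, name).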

> > Well, the above optimization allows you to avoid calling plugins that you
> > do not like to call.  But again, note that no IO is done if you use
> > getKeywords2, and even with getKeywords, IO is only done once
> > per "getKeywords" call, never once per plugin.
>
> Yeah, as I said above, noticing that mmap is used instead of reading
> into a buffer alleviates another concern.
>
> My main concern is still avoiding that initial file open, because it is
> quite expensive here.  In order to do that in my application, I have to
> be able to tell if libextractor handled the file or ignored it.

I think it is better to configure your application to specifically use or avoid 
LE for particular extensions/mime-types (with a configuration file) instead 
of changing LE to provide heuristic information.  Also, this way you will be 
guaranteed deterministic behavior from your application: for extensions where 
LE is enabled, it will *always* be run, because it *always* makes sense.  For 
those where it does not make sense, LE will *never* be run, instead of once 
for a (random) first file.
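
To make the "always or never" behavior concrete, here is a sketch of a 
per-extension cache.  It does not link against libextractor: the loader is 
passed in as a function pointer, and stub_loader only stands in for a call 
such as EXTRACTOR_addLibrary(NULL, name).  All type and function names below 
are hypothetical.  An extension without a table entry yields NULL, which 
means "do not open the file for LE at all".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Loads a plugin by name and returns an opaque handle (NULL on failure). */
typedef void *(*plugin_loader)(const char *plugin_name);

struct plugin_slot {
  const char *extension;    /* e.g. "jpg", compared case-sensitively */
  const char *plugin_name;  /* assumed plugin name for that extension */
  void *handle;             /* NULL until the plugin is first needed */
};

/* Stand-in loader so that this sketch compiles without libextractor.
   It counts its calls and returns a non-NULL dummy handle. */
int stub_load_calls = 0;

void *stub_loader(const char *plugin_name)
{
  stub_load_calls++;
  return (void *) plugin_name;
}

/* Return the cached handle for the extension of `filename`, loading the
   plugin on first use.  Returns NULL if the extension is not configured,
   in which case the caller should skip LE for this file. */
void *extractors_for_file(struct plugin_slot *slots, size_t count,
                          const char *filename, plugin_loader load)
{
  const char *dot;
  size_t i;

  assert(slots != NULL && filename != NULL && load != NULL);
  dot = strrchr(filename, '.');
  if (dot == NULL || dot[1] == '\0')
    return NULL;                    /* no extension */
  if (strchr(dot, '/') != NULL)
    return NULL;                    /* the dot belongs to a directory name */
  for (i = 0; i < count; i++) {
    if (strcmp(slots[i].extension, dot + 1) != 0)
      continue;
    if (slots[i].handle == NULL)
      slots[i].handle = load(slots[i].plugin_name);
    return slots[i].handle;
  }
  return NULL;
}
```

With one slot for "jpg", two lookups for "a.jpg" and "b.jpg" call the loader 
once and return the same handle, while "c.exe" never reaches the loader.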

> Based on your previous comments noting that the plugins set the mime
> type, it seems like an EXTRACTOR_MIMETYPE keyword type will only ever be
> set if any plugin claims the file.  So could I then assume that if no
> keywords of type EXTRACTOR_MIMETYPE exist, then the file was effectively
> ignored by all plugins that were loaded?

Not exactly, since some plugins do not set MIMETYPE because they cannot be 
sure (HTML is one example).  Also, some plugins are not mime-specific 
(filename, printable, hash).  These plugins may or may not add metadata, but 
will never set a mimetype.
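
The distinction can be sketched as a three-way classification of the result 
list.  The struct and enum below are simplified stand-ins invented for this 
example, not the real libextractor declarations; only the idea of walking a 
linked list of typed keywords is taken from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical keyword types; the real library defines many more. */
enum kw_type { KW_UNKNOWN, KW_FILENAME, KW_MIMETYPE, KW_TITLE };

/* Simplified stand-in for a node of the keyword list. */
struct kw_node {
  enum kw_type type;
  const char *value;
  const struct kw_node *next;
};

enum le_result {
  LE_NO_KEYWORDS,               /* no plugin produced anything */
  LE_KEYWORDS_WITHOUT_MIMETYPE, /* e.g. HTML, or only generic plugins */
  LE_HAS_MIMETYPE               /* some plugin was sure about the type */
};

/* Classify a keyword list.  A missing mime-type is reported separately
   from an empty list, because it does not imply the file was ignored. */
enum le_result classify_keywords(const struct kw_node *list)
{
  enum le_result result = LE_NO_KEYWORDS;

  for (; list != NULL; list = list->next) {
    assert(list->value != NULL);    /* every keyword carries a value */
    if (list->type == KW_MIMETYPE)
      return LE_HAS_MIMETYPE;
    result = LE_KEYWORDS_WITHOUT_MIMETYPE;
  }
  return result;
}
```

Only the first case would mean that every loaded plugin ignored the file. 
Note that generic plugins such as filename or hash can produce keywords for 
almost any file, so the middle case does not say which plugin responded.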

Christian



