Re: [libextractor] return of getKeywords()


From: Ryan Underwood
Subject: Re: [libextractor] return of getKeywords()
Date: Fri, 30 Mar 2007 14:12:06 -0500
User-agent: Mutt/1.5.13 (2006-08-11)

On Fri, Mar 30, 2007 at 12:41:23PM -0600, Christian Grothoff wrote:
> 
> First of all, it is not clear that this is true: LE may be able to handle some 
> (versions/variants) of a particular mime type, but not others.  Also, if an 
> extractor succeeds, it generally adds *at least* the MIME type of the file, 

Where?  I'm only looking at the plugin interface, and it seems most of
the plugins don't do anything regarding a MIME type.  They usually just
test whether this is the right type of file and bail out if not.

> Furthermore, determining the MIME type itself is not always guaranteed to
> work (using whatever approach -- file extensions, libmagic, file -- all of
> these can fail in corner cases).

Correct, that's why the application should get to decide if it wants to
trade off identification precision for speed.

> Finally, the performance should not be an issue: libextractor *first*
> attempts to determine the MIME type and uses this to possibly exclude
> running certain plugins.  Furthermore, all extractors also first check if
> they actually might apply to the given file, and bail out quickly if not.
> As a result, extracting metadata for a particular file is usually extremely
> fast.  Most of the time is most likely spent loading libextractor's plugins
> the first time (which is a one-time cost, not per-file, and would thus not
> be reduced by your proposal) and by the OS doing IO for the file itself.

Yes, but by design the libextractor plugins are always performing I/O on
the file by opening and reading it.  This is where I am running into a
performance problem.

> Most of this IO overhead is again already spent when you just try to run 
> libmagic or any other non-trivial mime-type detector, so running "full" LE in 
> addition is unlikely to show up in performance at all.

Not per-file.  In my particular application I can be assured that every
file with extension .foo is actually an application/foo file, so
libmagic is only consulted for file extensions that have not yet had
their MIME types identified and entered into the search engine's map.
libmagic thus performs I/O once *per encountered extension*, not several
times *per-file* as libextractor's plugin chain does.
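
To make the per-extension lookup concrete, here is a minimal sketch of
such a map.  Everything in it is an assumption made for illustration:
`mime_detect_fn` stands in for whatever libmagic call the application
uses, `fake_detect` is a counting stub, and none of these names are
libextractor or libmagic API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical detector callback: stands in for a libmagic lookup that
 * has to open and read the file.  Assumed to return a MIME type string
 * that stays valid for the lifetime of the map. */
typedef const char *(*mime_detect_fn)(const char *path);

#define MAX_EXT 64

struct ext_entry {
    char ext[16];       /* extension without the dot, e.g. "foo" */
    const char *mime;   /* MIME type reported by the detector */
};

struct ext_map {
    struct ext_entry entries[MAX_EXT];
    size_t count;
};

/* Return the text after the last '.' in the file name, or "" if none. */
static const char *path_extension(const char *path)
{
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');
    if (dot == NULL || (slash != NULL && dot < slash))
        return "";
    return dot + 1;
}

/* Look up the MIME type for path.  The detector (and its file I/O) runs
 * only the first time an extension is seen; later files with the same
 * extension are answered from the map. */
static const char *mime_for_path(struct ext_map *map, const char *path,
                                 mime_detect_fn detect)
{
    const char *ext = path_extension(path);
    for (size_t i = 0; i < map->count; i++)
        if (strcmp(map->entries[i].ext, ext) == 0)
            return map->entries[i].mime;

    const char *mime = detect(path);    /* the only I/O in this function */
    if (map->count < MAX_EXT && strlen(ext) < sizeof map->entries[0].ext) {
        strcpy(map->entries[map->count].ext, ext);
        map->entries[map->count].mime = mime;
        map->count++;
    }
    return mime;
}

/* Stand-in detector for demonstration; counts how often it is called. */
static int detect_calls = 0;
static const char *fake_detect(const char *path)
{
    (void)path;
    detect_calls++;
    return "application/foo";
}
```

With this, looking up one.foo and then two.foo invokes the detector
once; the second lookup never touches the file.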

> In summary, I believe your proposed trade-off -- minimal gains in performance 
> vs. possibly not showing metadata that could have been found (and additional 
> complexity of the API of LE and of the client code) is not a good one.

I don't believe I suggested that metadata should be thrown out by API
design.  All I suggested is that libextractor should somehow make it
clear to the caller if the file was ignored by all plugins.  The caller
can then decide whether he cares about the risk of possibly missing data
from later files of the same type.  For example, in a file archive where
most files have exactly the same structure, and where I/O is expensive
(across a network filesystem), this would be quite a big optimization:
if one file is ignored, 10,000 others of the same type will also be
ignored, so why bother calling libextractor at all after one try?
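
A rough sketch of how an application could use such a result follows.
It assumes a distinguishable "ignored" return value, which is exactly
what the current API does not provide; `EXTRACT_IGNORED`, `extract_fn`
and the other names are hypothetical and only illustrate the idea.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result code meaning "no plugin recognized this file". */
#define EXTRACT_IGNORED (-1)

/* Hypothetical extraction callback standing in for a call into the
 * library.  Returns the number of keywords found (possibly 0), or
 * EXTRACT_IGNORED if every plugin bailed out. */
typedef int (*extract_fn)(const char *path);

#define MAX_IGNORED 32

struct ignored_types {
    /* MIME types that were ignored once; the strings are assumed to
     * outlive the set (e.g. literals or entries of a MIME map). */
    const char *mime[MAX_IGNORED];
    size_t count;
};

static bool is_ignored(const struct ignored_types *set, const char *mime)
{
    for (size_t i = 0; i < set->count; i++)
        if (strcmp(set->mime[i], mime) == 0)
            return true;
    return false;
}

/* Run the extractor unless an earlier file of the same MIME type was
 * already ignored, in which case neither the library nor the file is
 * touched. */
static int extract_unless_ignored(struct ignored_types *set,
                                  const char *mime, const char *path,
                                  extract_fn extract)
{
    if (is_ignored(set, mime))
        return EXTRACT_IGNORED;

    int found = extract(path);
    if (found == EXTRACT_IGNORED && set->count < MAX_IGNORED)
        set->mime[set->count++] = mime;   /* remember the miss */
    return found;
}

/* Stand-in extractor for demonstration: ignores every file and counts
 * how often it is called. */
static int extract_calls = 0;
static int fake_extract(const char *path)
{
    (void)path;
    extract_calls++;
    return EXTRACT_IGNORED;
}
```

Whether one ignored file is enough evidence to skip the rest of its type
is the application's call; the point is only that the library gives it
the information to decide.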

An alternate approach would be to provide a "fast path" API by adding
another getKeywords3() function that takes as input the MIME type of the
file, and only calls the extractor that has claimed that MIME type,
instead of calling all of the extractors.  While this still wastes one
I/O on a mis-identified or ignored file, it saves quite a bit on all
others (especially those whose extractors are near the end of the
chain), and doesn't require modifying the existing API.
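
The difference between the two paths can be sketched as below.  This is
not libextractor's actual plugin interface: `struct plugin`,
`extract_all` and `extract_for_mime` are made-up names, the plugins are
stubs, and the slow path models the chain as I described it above
(plugins run in order until one reports success).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical plugin entry point: reads the file and returns the
 * number of keywords it found (0 if the file is not its type). */
typedef int (*plugin_fn)(const char *path);

struct plugin {
    const char *mime;   /* MIME type this plugin claims */
    plugin_fn run;
};

/* Slow path: try each plugin in order until one reports success.
 * Every plugin tried performs its own I/O on the file. */
static int extract_all(const struct plugin *plugins, size_t n,
                       const char *path)
{
    for (size_t i = 0; i < n; i++) {
        int found = plugins[i].run(path);
        if (found > 0)
            return found;
    }
    return 0;
}

/* Fast path: the caller supplies the MIME type, and only the plugin
 * that claims it is run. */
static int extract_for_mime(const struct plugin *plugins, size_t n,
                            const char *mime, const char *path)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(plugins[i].mime, mime) == 0)
            return plugins[i].run(path);
    return 0;   /* no plugin claims this type */
}

/* Stub plugins for demonstration; each call counts as one file read. */
static int plugin_calls = 0;
static int png_plugin(const char *path)
{
    (void)path;
    plugin_calls++;
    return 0;   /* not a PNG: bail out */
}
static int foo_plugin(const char *path)
{
    (void)path;
    plugin_calls++;
    return 2;   /* pretend two keywords were found */
}
```

With the png plugin ahead of the foo plugin in the table, the slow path
runs both stubs for a .foo file while the fast path runs only the
second.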

Does this make more sense?  I feel like I missed the mark with my first
mail.

Basically, the whole point is that if I am confident that I already know
the type of the file, it seems that libextractor is quite wasteful with
I/O in this case.  Its current design assumes that only the plugins --
which all perform I/O on the file and must all be executed in sequence
until one reports success -- can *truly* know the type of the file.  My
opinion is that it should be possible to make this decision outside of
the library (as in the fast path getKeywords3()), or for the application
to be able to decide not to call the library in the future when it
produced no results for a past file (as in a distinguishable "ignored"
result from getKeywords()).

-- 
Ryan Underwood, <address@hidden>
