libextractor

Re: [libextractor] return of getKeywords()


From: Christian Grothoff
Subject: Re: [libextractor] return of getKeywords()
Date: Fri, 30 Mar 2007 12:41:23 -0600
User-agent: KMail/1.9.5

On Thursday 29 March 2007 09:04, Ryan Underwood wrote:
> Hello,
>
> It seems reasonable that getKeywords should have a different return type
> in the two cases where:
> 1) A file was handled, but had no keywords
> 2) No plugin could handle the file.
>
> The distinction is important for certain programs (which are indexing
> metadata based on file extensions, or on libmagic determine MIME type).
> In the first case, the program using libextractor would continue to try
> to extract data for that file type.  In the second case, the program can
> ignore all future files of that type, resulting in a significant
> speedup.

First of all, it is not clear that this is true: LE may be able to handle some 
(versions/variants) of a particular mime type, but not others.  Also, if an 
extractor succeeds, it generally adds *at least* the MIME type of the file, 
so getting nothing back is almost certainly a sign of complete failure -- but 
a client cannot be sure that other files with the same MIME type will 
actually fail.

Furthermore, determining the MIME type itself is not always guaranteed to work 
(using whatever approach -- file extensions, libmagic, file -- all of these 
can fail in corner cases).

Finally, the performance should not be an issue: libextractor *first* attempts 
to determine the MIME type and uses this to possibly exclude running certain 
plugins.  Furthermore, all extractors also first check if they actually might 
apply to the given file, and bail out quickly if not.  As a result, 
extracting metadata for a particular file is usually extremely fast.  Most of 
the time is most likely spent loading libextractor's plugins the first time 
(which is a one-time cost, not per-file, and would thus not be reduced by your 
proposal) and by the OS doing IO for the file itself. 
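
To illustrate the two-stage filtering described above (MIME type first, then 
a cheap per-extractor check), here is a minimal sketch in Python.  It is not 
libextractor's actual code or API: the function names, the plugin table and 
the two magic-number checks are all hypothetical and only meant to show why 
a non-matching file costs almost nothing.

```python
# Illustrative sketch only: every name here is hypothetical and is NOT part
# of libextractor's real API.

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
PDF_MAGIC = b"%PDF-"


def detect_mime(data):
    """Guess a MIME type from the leading bytes; return None if unknown."""
    if data.startswith(PNG_MAGIC):
        return "image/png"
    if data.startswith(PDF_MAGIC):
        return "application/pdf"
    return None


def png_extractor(data):
    # Cheap applicability check: bail out before doing any real parsing.
    if not data.startswith(PNG_MAGIC):
        return []
    return [("mimetype", "image/png")]


def pdf_extractor(data):
    # Same pattern: a few byte comparisons decide whether to continue.
    if not data.startswith(PDF_MAGIC):
        return []
    version = data[5:8].decode("ascii", errors="replace")
    return [("mimetype", "application/pdf"), ("format", "PDF " + version)]


# Each (hypothetical) plugin declares the MIME types it may apply to.
PLUGINS = [
    (png_extractor, {"image/png"}),
    (pdf_extractor, {"application/pdf"}),
]


def get_keywords(data):
    mime = detect_mime(data)
    keywords = []
    for extractor, mimes in PLUGINS:
        # Stage 1: a known MIME type rules out plugins that cannot apply.
        if mime is not None and mime not in mimes:
            continue
        # Stage 2: the plugin still checks the data itself and may bail out.
        keywords.extend(extractor(data))
    return keywords
```

In this sketch, data that no plugin recognizes passes through both extractors 
and yields an empty list after only a handful of byte comparisons, which is 
the sense in which the per-file cost of a "miss" is negligible compared to 
reading the file from disk.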

Most of this IO overhead is again already spent when you just try to run 
libmagic or any other non-trivial mime-type detector, so running "full" LE in 
addition is unlikely to show up in performance at all.

In summary, I believe your proposed trade-off -- minimal gains in performance 
vs. possibly not showing metadata that could have been found (and additional 
complexity of the API of LE and of the client code) -- is not a good one.

Best regards,

Christian



