Re: [libextractor] return of getKeywords()
From: Christian Grothoff
Subject: Re: [libextractor] return of getKeywords()
Date: Fri, 30 Mar 2007 12:41:23 -0600
User-agent: KMail/1.9.5
On Thursday 29 March 2007 09:04, Ryan Underwood wrote:
> Hello,
>
> It seems reasonable that getKeywords should have a different return type
> in the two cases where:
> 1) A file was handled, but had no keywords
> 2) No plugin could handle the file.
>
> The distinction is important for certain programs (which are indexing
> metadata based on file extensions, or on libmagic determine MIME type).
> In the first case, the program using libextractor would continue to try
> to extract data for that file type. In the second case, the program can
> ignore all future files of that type, resulting in a significant
> speedup.
First of all, it is not clear that this is true: LE may be able to handle some
(versions/variants) of a particular MIME type, but not others. Also, if an
extractor succeeds, it generally adds *at least* the MIME type of the file,
so getting nothing back is almost certainly a sign of complete failure -- but
a client cannot be sure that other files with the same MIME type will
actually fail.
Furthermore, determining the MIME type itself is not always guaranteed to work
(using whatever approach -- file extensions, libmagic, file -- all of these
can fail in corner cases).
Finally, the performance should not be an issue: libextractor *first* attempts
to determine the MIME type and uses this to possibly exclude running certain
plugins. Furthermore, all extractors also first check if they actually might
apply to the given file, and bail out quickly if not. As a result,
extracting metadata for a particular file is usually extremely fast. Most of
the time is likely spent loading libextractor's plugins the first time
(which is a one-time cost, not per-file, and would thus not be reduced by your
proposal) and on the OS doing IO for the file itself.
Most of this IO overhead is again already spent when you just try to run
libmagic or any other non-trivial mime-type detector, so running "full" LE in
addition is unlikely to show up in performance at all.
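To illustrate the two filtering steps described above, here is a rough sketch
in C. It is not the actual libextractor code or plugin API: the struct
SketchPlugin, png_sketch_extract() and sketch_run() are hypothetical names
made up for this example, and the "extractor" only checks the PNG signature
instead of parsing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical plugin descriptor; NOT the real libextractor plugin API. */
struct SketchPlugin
{
  const char *mime;   /* MIME type this plugin handles, NULL for "any" */
  int (*extract) (const unsigned char *data, size_t size);
};

/* Hypothetical extractor: returns the number of keywords found (0 if the
   plugin does not apply to the data). */
static int
png_sketch_extract (const unsigned char *data, size_t size)
{
  static const unsigned char png_magic[8] =
    { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n' };

  assert (NULL != data);
  /* Cheap applicability check: compare 8 bytes and bail out quickly. */
  if ( (size < sizeof (png_magic)) ||
       (0 != memcmp (data, png_magic, sizeof (png_magic))) )
    return 0;
  /* Real parsing would go here; the sketch only reports the MIME type. */
  return 1;
}

/* Runs every plugin that is not excluded by the detected MIME type
   (NULL means detection failed, so nothing is excluded) and returns the
   total number of keywords found. */
static int
sketch_run (const struct SketchPlugin *plugins, size_t count,
            const char *detected_mime,
            const unsigned char *data, size_t size)
{
  int found = 0;
  size_t i;

  for (i = 0; i < count; i++)
    {
      if ( (NULL != detected_mime) &&
           (NULL != plugins[i].mime) &&
           (0 != strcmp (detected_mime, plugins[i].mime)) )
        continue; /* excluded by MIME type: the plugin never runs */
      found += plugins[i].extract (data, size);
    }
  return found;
}
```

In this sketch a file that no plugin can handle costs one string comparison
or one 8-byte memcmp per plugin, which is negligible next to the IO needed to
read the file in the first place.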
In summary, I believe your proposed trade-off -- minimal gains in performance
vs. possibly not showing metadata that could have been found (and additional
complexity of the API of LE and of the client code) -- is not a good one.
Best regards,
Christian