Re: [libextractor] return of getKeywords()

From:	Christian Grothoff
Subject:	Re: [libextractor] return of getKeywords()
Date:	Fri, 30 Mar 2007 13:42:23 -0600
User-agent:	KMail/1.9.5

On Friday 30 March 2007 13:12, Ryan Underwood wrote:
> On Fri, Mar 30, 2007 at 12:41:23PM -0600, Christian Grothoff wrote:
> > First of all, it is not clear that this is true: LE may be able to handle
> > some (versions/variants) of a particular mime type, but not others. 
> > Also, if an extractor succeeds, it generally adds *at least* the MIME
> > type of the file,
>
> Where?  I'm only looking at the plugin interface, and it seems most of
> the plugins don't do anything regarding a MIME type - usually testing to
> see if this is the right type of file and then bailing out if not.

Here's a canonical example (similar code all over the place):

struct EXTRACTOR_Keywords *
libextractor_jpeg_extract(const char * filename,
                          unsigned char * data,
                          size_t size,
                          struct EXTRACTOR_Keywords * prev) {
  int c1;
  int c2;
  unsigned char * end;
  struct EXTRACTOR_Keywords * result;

  if (size < 0x12)
    return prev;
  result = prev;
  end = &data[size];
  c1 = NEXTC(&data, end);
  c2 = NEXTC(&data, end);
  if ( (c1 != 0xFF) || (c2 != M_SOI) )
    return result; /* not a JPEG */
  result = addKeyword(EXTRACTOR_MIMETYPE,
                      strdup("image/jpeg"),
                      result);
  // ...
}

As you can see, the plugin first checks if this is a JPEG, and if it is, 
immediately adds the MIME type.  So even if the JPEG does not contain any 
other metadata, you'll always get at least the MIME type.
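
The same check-then-add pattern can be tried outside of LE with a small 
self-contained sketch.  Everything here is a simplified stand-in and not the 
real plugin API: "struct keyword", "add_keyword", "free_keywords" and 
"jpeg_extract" are hypothetical names, and the MIME type is stored under a 
plain string key instead of EXTRACTOR_MIMETYPE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct EXTRACTOR_Keywords (an assumption). */
struct keyword {
  const char *type;
  char *value;
  struct keyword *next;
};

/* Prepend a (type, value) pair; the value is copied.
 * On allocation failure the list is returned unchanged. */
static struct keyword *
add_keyword(const char *type, const char *value, struct keyword *next)
{
  struct keyword *k = malloc(sizeof(*k));
  size_t len = strlen(value) + 1;

  if (k == NULL)
    return next;
  k->value = malloc(len);
  if (k->value == NULL) {
    free(k);
    return next;
  }
  memcpy(k->value, value, len);
  k->type = type;
  k->next = next;
  return k;
}

/* Release a whole keyword list. */
static void
free_keywords(struct keyword *list)
{
  while (list != NULL) {
    struct keyword *next = list->next;
    free(list->value);
    free(list);
    list = next;
  }
}

/* A JPEG starts with the SOI marker 0xFF 0xD8.  If the marker is missing,
 * the previous list is handed back untouched; otherwise the MIME type is
 * added before any further parsing would happen. */
static struct keyword *
jpeg_extract(const unsigned char *data, size_t size, struct keyword *prev)
{
  assert(data != NULL || size == 0);
  if (size < 0x12 || data[0] != 0xFF || data[1] != 0xD8)
    return prev; /* not a JPEG */
  return add_keyword("mimetype", "image/jpeg", prev);
}
```

A caller therefore cannot tell "not my file type" from "nothing added" by 
the return value alone; it only sees whether the list grew.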

> > Furthermore, determining the MIME type itself is not always guaranteed to
> > work (using whatever approach -- file extensions, libmagic, file -- all
> > of these can fail in corner cases).
>
> Correct, that's why the application should get to decide if it wants to
> trade off identification precision for speed.
>
> > Finally, the performance should not be an issue: libextractor *first*
> > attempts to determine the MIME type and uses this to possibly exclude
> > running certain plugins.  Furthermore, all extractors also first check if
> > they actually might apply to the given file, and bail out quickly if not.
> >  As a result, extracting metadata for a particular file is usually
> > extremely fast.  Most of the time is most likely spent loading
> > libextractor's plugins the first time (which is a one-time cost, not
> > per-file, and would thus not be reduced by your proposal) and by the OS
> > doing IO for the file itself.
>
> Yes, but by design the libextractor plugins are always performing I/O on
> the file by opening and reading it.  This is where I am running into a
> performance problem.

Not true.  Use the "getKeywords2" API function.  You should be able to mmap 
the file once, then do whatever you want (libmagic, libextractor).  With 
getKeywords2, LE will NOT do any additional IO/system calls (and obviously 
for determining the mime-type, you need to have somebody open/read the file).
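
A minimal sketch of the "map once, use many times" idea, assuming a POSIX 
system.  The exact signature of getKeywords2 is not shown in this mail, so 
the sketch does not call it; instead each user of the buffer is a 
hypothetical "buffer_consumer" callback, and "sniff_jpeg" and "count_bytes" 
are toy stand-ins for a MIME detector and an extractor call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical callback type: anything that works on an in-memory buffer. */
typedef void (*buffer_consumer)(const unsigned char *data,
                                size_t size,
                                void *cls);

/* Toy state shared by the two example consumers below. */
struct scan_state {
  int is_jpeg;
  size_t bytes_seen;
};

/* Stand-in for a MIME detector: checks the JPEG SOI marker. */
static void
sniff_jpeg(const unsigned char *data, size_t size, void *cls)
{
  struct scan_state *s = cls;
  s->is_jpeg = (size >= 2 && data[0] == 0xFF && data[1] == 0xD8);
}

/* Stand-in for an extractor call: only records how much data it saw. */
static void
count_bytes(const unsigned char *data, size_t size, void *cls)
{
  struct scan_state *s = cls;
  (void) data;
  s->bytes_seen += size;
}

/* Map `filename` read-only once and pass the same buffer to every consumer.
 * Returns 0 on success, -1 on error (including empty files). */
static int
with_mapped_file(const char *filename,
                 buffer_consumer *consumers,
                 size_t n_consumers,
                 void *cls)
{
  struct stat st;
  void *map;
  size_t i;
  int fd;

  assert(consumers != NULL || n_consumers == 0);
  fd = open(filename, O_RDONLY);
  if (fd < 0)
    return -1;
  if (fstat(fd, &st) != 0 || st.st_size <= 0) {
    close(fd);
    return -1;
  }
  map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd); /* the mapping remains valid after the descriptor is closed */
  if (map == MAP_FAILED)
    return -1;
  for (i = 0; i < n_consumers; i++)
    consumers[i]((const unsigned char *) map, (size_t) st.st_size, cls);
  munmap(map, (size_t) st.st_size);
  return 0;
}
```

In a real client, one consumer would wrap the MIME detector and another 
would hand the buffer to getKeywords2, so the file is opened and mapped 
exactly once.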

> > Most of this IO overhead is again already spent when you just try to run
> > libmagic or any other non-trivial mime-type detector, so running "full"
> > LE in addition is unlikely to show up in performance at all.
>
> Not per-file.  Since in my particular application I can be assured that
> every file with extension .foo is actually an application/foo file,
> there is only a preliminary libmagic check for file extensions that have
> not yet had their MIME types identified and entered into the search
> engine's map.  The libmagic thus performs I/O once *per encountered
> extension*, not several times *per-file* as libextractor's plugin chain
> does.

LE only does IO once per file, not several times per file.  Also, say you have 
a ".pdf" file and the mime-type is correct.  How do you propose to handle 
different PDF versions (say LE supports up to version 1.4, but some of your 
files are PDF 1.5 or 1.6)?  Mime-types are not always sufficient to 
determine whether an LE plugin applies (in particular for certain plugins, 
like printable).

> > In summary, I believe your proposed trade-off -- minimal gains in
> > performance vs. possibly not showing metadata that could have been found
> > (and additional complexity of the API of LE and of the client code) is
> > not a good one.
>
> I don't believe I suggested that metadata should be thrown out by API
> design.   All I suggested is that libextractor should somehow make it
> clear to the caller if his file was ignored by all plugins.

In general, there are plugins that will never ignore a file (printable, hash, 
filename).  Besides, defining what "ignored" truly means is difficult as 
well.  Say you have an mp3 file.  The ID3 extractor may "ignore" it -- 
because there is no ID3 tag at the end.  However, for the next mp3-file, 
there is an ID3 tag and the extractor takes it.  I think your basic idea 
about not running LE on certain files based on mime-types is somewhat flawed.
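
To make the mp3 case concrete: an ID3v1 tag is the last 128 bytes of the 
file and starts with the three bytes "TAG".  The sketch below is not LE's 
ID3 plugin; "has_id3v1_tag" is a hypothetical helper that only shows why two 
files of the same MIME type can give different answers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer ends with an ID3v1 tag, i.e. the final
 * 128 bytes begin with "TAG"; returns 0 otherwise. */
static int
has_id3v1_tag(const unsigned char *data, size_t size)
{
  assert(data != NULL || size == 0);
  if (size < 128)
    return 0;
  return memcmp(data + size - 128, "TAG", 3) == 0;
}
```

Both kinds of file are "audio/mpeg", yet only one yields keywords, so a 
per-MIME-type "ignored" verdict learned from the first file would be wrong 
for the second.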

> Then he can 
> choose himself if he cares about the risk of possibly missing data from
> later files of the same type.  For example, in a file archive where most
> files have exactly the same structure, and where I/O is expensive
> (across a network filesystem), this would be a quite big optimization -
> if one file is ignored, 10,000 others of the same type will also be
> ignored, so why bother calling libextractor at all after one try?

If you can rely on extensions being accurate and can live with occasionally 
being wrong (not running LE even though it would have given you metadata, 
because your heuristic was wrong), then I can see this kind of optimization 
helping performance.

> An alternate approach would be to provide a "fast path" API by adding
> another getKeywords3() function that takes as input the MIME type of the
> file, and only calls the extractor that has claimed that MIME type,
> instead of calling all of the extractors.  While this still wastes one
> I/O on a mis-identified or ignored file, it saves quite a bit on all
> others (especially those whose extractors are near the end of the
> chain), and doesn't require modifying the existing API.

Actually, you can already do this with the existing API.  All you need to do 
is (manually) construct a mapping of mime-types/file-extensions to LE plugin 
names (based on your assumptions of what mime-types/extensions could possibly 
be handled by a particular plugin) and then just 
use "EXTRACTOR_addLibrary(NULL, "pluginname")" to load just the right plugin 
for each extension (this also avoids the cost of loading useless plugins!). 
Keep the resulting ExtractorLists in memory (and re-use them for all files of 
that type/extension).  So this can easily be done without changing the API at 
all.
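
A rough sketch of such a client-side table, with lazy loading and caching 
per extension.  To keep it self-contained it does not link against LE: the 
loader is passed in as a hypothetical "plugin_loader" callback (in a real 
client it would wrap the EXTRACTOR_addLibrary call above), "stub_loader" is 
a test stand-in, and the plugin names in the table are illustrative guesses 
that must be checked against the installed plugins.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* One row of the client-maintained extension-to-plugin mapping. */
struct plugin_route {
  const char *extension;   /* without the leading dot */
  const char *plugin_name; /* ASSUMED name, verify for your installation */
  const void *extractors;  /* cached handle, loaded on first use */
};

/* Hypothetical loader callback: returns an opaque extractor-list handle. */
typedef const void *(*plugin_loader)(const char *plugin_name);

static struct plugin_route routes[] = {
  { "jpg",  "libextractor_jpeg", NULL },
  { "pdf",  "libextractor_pdf",  NULL },
  { "html", "libextractor_html", NULL },
};

/* Stand-in loader for experiments: counts calls and uses the name itself
 * as the handle. */
static int load_calls;

static const void *
stub_loader(const char *plugin_name)
{
  load_calls++;
  return plugin_name;
}

/* Return the cached extractor list for the file's extension, loading it
 * on first use.  Returns NULL if the extension is unknown, in which case
 * the caller decides whether to skip the file or run the full plugin set. */
static const void *
extractors_for(const char *filename, plugin_loader load)
{
  const char *dot;
  size_t i;

  assert(filename != NULL && load != NULL);
  dot = strrchr(filename, '.');
  if (dot == NULL || dot[1] == '\0')
    return NULL;
  for (i = 0; i < sizeof(routes) / sizeof(routes[0]); i++) {
    if (strcasecmp(dot + 1, routes[i].extension) != 0)
      continue;
    if (routes[i].extractors == NULL)
      routes[i].extractors = load(routes[i].plugin_name);
    return routes[i].extractors;
  }
  return NULL;
}
```

The table encodes the application's own heuristic, so the risk of missing 
metadata for a mis-mapped extension stays with the application.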

> Does this make more sense?  I feel like I missed the mark with my first
> mail.
>
> Basically, the whole point is that if I am confident that I already know
> the type of the file, it seems that libextractor is quite wasteful with
> I/O in this case.  Its current design assumes that only the plugins --
> which all perform I/O on the file and must all be executed in sequence
> until one reports success -- can *truly* know the type of the file.

Well, the above optimization allows you to avoid calling plugins that you do 
not want to call.  But again, note that no IO is done if you use 
getKeywords2, and even with getKeywords, IO is only done once 
per "getKeywords" call, never once per plugin.

> My 
> opinion is that it should be possible to make this decision outside of
> the library (as in the fast path getKeywords3()), or for the application
> to be able to decide not to call the library in the future when it
> produced no results for a past file (as in a distinguishable "ignored"
> result from getKeywords()).

I think that by loading precisely the desired set of plugins, you can achieve 
what you intend.  The problem is that the plugins cannot ever really be sure 
which mime-types they apply to (the HTML plugin may work for PHP code, for 
example), so I'd say your code should make this kind of heuristic decision if 
this kind of trade-off is actually needed.

Best regards,

Christian
