Re: [libextractor] return of getKeywords()

From:	Ryan Underwood
Subject:	Re: [libextractor] return of getKeywords()
Date:	Fri, 30 Mar 2007 15:56:14 -0500
User-agent:	Mutt/1.5.13 (2006-08-11)

On Fri, Mar 30, 2007 at 01:42:23PM -0600, Christian Grothoff wrote:
> >
> > Where?  I'm only looking at the plugin interface, and it seems most of
> > the plugins don't do anything regarding a MIME type - usually testing to
> > see if this is the right type of file and then bailing out if not.
> 
> Here's a canonical example (similar code all over the place):
> 
> struct EXTRACTOR_Keywords * 
> libextractor_jpeg_extract(const char * filename,
>                           unsigned char * data,
>                           size_t size,
>                           struct EXTRACTOR_Keywords * prev) {
>   int c1;
>   int c2;
>   unsigned char * end;
>   struct EXTRACTOR_Keywords * result;
> 
>   if (size < 0x12)
>     return prev;
>   result = prev;
>   end = &data[size];
>   c1 = NEXTC(&data, end);
>   c2 = NEXTC(&data, end);
>   if ( (c1 != 0xFF) || (c2 != M_SOI) )
>     return result; /* not a JPEG */
>   result = addKeyword(EXTRACTOR_MIMETYPE,
>                       strdup("image/jpeg"),
>                       result);
>    // ...
> }
> 
> As you can see, the plugin first checks if this is a JPEG, and then if it is, 
> instantly adds the MIME type.  So even if the JPEG does not contain any 
> other metadata, you'll always get at least the MIME type.

I was referring to the "return prev" and friends after preliminary
sanity checks.  This is still I/O, but I see the point; the file is
already opened, so most of the damage is done by that point, especially
on a network filesystem which caches the whole file on open.

> > Yes, but by design the libextractor plugins are always performing I/O on
> > the file by opening and reading it.  This is where I am running into a
> > performance problem.
> 
> Not true.  Use the "getKeywords2" API function.  You should be able to mmap 
> the file once, then do whatever you want (libmagic, libextractor).  With 
> getKeywords2, LE will NOT do any additional IO/system calls (and obviously 
> for determining the mime-type, you need to have somebody open/read the file).

Not if the mime-type is assumed from the extension - which is braindead
in the GENERAL case, but not in my case.

> LE only does IO once per file, not several times per-file.

OK, mmap saves you here.  So my concern is reduced to the file open (and
initial caching).
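
To make the "map once, inspect many times" idea concrete, here is a
minimal sketch. It does not call libextractor itself: buffer_consumer,
count_bytes and with_mapped_file are hypothetical names, and a real
consumer would be a thin wrapper around getKeywords2 or a libmagic
buffer call. POSIX mmap is assumed.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical consumer signature: anything that inspects an in-memory
 * buffer without doing file I/O of its own. */
typedef void (*buffer_consumer)(const unsigned char *data, size_t size,
                                void *ctx);

/* Trivial example consumer: adds the buffer size to a size_t counter. */
void count_bytes(const unsigned char *data, size_t size, void *ctx)
{
    (void)data;
    *(size_t *)ctx += size;
}

/* Open and map `path` read-only exactly once, then hand the same mapping
 * to every consumer in turn.  Returns 0 on success, -1 if the file cannot
 * be opened, is empty, or cannot be mapped. */
int with_mapped_file(const char *path, buffer_consumer *consumers,
                     size_t n, void *ctx)
{
    struct stat st;
    void *map;
    size_t size;
    size_t i;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size = (size_t)st.st_size;
    map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;
    for (i = 0; i < n; i++)
        consumers[i]((const unsigned char *)map, size, ctx);
    munmap(map, size);
    return 0;
}
```

The only open() happens in with_mapped_file; every consumer sees the
same bytes, so the per-file cost is one open plus whatever paging the
consumers trigger.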

> Also, say you have a ".pdf" file and the mime-type is correct.  How do
> you propose to handle different PDF versions (say LE supports up to
> version 1.4, but some of your files are PDF 1.5 or 1.6)?

If I know that I potentially have many files of the same type from
different creators, then I don't write a stupid application that assumes
they are all 1.4 and gives up on extraction after the first one.
But if I know that all my PDF files were created under the same
circumstances, then I save work by not attempting to extract the rest of
the files.

> Mime-types are not always sufficient to completely determine that an
> LE plugin applies or not (in particular for certain plugins, like
> printable).

Correct.  Not always.

> Actually, you can already do this with the existing API.  All you need is 
> (manually) construct a mapping of mime-types/file-extensions to LE plugin 
> names (based on your assumptions of what mime-types/extensions could possibly 
> be handled by a particular plugin) and then just 
> use "EXTRACTOR_addLibrary(NULL, "pluginname")" to load just the right plugin 
> for each extension (also avoids the cost of loading useless plugins!). Keep 
> the resulting ExtractorList's in memory (and re-use for all files of that 
> type/extension).  So this can easily be done without changing the API at all.

This sounds good; are the LE plugin names static enough to rely upon in
compiled code?
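
A minimal sketch of the extension-to-plugin table described above. The
plugin names in the table are assumptions for illustration only (whether
the real names are stable is exactly the open question), and
plugin_for_filename is a hypothetical helper, not part of libextractor.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strcasecmp (POSIX) */

struct ext_plugin {
    const char *ext;    /* file extension without the dot */
    const char *plugin; /* plugin name to load for that extension */
};

/* Assumption: these plugin names are illustrative, not verified. */
static const struct ext_plugin ext_plugin_map[] = {
    { "jpg",  "libextractor_jpeg" },
    { "jpeg", "libextractor_jpeg" },
    { "pdf",  "libextractor_pdf"  },
};

/* Return the plugin name mapped to the extension of `filename`
 * (case-insensitive), or NULL if there is no extension or no mapping. */
const char *plugin_for_filename(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    size_t i;

    if (dot == NULL || dot[1] == '\0')
        return NULL;
    for (i = 0; i < sizeof ext_plugin_map / sizeof ext_plugin_map[0]; i++) {
        if (strcasecmp(dot + 1, ext_plugin_map[i].ext) == 0)
            return ext_plugin_map[i].plugin;
    }
    return NULL;
}
```

The returned name would then be passed to EXTRACTOR_addLibrary(NULL, ...)
as suggested, with the resulting list cached per extension. A NULL
result means the file is skipped without ever being opened.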

> Well, the above optimization allows you to avoid calling plugins that you do 
> not like to call.  But again, note that no IO is done if you use 
> getKeywords2, and even with getKeywords, IO is only done once 
> per "getKeywords" call, never once per plugin.

Yeah, as I said above, noticing that mmap is used instead of reading
into a buffer alleviates another concern.

My main concern is still avoiding that initial file open, because it is
quite expensive here.  In order to do that in my application, I have to
be able to tell if libextractor handled the file or ignored it.

Based on your previous comments noting that the plugins set the mime
type, it seems like a keyword of type EXTRACTOR_MIMETYPE will only ever
be set if some plugin claims the file.  So could I then assume that if no
keywords of type EXTRACTOR_MIMETYPE exist, then the file was effectively
ignored by all plugins that were loaded?
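
A minimal sketch of the check in question, assuming the keyword list is
a singly linked list of typed entries. The types below are stand-ins
with hypothetical names, not the real libextractor declarations, and
the check is only as reliable as the assumption that every plugin adds
a MIME type keyword when it claims a file.

```c
#include <stddef.h>

/* Stand-in keyword types; KW_MIMETYPE plays the role of
 * EXTRACTOR_MIMETYPE.  Hypothetical, for illustration only. */
enum kw_type { KW_UNKNOWN = 0, KW_MIMETYPE = 1 };

/* Stand-in for one node of the keyword list returned by extraction. */
struct kw_node {
    const char *keyword;
    enum kw_type type;
    struct kw_node *next;
};

/* Return 1 if any entry in the list is a MIME type keyword, i.e. some
 * plugin recognized the file; return 0 for an empty list or a list
 * without such an entry. */
int file_was_claimed(const struct kw_node *list)
{
    for (; list != NULL; list = list->next) {
        if (list->type == KW_MIMETYPE)
            return 1;
    }
    return 0;
}
```

If file_was_claimed returns 0 for the first file of a batch known to be
uniform, the application could skip opening the rest of that batch.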

-- 
Ryan Underwood, <address@hidden>
