[GNUnet-SVN] r9950 - Extractor-docs/WWW
From: gnunet
Subject: [GNUnet-SVN] r9950 - Extractor-docs/WWW
Date: Fri, 1 Jan 2010 14:47:21 +0100
Author: grothoff
Date: 2010-01-01 14:47:21 +0100 (Fri, 01 Jan 2010)
New Revision: 9950
Modified:
Extractor-docs/WWW/documentation.html
Extractor-docs/WWW/index.html
Log:
docu
Modified: Extractor-docs/WWW/documentation.html
===================================================================
--- Extractor-docs/WWW/documentation.html 2010-01-01 13:24:36 UTC (rev 9949)
+++ Extractor-docs/WWW/documentation.html 2010-01-01 13:47:21 UTC (rev 9950)
@@ -17,11 +17,11 @@
<link rel="SHORTCUT ICON" href="http://gnunet.org/libextractor/favicon.ico">
</head>
<body>
-
-
<table width="99%" border="0" cellpadding="0" cellspacing="0">
-<tbody><tr><td colspan="2" width="99%" bgcolor="#99bbff"
align="center">libextractor - Documentation</td></tr>
-<tr><td valign="top"><table width="15%" border="0" cellpadding="2"
cellspacing="3">
+<tbody>
+<tr><td colspan="2" width="99%" bgcolor="#99bbff" align="center">libextractor
- Documentation</td></tr>
+<tr><td valign="top">
+<table width="15%" border="0" cellpadding="2" cellspacing="3">
<tbody><tr><th nowrap="nowrap" bgcolor="99BBFF"><a
href="libextractor.html">Home</a></th></tr>
<tr><th nowrap="nowrap" bgcolor="99BBFF"><a
href="download.html">Download</a></th></tr>
<tr><th nowrap="nowrap" bgcolor="99BBFF"><a
href="documentation.html">Documentation</a></th></tr>
@@ -37,11 +37,11 @@
This documentation covers the major aspects of libextractor.
The man pages for <a href="man/extract.html">extract</a> and <a
href="man/libextractor.html">libextractor</a> are also on-line.
<br>
-An article describing libextractor was published in the <a
href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a
href="http://www.linuxjournal.com/article/7552">here</a>.
+An article describing libextractor was published in the <a
href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a
href="http://www.linuxjournal.com/article/7552">here</a>. That article
describes the API for versions 0.0.0 to 0.5.23 and not the more recent 0.6.x
API.
<a name="copyright"></a>
<h2>Copyright and Contributions</h2>
-libExtractor is released under the GNU General Public License.
+libextractor is released under the GNU General Public License.
All contributions must thus be put under the <a
href="http://www.gnu.org/copyleft/gpl.html">GNU Public License (GPL)</a> or a
compatible license.
<h3>Mailing lists</h3>
@@ -64,8 +64,8 @@
<p>
Development of libextractor, and GNU in general, is a volunteer
effort, and you can contribute. For information, please
-read <a href="/help/">How to help GNU</a>. If you'd like to get
-involved, it's a good idea to join the mailing list (see above).
+read <a href="/help/">How to help GNU</a>. If you would like to get
+involved, it is a good idea to join the mailing list (see above).
</p>
<dl>
@@ -101,13 +101,10 @@
<pre>
# apt-get install libextractor-dev extract
</pre>
-If you want to compile libextractor from source you will need an
-unusual amount of memory: 256 MB system memory is roughly the minimum,
-since gcc will take about 200 MB to compile one of the plugins.
-Otherwise, compiling by hand follows the usual sequence:
+Compiling by hand follows the usual sequence:
<pre>
-$ tar xzvf libextractor.x.x.x.tar.gz
-$ cd libextractor.x.x.x
+$ tar xzvf libextractor.x.y.z.tar.gz
+$ cd libextractor.x.y.z
$ ./configure
$ make
# make install
@@ -124,11 +121,11 @@
<p>
After installing libextractor, the extract tool can be used to obtain
-meta-data from documents. By default, the extract tool uses the
-canonical set of plugins, which consists of all file-format-specific
+meta data from documents. By default, the extract tool uses the
+canonical set of plugins, which consists of all format-specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin. If you are a user
-of <a
href="http://www.ecst.csuchico.edu/%7Ejacobsd/bib/formats/bibtex.html">BibTeX</a>
+of <a
href="http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html">BibTeX</a>
the option <tt>-b</tt> is likely to come in handy to automatically
create bibtex entries from documents that have been properly equipped
with meta-data:
@@ -148,25 +145,7 @@
}
</pre>
</p>
-
<p>
-Another interesting option is <tt>-B LANG</tt>. This option loads one
-of the language specific (but format-agnostic) plugins. These plugins
-attempt to find plaintext in a document by matching strings in the
-document against a dictionary. If the need for 200 MB of memory to
-compile libextractor seems mysterious, the answer lies in these
-plugins. In order to be able to perform a fast dictionary search,
-a <a href="https://ng.gnunet.org/bloomfilter">bloomfilter</a>
-is created that allows fast probabilistic matching; gcc finds the
-resulting datastructure a bit hard to swallow. The option <tt>-B</tt>
-is useful for formats that are undocumented, currently unsupported or
-for full-text search. Note that the printable plugins typically print
-the entire text of the document in order.
-</p>
-
-<p>
-The supported languages at the moment are Danish (da), German (de), English
(en), Spanish (es), Italian (it) and Norvegian (no).
-Supporting other languages is merely a question of adding (free) dictionaries
in an appropriate character set.
Further options are described in the extract manpage
(<tt>man 1 extract</tt>).
</p>
<p>
@@ -175,6 +154,7 @@
<h3>Examples:</h3>
<pre>
$ extract libextractor-0.1.3-1.src.rpm
+Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group -System Environment/Libraries
@@ -191,46 +171,44 @@
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
</pre>
-<pre>$ extract extractor_logo.png
-unknown - The libextractor logo
+<pre>
+$ extract extractor_logo.png
+Keywords for file extractor_logo.png:
+image dimensions - 272x188
+thumbnail - (binary, 5932 bytes)
+image dimensions - 272x188
+thumbnail - (binary, 6427 bytes)
+image dimensions - 272x188
+thumbnail - (binary, 6427 bytes)
mimetype - image/png
+mimetype - image/png
+image dimensions - 272x188
+keywords - The libextractor logo
</pre>
-<p>
-The following is the output of extract for a Winword document using the
plaintext extractors:
-</p>
-<pre>
-$ wget -q http://www.bayern.de/HDBG/polges.doc
-$ extract -B de polges.doc | head -n 4
-unknown - FEE Politische Geschichte Bayerns
-Herausgegeben vom Haus der Geschichte als Heft
-der zur Geschichte und Kultur Redaktion Manfred
- Bearbeitung Otto Copyright Haus der Geschichte
-München Gestaltung fürs Internet Rudolf Inhalt im.
-unknown - und das Deutsche Reich.
-unknown - und seine.
-unknown - Henker im Zeitalter von Reformation und Gegenreformation.
-</pre>
<h2>Using the libextractor library</h2>
<p>
-The following listing shows the code of a minimalistic program that uses
libextractor.
-Compiling the fragment requires passing the option <tt>-lextractor</tt> to gcc.
-The <tt>EXTRACTOR_KeywordList</tt> is a simple linked list containing a
keyword and a keyword type.
-For details and additional functions for loading plugins and manipulating the
keyword list, see
+The following listing shows the code of a minimalistic program that
+uses libextractor. Compiling the fragment requires passing the
+option <tt>-lextractor</tt> to gcc. For details and additional
+functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (<tt>man 3 libextractor</tt>).
-Java programmers should note that a Java class that uses JNI to communicate
with libextractor is also available.
-Python programmers will find that libextractor (since 0.5.0) can also be used
from Python, just <tt>import Extractor</tt>.
+Java programmers should note that a Java class that uses JNI to
+communicate with libextractor is also available. Python programmers
+will find that libextractor (since 0.5.0) can also be used from
+Python, just <tt>import Extractor</tt>.
<br>
<pre>
-int main(int argc, char * argv[]) {
- EXTRACTOR_ExtractorList *extractors
- = EXTRACTOR_loadDefaultLibraries();
- EXTRACTOR_KeywordList *keywords
- = EXTRACTOR_getKeywords(extractors, argv[1]);
- EXTRACTOR_printKeywords(stdout,
- keywords);
- EXTRACTOR_freeKeywords(keywords);
- EXTRACTOR_removeAll(extractors);
+#include <extractor.h>
+
+int main(int argc, char * argv[])
+{
+ struct EXTRACTOR_PluginList *plugins
+ = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
+ EXTRACTOR_extract (plugins, argv[1],
+ NULL, 0,
+ &EXTRACTOR_meta_data_print, stdout);
+ EXTRACTOR_plugin_remove_all (plugins);
}
</pre>
</p>
@@ -277,51 +255,58 @@
<p>
The most complicated thing when writing a new plugin for libextractor is the
writing of the actual parser for a specific format.
Nevertheless, the basic pattern is always the same.
-The plugin library must be called <tt>libextractor_XXX.so</tt> where XXX
denotes the file format supported by the plugin.
-The library must export a method <tt>libextractor_XXX_extract</tt> with the
following signature:
+The plugin library must be called <tt>libextractor_XXX.so</tt> where XXX
denotes the file format supported by the plugin and
+must be placed in the plugin directory (typically
<tt>$PREFIX/lib/libextractor/</tt>).
+The library must export a method <tt>EXTRACTOR_XXX_extract</tt> with the
following signature:
<pre>
-struct EXTRACTOR_Keywords *
-libextractor_XXX_extract (char * filename,
- char * data,
- size_t size,
- struct EXTRACTOR_Keywords * prev,
- const char* options);
+int
+EXTRACTOR_XXX_extract (const char *data,
+ size_t size,
+ EXTRACTOR_MetaDataProcessor proc,
+ void *proc_cls,
+ const char* options);
</pre>
</p>
<p>
-The argument filename specifies the name of the file that is being processed.
-<tt>data</tt> is a pointer to the (typically mmapped) contents of the
-file, and size is the filesize. Most plugins to not make use of the
-filename and just directly parse data directly, staring by verifying
-that the header of the data matches the specific format.
-<tt>prev</tt> is the list of keywords that have been extracted so far by other
plugins for the file.
-The function is expected to return an updated list of keywords.
-The keywords are supposed to be converted into the UTF-8 character set by the
plugin.
-If the format does not match the expectations of the plugin, <tt>prev</tt> is
returned.
-Most plugins use a function like <tt>addKeyword</tt> to extend the list:
+<tt>data</tt> is a pointer to the contents of the
+file, and <tt>size</tt> is the number of bytes available in <tt>data</tt>. Most
+plugins start by verifying that <tt>size</tt> is sufficiently large and
+that the header of data matches the specific format.
+The <tt>extract</tt> function is expected to call <tt>proc</tt> with each
+meta data item found. <tt>proc_cls</tt> must be passed as the first
+argument to <tt>proc</tt>, the other arguments correspond to the meta data
found.
+Finally, <tt>options</tt> is an arbitrary string of options that the plugin is
+free to interpret. Most plugins ignore <tt>options</tt>.
</p>
+<p>
+If the meta data extracted is a string, it is supposed to be converted
+into the UTF-8 character set by the plugin. However, in cases where
+the character encoding used in the document is unknown, no conversion
+should be done. Binary meta data can also be extracted. Plugins
+indicate the format of the meta data using the <tt>format</tt>
+argument to <tt>proc</tt>. Supported formats are UTF-8 strings, C
+Strings (for strings of unknown encoding) and binary data. In
+addition to this rough categorization, the plugin is also supposed to
+indicate the mime type of the meta data. For strings, that mime type
+is most often <tt>text/plain</tt>. Finally, the plugin must specify
+the meta data type. Common meta data types are "author",
+"title" and "mime-type". The full signature of
+the "proc" callback is:
+</p>
<pre>
-static void addKeyword(struct EXTRACTOR_Keywords ** list,
- char * keyword,
- EXTRACTOR_KeywordType type)
-{
- EXTRACTOR_KeywordList * next;
- next = malloc(sizeof(EXTRACTOR_KeywordList));
- next->next = *list;
- next->keyword = keyword;
- next->keywordType = type;
- *list = next;
-}
+typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
+ const char *plugin_name,
+ enum EXTRACTOR_MetaType type,
+ enum EXTRACTOR_MetaFormat format,
+ const char *data_mime_type,
+ const char *data,
+ size_t data_len);
</pre>
<p>
-A typical use of <tt>addKeyword</tt> is to add the mime-type once the
-file format has been established. For example, the JPEG-extractor
-checks the first bytes of the JPEG header and then either aborts or
-claims the file to be a JPEG. Note that the <tt>strdup</tt> in the
-code is important since the string will be deallocated later,
-typically in <tt>EXTRACTOR_freeKeywords()</tt>. A list of supported
-keyword classifications (in the example <tt>EXTRACTOR_MIMETYPE</tt>)
-can be found in the <tt>extractor.h</tt> header file.
+If "proc" returns non-zero, the plugin should abort and
+return non-zero itself. The "extract" function should
+always return zero unless a call to "proc" returned
+non-zero, in which case the plugin must return 1.
</p>
</td>
</tr>
Modified: Extractor-docs/WWW/index.html
===================================================================
--- Extractor-docs/WWW/index.html 2010-01-01 13:24:36 UTC (rev 9949)
+++ Extractor-docs/WWW/index.html 2010-01-01 13:47:21 UTC (rev 9950)
@@ -2,8 +2,11 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>GNU libextractor - GNU Project - Free Software Foundation</title>
-<meta name="content-language" content="en"><meta name="language"
content="en"><meta name="description" content="a simple library for keyword
extraction"><meta name="author" content="Vids Samanta and Christian Grothoff">
-<meta name="rights" content="(C) 2002,2003,2004,2005,2006,2007,2009 by Vids
Samanta and Christian Grothoff">
+<meta name="content-language" content="en">
+<meta name="language" content="en">
+<meta name="description" content="a simple library for keyword extraction">
+<meta name="author" content="Vids Samanta and Christian Grothoff">
+<meta name="rights" content="(C) 2002,2003,2004,2005,2006,2007,2009,2010 by
Vids Samanta and Christian Grothoff">
<meta name="keywords" content="keyword, extraction, mp3, html, pdf, images,
jpeg, gif, ps, mime, real, qt, asf, mpeg, avi, riff, tiff, summary, summaries,
kbps, format, mime-type, zip, elf, doc, ppt, xls, sha-1, md5, open office, sxw,
dvi, id3, id3v2, id3v2.3, id3v2.4, thumbnails, exiv2, nsf, sid, flv, flac">
<meta name="robots" content="index,follow">
<meta name="revisit-after" content="28 days">
@@ -17,7 +20,8 @@
<table width="99%" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr><td colspan="2" width="99%" bgcolor="#99bbff" align="center">GNU
libextractor - a simple library for keyword extraction</td></tr>
-<tr><td valign="top"><table width="15%" border="0" cellpadding="2"
cellspacing="3">
+<tr><td valign="top">
+<table width="15%" border="0" cellpadding="2" cellspacing="3">
<tbody>
<tr><th nowrap="nowrap" bgcolor="99BBFF"><a
href="http://www.gnu.org/software/libextractor/">Home</a></th></tr>
<tr><td bgcolor="efefef"><a href="#about">About</a></td></tr>
@@ -43,12 +47,13 @@
libextractor can be downloaded from this site or the <a
href="http://www.gnu.org/prep/ftp.html">GNU mirrors</a>.
</p>
<p>
-The goal is to provide developers of file-sharing networks or
-WWW-indexing bots with a universal library to obtain simple keywords to
-match against queries.
-libextractor contains a shell-command <tt>extract</tt> that, similar to the
-well-known <tt>file</tt> command, can extract meta-data from a file an print
-the results to stdout.
+The goal is to provide developers of file-sharing networks, browsers
+or WWW-indexing bots with a universal library to obtain simple
+keywords and meta data to match against queries and to show to users
+instead of only relying on filenames. libextractor contains a
+shell-command <tt>extract</tt> that, similar to the
+well-known <tt>file</tt> command, can extract meta data from a file and
+print the results to stdout.
</p>
<p>
Currently, libextractor supports the following formats:
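
The plugin interface described in the documentation.html hunk above can be
sketched as follows. This is only an illustration under stated assumptions,
not code from the libextractor tree: the "demo" format, its "DEMO" magic
bytes, the mime type application/x-demo and the two callbacks are all
hypothetical, and the enum declarations are local stand-ins for what
<extractor.h> would normally provide (their names and values are assumed
here). Only the two signatures and the return-value rules are taken from
the text of the commit.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for declarations normally found in <extractor.h>.
   Enumerator names and values are assumptions made for this sketch. */
enum EXTRACTOR_MetaType { EXTRACTOR_METATYPE_MIMETYPE = 1 };
enum EXTRACTOR_MetaFormat { EXTRACTOR_METAFORMAT_UTF8 = 1 };

/* Callback signature as quoted in the documentation. */
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

/* Hypothetical plugin for a made-up format whose files begin with "DEMO". */
int
EXTRACTOR_demo_extract (const char *data,
                        size_t size,
                        EXTRACTOR_MetaDataProcessor proc,
                        void *proc_cls,
                        const char *options)
{
  static const char magic[] = "DEMO";
  static const char mime[] = "application/x-demo";

  (void) options; /* like most plugins, this one ignores its options */
  /* First verify that size is large enough and that the header matches. */
  if ( (size < sizeof (magic) - 1) ||
       (0 != memcmp (data, magic, sizeof (magic) - 1)) )
    return 0; /* not our format: nothing reported, no abort */
  /* Report the mime type; proc_cls goes first.  Counting the terminating
     NUL in data_len is an assumption of this sketch. */
  if (0 != proc (proc_cls, "demo",
                 EXTRACTOR_METATYPE_MIMETYPE,
                 EXTRACTOR_METAFORMAT_UTF8,
                 "text/plain",
                 mime, sizeof (mime)))
    return 1; /* proc asked us to abort */
  return 0;
}

/* Hypothetical processor: prints UTF-8 items and counts every item.
   cls must point to an unsigned int counter. */
int
count_and_print (void *cls, const char *plugin_name,
                 enum EXTRACTOR_MetaType type,
                 enum EXTRACTOR_MetaFormat format,
                 const char *data_mime_type,
                 const char *data, size_t data_len)
{
  unsigned int *count = cls;

  (void) type;
  (void) data_mime_type;
  if (EXTRACTOR_METAFORMAT_UTF8 == format)
    printf ("%s: %.*s\n", plugin_name, (int) data_len, data);
  (*count)++;
  return 0; /* zero: keep extracting */
}

/* Hypothetical processor that requests an abort on the first item. */
int
stop_after_first (void *cls, const char *plugin_name,
                  enum EXTRACTOR_MetaType type,
                  enum EXTRACTOR_MetaFormat format,
                  const char *data_mime_type,
                  const char *data, size_t data_len)
{
  (void) cls; (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  return 1; /* non-zero: the plugin should stop and return 1 */
}
```

Calling EXTRACTOR_demo_extract with count_and_print and a pointer to a
counter reports one item for input that starts with "DEMO" and none
otherwise; with stop_after_first the function returns 1, matching the
abort rule stated in the documentation. In a real plugin the stand-in
declarations would be replaced by #include <extractor.h>.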