gnunet-developers

Re: libextractor - key-value pairs and mime types


From: Christian Grothoff
Subject: Re: libextractor - key-value pairs and mime types
Date: Tue, 8 Feb 2022 08:59:30 +0100

Hi madmurphy,

The 'correct' place for GNU libextractor discussions would be

  https://lists.gnu.org/mailman/listinfo/libextractor

That said, with my libextractor maintainer hat on, I would say this:

On 2/7/22 10:01 PM, madmurphy wrote:
> Hi again, GNUnet people.
> 
> Is this the right place to discuss libextractor? I have two points.
> 
> #1 I often see something interesting. Key-value pairs are categorized as
> |EXTRACTOR_METATYPE_UNKNOWN|:
> 
> unknown: chroma-format=4:2:0
> unknown: bit-depth-chroma=8
> unknown: colorimetry=bt709
> unknown: stream-format=avc
> unknown: stream-format=raw
> unknown: bit-depth-luma=8
> unknown: base-profile=lc
> unknown: mpegversion=4
> unknown: profile=high
> unknown: alignment=au
> unknown: parsed=true
> unknown: framed=true
> unknown: variant=iso
> unknown: profile=lc
> unknown: level=4.1
> 
> They are often numerous, and a key-value metatype would be a really
> interesting one to have (such a pair is not really “unknown”, since
> the key is self-explanatory). Would it not make sense to add an
> |EXTRACTOR_METATYPE_KEY_VALUE_PAIR| to the list of MetaTypes?

We could do that. Sometimes I think it would be better to add new
specific LE types for some of the above, but until that is done, a
key-value pair type would at least be better than 'unknown'.
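
To illustrate what a consumer has to do today, here is a minimal sketch of
how an application might recognize such pairs among the values reported as
'unknown'. The `kv_pair` struct and `split_key_value` function are
hypothetical names made up for this example; they are not part of
libextractor, and splitting at the first '=' is only an assumption about
how these strings are formed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, NOT part of libextractor. */
struct kv_pair
{
  const char *key;   /* points into the input; NOT 0-terminated */
  size_t key_len;    /* number of bytes in the key */
  const char *value; /* 0-terminated remainder after the first '=' */
};

/* Split "key=value" at the first '='.
   Returns 1 and fills 'out' on success; returns 0 (leaving 'out'
   untouched) if there is no '=' or the key would be empty. */
static int
split_key_value (const char *data, struct kv_pair *out)
{
  const char *eq;

  assert (NULL != data);
  assert (NULL != out);
  eq = strchr (data, '=');
  if ( (NULL == eq) || (eq == data) )
    return 0;
  out->key = data;
  out->key_len = (size_t) (eq - data);
  out->value = eq + 1;
  return 1;
}
```

For `chroma-format=4:2:0` from the list above, this yields the key
`chroma-format` and the value `4:2:0`; a string without '=' is rejected.
A dedicated metatype would spare applications this kind of guessing.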

> ...
> 
>   /* generic attributes */
>   EXTRACTOR_METATYPE_UNKNOWN = 45,
>   EXTRACTOR_METATYPE_DESCRIPTION = 46,
>   EXTRACTOR_METATYPE_COPYRIGHT = 47,
>   EXTRACTOR_METATYPE_RIGHTS = 48,
>   EXTRACTOR_METATYPE_KEYWORDS = 49,
>   EXTRACTOR_METATYPE_ABSTRACT = 50,
>   EXTRACTOR_METATYPE_SUMMARY = 51,
>   EXTRACTOR_METATYPE_SUBJECT = 52,
>   EXTRACTOR_METATYPE_CREATOR = 53,
>   EXTRACTOR_METATYPE_FORMAT = 54,
>   EXTRACTOR_METATYPE_FORMAT_VERSION = 55,
>   *EXTRACTOR_METATYPE_KEY_VALUE_PAIR* = XXX,
> 
> ...
> 
> #2 I often see that files get tagged with multiple mime types according
> to libextractor:
> 
> mimetype: video/quicktime
> mimetype: video/x-h264
> mimetype: audio/mpeg
> mimetype: video/mp4

That is because different plugins (using different methods/libraries)
disagree on the 'correct' mime-type. Ideally, we'd identify which plugin
gets it wrong (and why), and unify the mime-types.

> But that never reflects reality, since files should have only one
> mime type (or at most, multiple mime types that mean the same thing).
> But then I see what happens with file names: there is only one
> |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME|, but there can be many
> |EXTRACTOR_METATYPE_FILENAME|s (in the case of archives, for example):
> 
> EXTRACTOR_METATYPE_FILENAME = 2,
> ...
> EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME = 180,
> 
> Would it not make sense to do something similar for mime types? Only one
> “original mime type”, and an infinity of secondary mime types…?
> 
> EXTRACTOR_METATYPE_MIMETYPE = 1,
> ...
> *EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE* = XXX,

I guess it depends. If this is for archives where files _inside_ the
archive are given mime-types, then a different metatype makes sense
(ditto for FILENAME: here we probably could have two types, one for the
'archive' and one for the 'contents'). But if the different mime-types
are all about the 'original' file, then we should rather figure out
which plugin gets it wrong. As for the "_GNUNET_" in
"_GNUNET_ORIGINAL_FILENAME", IIRC this is again different because
that is NOT a metatype used by GNU libextractor, but one that GNUnet
itself generates and puts with the 'rest' of the metadata.

> So, two simple proposals:
> 
>  1. Create |EXTRACTOR_METATYPE_KEY_VALUE_PAIR|
>  2. Create |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE|
> 
> What do you think? Does it make sense?

It should definitely not be "GNUNET_ORIGINAL_MIMETYPE", and the real
question is what is the origin of the different mime-types. If this is
from an archive, maybe we should introduce

EXTRACTOR_METATYPE_ARCHIVE_CONTENT_FILENAME
EXTRACTOR_METATYPE_ARCHIVE_CONTENT_MIMETYPE

and reserve

EXTRACTOR_METATYPE_FILENAME
EXTRACTOR_METATYPE_MIMETYPE

for the top-level file. But AFAIK that won't solve your mime-type issue,
which should really be resolved by going over the plugins and finding
out why and where they disagree and picking the 'right' answer.
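
As a starting point for that kind of audit, one could record which plugin
reported which mime-type and check whether more than one distinct answer
came back. The sketch below assumes the application's metadata callback is
handed the name of the reporting plugin along with each value. The
`mime_report` and `mime_tally` types, the `tally_add` and `tally_distinct`
functions, and the fixed limit of 16 reports are all hypothetical and not
part of libextractor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REPORTS 16 /* arbitrary limit for this sketch */

/* One (plugin, mime-type) observation; hypothetical type. */
struct mime_report
{
  const char *plugin_name;
  const char *mime;
};

/* All observations for a single file; hypothetical type. */
struct mime_tally
{
  struct mime_report reports[MAX_REPORTS];
  size_t count;
};

/* Record that 'plugin_name' reported 'mime'. The strings are not
   copied, so they must outlive the tally.
   Returns 1 on success, 0 if the tally is full. */
static int
tally_add (struct mime_tally *t, const char *plugin_name, const char *mime)
{
  assert (NULL != t);
  if (t->count >= MAX_REPORTS)
    return 0;
  t->reports[t->count].plugin_name = plugin_name;
  t->reports[t->count].mime = mime;
  t->count++;
  return 1;
}

/* Count how many different mime-type strings were reported.
   A result above 1 means the plugins disagree. */
static size_t
tally_distinct (const struct mime_tally *t)
{
  size_t distinct = 0;

  assert (NULL != t);
  for (size_t i = 0; i < t->count; i++)
  {
    size_t j;

    /* Look for an earlier report with the same mime-type. */
    for (j = 0; j < i; j++)
      if (0 == strcmp (t->reports[i].mime, t->reports[j].mime))
        break;
    if (j == i)
      distinct++; /* first time this mime-type is seen */
  }
  return distinct;
}
```

Calling `tally_add` from the callback for every mime-type value and then
walking `reports` whenever `tally_distinct` exceeds 1 would show which
plugin names are attached to the conflicting answers.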

My 2 cents

Christian
