libextractor

Re: [libextractor] Solaris, iconv and libextractor


From: Michał Kowalczuk
Subject: Re: [libextractor] Solaris, iconv and libextractor
Date: Wed, 12 Apr 2006 15:14:33 +0200
User-agent: Mail/News 1.5 (X11/20060122)

Christian Grothoff wrote:

> If you have any insights as to what is the exact encoding used by PDF here
> (in particular wrt to the iconv conversion call), please let me know.

From the PDF Reference, fifth edition, page 153:
#v+
3.8.1 Text Strings

Certain strings contain information that is intended to be human-readable,
such as text annotations, bookmark names, article names, document information,
and so forth. Such strings are referred to as text strings. Text strings are
encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding
is a superset of the ISO Latin 1 encoding and is documented in Appendix D.
Unicode is described in the Unicode Standard by the Unicode Consortium (see
the Bibliography).

For text strings encoded in Unicode, the first two bytes must be 254 followed
by 255. These two bytes represent the Unicode byte order marker, U+FEFF,
indicating that the string is encoded in the UTF-16BE (big-endian) encoding
scheme specified in the Unicode standard. (This mechanism precludes beginning
a string using PDFDocEncoding with the two characters thorn ydieresis, which
is unlikely to be a meaningful beginning of a word or phrase).
#v-
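
The rule quoted above amounts to a two-byte check. A minimal sketch in C, assuming the raw bytes and their length are available; the enum and the function name are hypothetical and are not part of libextractor or xpdf:

```c
#include <stddef.h>

/* Hypothetical names for illustration only. */
enum pdf_text_encoding { PDF_TEXT_PDFDOC, PDF_TEXT_UTF16BE };

/* Classify a PDF text string by its first two bytes.
   0xFE 0xFF (254, 255) is the byte order mark U+FEFF in big-endian
   order; anything else is taken to be PDFDocEncoding. */
static enum pdf_text_encoding
pdf_text_string_encoding(const char *s, size_t len)
{
  /* Compare as unsigned bytes: plain char may be signed, in which
     case s[0] == 254 would never be true. */
  const unsigned char *u = (const unsigned char *) s;

  if (len >= 2 && u[0] == 0xFE && u[1] == 0xFF)
    return PDF_TEXT_UTF16BE;
  return PDF_TEXT_PDFDOC;
}
```

The length check guards against reading past the end of empty or one-byte strings.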

So you can safely treat s as UTF-16 encoded (iconv will detect that it is
UTF-16BE because it starts with a BOM, the Byte Order Mark), or force
UTF-16BE on (s+2), if s[0] == 254 && s[1] == 255 (compared as unsigned
bytes). :)
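
The second option (forcing UTF-16BE on s+2) could look roughly like the sketch below. It calls iconv directly instead of going through libextractor's convertToUtf8(); the function name is hypothetical, and the iconv() prototype shown is the POSIX one with a char ** input argument (some older systems declare it as const char **, so the cast may need adjusting there):

```c
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper, not libextractor's convertToUtf8().
   Converts a PDF text string that starts with the 0xFE 0xFF byte order
   mark to a NUL-terminated UTF-8 string. Returns NULL if the BOM is
   missing or the conversion fails; the caller frees the result. */
static char *
pdf_utf16be_to_utf8(const char *s, size_t len)
{
  const unsigned char *u = (const unsigned char *) s;
  iconv_t cd;
  char *in, *out, *result;
  size_t in_left, out_left, out_size;

  if (len < 2 || u[0] != 0xFE || u[1] != 0xFF)
    return NULL;                /* not a Unicode text string */

  cd = iconv_open("UTF-8", "UTF-16BE");
  if (cd == (iconv_t) -1)
    return NULL;

  /* Skip the BOM: the byte order is already fixed by the encoding name. */
  in = (char *) s + 2;
  in_left = len - 2;

  /* A 2-byte code unit yields at most 3 UTF-8 bytes and a 4-byte
     surrogate pair yields 4, so 3/2 of the input plus the terminator
     is always enough. */
  out_size = in_left + in_left / 2 + 1;
  result = malloc(out_size);
  if (result == NULL) {
    iconv_close(cd);
    return NULL;
  }
  out = result;
  out_left = out_size - 1;

  if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t) -1) {
    free(result);
    iconv_close(cd);
    return NULL;
  }
  iconv_close(cd);
  *out = '\0';
  return result;
}
```

Naming the byte order explicitly avoids relying on how a particular iconv implementation treats a BOM under the plain "UTF-16" name.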

And the Unicode-related code can be shortened to:

#v+
if (((unsigned char) s[0]) == 0xfe &&
    ((unsigned char) s[1]) == 0xff) {
  /* BOM present, so the string is UTF-16BE; s still includes the BOM,
     which lets iconv pick the byte order from the "UTF-16" name */
  char * result;

  result = (char*) convertToUtf8((const char*) s,
                                 s1->getLength(), "UTF-16");

  next = addKeyword(type,
                    strdup(result),
                    next);
  free(result);
} else {
  (...)
#v-

both in printInfoString() and printInfoDate().

-- 
greetings,
Michał Kowalczuk
Wirtualna Polska S.A.




