Re: [PATCH] * grub-core/fs/udf.c: Add support for UUID
From: Pali Rohár
Subject: Re: [PATCH] * grub-core/fs/udf.c: Add support for UUID
Date: Fri, 12 May 2017 16:39:14 +0200
User-agent: KMail/1.13.7 (Linux/3.13.0-117-generic; KDE/4.14.2; x86_64; ; )

On Monday 08 May 2017 15:13:28 Vladimir 'phcoder' Serbinenko wrote:
> On Mon, Apr 10, 2017, 23:17 Pali Rohár <address@hidden> wrote:
> > -read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf)
> > +read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf, int normalize_utf8)
>
> Normalize isn't the right word. And it's not utf-8 but latin1 (called
> compressed utf-16 by udf docs).
> Are you sure you handle utf-16 case correctly? What is the expected
> behavior in those cases? Ideally you may want to just parse raw
> string in caller
Hi! I looked at the OSTA UDF spec again and found the reason for my
confusion: libblkid implements 8-bit OSTA compressed unicode incorrectly,
and I just tried to mimic libblkid in grub.
libblkid handles 16-bit OSTA compressed unicode as UTF-16BE and 8-bit OSTA
compressed unicode as UTF-8.
The UDF 2.01 specification says:
====
For a CompressionID of 8 or 16, the value of the CompressionID shall
specify the number of BitsPerCharacter for the d-characters defined in
the CharacterBitStream field. Each sequence of CompressionID bits in the
CharacterBitStream field shall represent an OSTA Compressed Unicode d-
character. The bits of the character being encoded shall be added to the
CharacterBitStream from most- to least-significant-bit. The bits shall
be added to the CharacterBitStream starting from the most significant
bit of the current byte being encoded into. The value of the OSTA
Compressed Unicode d-character interpreted as a Uint16 defines the value
of the corresponding d-character in the Unicode 2.0 standard.
====
So an 8-bit OSTA compressed unicode buffer contains a sequence of Unicode
code points, one per 8 bits, which is effectively equivalent to the
Latin1 (ISO-8859-1) encoding.
And 16-bit OSTA compressed unicode is a sequence of Unicode code points,
one per 16 bits, in big endian. That is probably only UCS-2 and not full
UTF-16.
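As a rough sketch of how I read the spec text above (not the code from the
patch): the first byte is the CompressionID and every following 8-bit or
big-endian 16-bit unit is taken directly as a code point and re-encoded as
UTF-8. The names osta_to_utf8 and put_utf8 are made up for illustration, and
treating 16-bit units as plain UCS-2 without surrogate handling is my
assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one code point <= 0xFFFF to out as UTF-8; return bytes written. */
static size_t
put_utf8 (uint16_t cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (char) (0xE0 | (cp >> 12));
  out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Hypothetical helper, not GRUB's read_string().
   raw[0] is the CompressionID, raw[1..sz-1] the CharacterBitStream.
   outbuf must have room for 3 * (sz - 1) + 1 bytes.
   Assumption: 16-bit units are UCS-2, surrogates are not combined.
   Returns 0 on success, -1 on unsupported or malformed input. */
static int
osta_to_utf8 (const uint8_t *raw, size_t sz, char *outbuf)
{
  size_t i, o = 0;

  if (sz == 0)
    return -1;

  if (raw[0] == 8)
    {
      /* One code point per byte, i.e. the same mapping as Latin1. */
      for (i = 1; i < sz; i++)
        o += put_utf8 (raw[i], outbuf + o);
    }
  else if (raw[0] == 16)
    {
      /* One code point per two bytes, most significant byte first. */
      if ((sz - 1) % 2 != 0)
        return -1;
      for (i = 1; i + 1 < sz; i += 2)
        o += put_utf8 ((uint16_t) ((raw[i] << 8) | raw[i + 1]), outbuf + o);
    }
  else
    return -1;

  outbuf[o] = '\0';
  return 0;
}
```

With this reading, the 8-bit buffer { 8, 'U', 0xE9 } decodes to "U" followed
by U+00E9 (bytes C3 A9 in UTF-8), and the 16-bit buffer { 16, 0x20, 0xAC }
decodes to U+20AC.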
So the problem is with 8-bit OSTA compressed unicode when it contains bytes
that are not UTF-8 invariants (ASCII), as those are decoded differently
as Latin1 and as UTF-8.
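To make the affected case concrete, here is a minimal sketch of a check for
whether an 8-bit payload (the bytes after the CompressionID) is pure ASCII,
in which case the Latin1 and UTF-8 interpretations give the same result. The
function name osta8_is_unambiguous is hypothetical and exists only for this
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if every payload byte is ASCII (< 0x80),
   so reading it as Latin1 or as UTF-8 yields the same string; return 0
   as soon as a byte >= 0x80 is found, where the two readings diverge. */
static int
osta8_is_unambiguous (const uint8_t *payload, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (payload[i] >= 0x80)
      return 0;
  return 1;
}
```

For example, the payload "UDF" passes the check, while a payload containing
the byte 0xE9 does not: per the spec it is U+00E9, but a UTF-8 reader sees
an incomplete multi-byte sequence.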
(Please correct me if I'm wrong here)
For now, please put this patch on hold until we decide what to do with it,
given the different (wrong) implementation of string reading in libblkid
from util-linux.
--
Pali Rohár
address@hidden