[POKOLOGY] Dealing with alternatives

poke-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[POKOLOGY] Dealing with alternatives - Unions in Poke

From:	Jose E. Marchesi
Subject:	[POKOLOGY] Dealing with alternatives - Unions in Poke
Date:	Fri, 18 Oct 2019 13:05:11 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
[This is also published as an article in the Applied Pokology blog
 http://www.jemarch.net/pokology]

The Poke type definitions can be seen as a sort of declarative
specifications for decoding and encoding procedures.  You specify the
structure of the data you want to operate on, and poke uses that
information to automagically decode and encode the data for you.  Under
this perspective, struct types correspond to sequences of instructions,
array types to repetitions or loops, and union types to alternatives or
conditionals.

                     [Decode-Compute-Encode]

Computing with data whose form is not the most convenient way to be
manipulated, like is often the case in unstructured binary data,
requires performing a preliminary step that transforms the data into a
more convenient representation, usually featuring a higher level of
abstraction.  This step is known in computer jargon as "unmarshalling",
when the data is fetch from some storage or transmission media or, more
generally, "decoding".

Once the computation has been performed, the result should be
transformed back to the low-level representation to be stored or
transmitted.  This is performed in a closing step known as "marshalling"
or, more generally, "encoding".

Consider the following C program whose purpose is to read a 32-bit
signed integer from a byte-oriented storage media at a given offset,
multiply it by two, and store the result at the same offset.

void double_number (int fd, off_t offset, int endian)
{
   int number, i;
   int b[4];

   /* Decode.  */
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      read (fd, &amp;b[i], 1);

   if (endian == BIG)
     number = b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3];
   else                                                      
     number = b[3] << 24 | b[2] << 16 | b[1] << 8 | b[0];
                                                      
   /* Compute.  */
   number = number * 2;

   /* Encode.  */
   if (endian == BIG)
   {
     b[0] = (number >> 24) & 0xff;
     b[1] = (number >> 16) & 0xff;
     b[2] = (number >> 8) & 0xff;
     b[3] = number & 0xff;                            
   }
   else
   {
     b[3] = (number >> 24) & 0xff;
     b[2] = (number >> 16) & 0xff;
     b[1] = (number >> 8) & 0xff;
     b[0] = number & 0xff;                            
   }
                                                      
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      write (fd, &amp;b[i], 1);
}

As we can see, decoding takes care of fetching the data from the storage
in simple units, bytes.  Then it mounts the more abstract entity on
which the computation will be performed, in this case a signed 32-bit
integer.  Considerations like endianness, negative encoding (which is
assumed to be two's complement in this example and handled automatically
by C) and error conditions (omitted in this example for clarity) should
be handled properly.

Conversely, encoding turns the signed 32-bit integer into a sequence of
bytes and then writes them out to the storage at the desired offset.
Again, this requires taking endianness into account and handling error
conditions.

This example may look simplistic and artificial, and it is, but too
often the computation proper (like multiplying the integer by two) is
way more straightforward than the decoding and encoding of the data used
for the computation.

Generally speaking, decoding and encoding binary data is laborious and
error prone.  Think about sequences of elements, variable-length and
clever compact encodings, elements not aligned to byte boundaries, the
always bug-prone endianness, and a long etc.  Dirty business, sometimes
risky, and always _boring_.

                         [Describe-Compute]

This is where poke comes into play.  Basically, it allows you to
describe the characteristics of the data you want to compute on, and
then decodes and encodes it for you, taking care of the gory details.
That way you can concentrate your energy on the _fun_ part: computing on
the data at your pleasure.

Of course, you are still required to provide a description of the data.
In Poke, these descriptions take the form of type definitions, which are
declarative: you specify _what_ you want, and poke extracts the _how_
from that.

For example, consider the following Poke type definition:

deftype Packet =
  struct
  {
    uint<16> magic : magic == 0xef;
    uint<32> size;
    byte[size] data @ 64#B;
  };

This tells poke that, in order to decode a Packet, it should perform the
following procedure (a similar procedure is implied for encoding):

. Read four bytes from the IO space, mount them into an unsigned 16-bit
  integer using whatever current endianness, and put it in `magic'.

. Read eight bytes, mount them into an unsigned 32-bit integer
  using the same endianness, and put it in `size'.
  
. Seek the IO space to advance 16 bits.

. Do `size' times:

  . Read one byte and mount it into an unsigned 8-bit integer.
  . Put the integer in the proper place in the `data' array.

If during this procedure an end of file is encountered, or a constraint
fails, an appropriate error is raised.

In the procedure sketched above we find a sequence of operations,
implied by the struct type, and a loop, implied by the array type.  As
we shall see below, it is also possible to decode conditionally.  Union
types are used for that purpose.

                                  [BSON]

BSON (http://bsonspec.org/spec.html) is a binary encoding for JSON
documents.  The top-level entity described by the spec is the
"document", which contains the total size of the document, a sequence of
elements, and a trailing null byte.

We can describe the above in Poke with the following type definition:

deftype BSON_Doc =
  struct
  {
    offset<int32,B> size;
    BSON_Elem[size - size'size - 1#B] elements;
    byte endmark : endmark == 0x0;
  };

BSON elements come in different kinds, which correspond to the different
types of JSON entities: 32-bit integers, 64-bit integers, strings,
arrays, timestamps, and so on.  Every element starts with a tag, which
is a 8-bit unsigned integer that identifies it's kind, and a name
encoded as a NULL-terminated string.  What comes next depends on the
kind of element.

The following Poke type definition describes a subset of BSON elements,
namely integers, big integers and strings:

deftype BSON_Elem =
  struct
  {
    byte tag;
    string name;

    union
    {
      int32 integer32 : tag == 0x10;
      int64 integer64 : tag == 0x12;
      BSON_String str : tag == 0x02;
    } value;
  };

The union in BSON_Elem corresponds to the variable part.  When poke
decodes an union, it tries to decode each alternative (union field) in
turn.  The first alternative that is successfully decoded without
raising a constraint violation exception is the chosen one.  If no
alternative can be decoded, a constraint violation exception is raised.

To see this process in action, let's use the BSON corresponding
to the following little JSON document [1]:

{
    "name" : "Jose E. Marchesi",
    "age" : 39,
    "big" : 1076543210012345
}

Let's take a look to the different elements:

(poke) .load bson.pk
(poke) defvar d = BSON_Doc @ 0#B
(poke) d.elements'length
0x3UL
(poke) d.elements[0]
BSON_Elem {tag=0x2UB,name="name",value=struct {str=BSON_String 
{size=0x11,value="Jose E. 
Marchesi",chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem {tag=0x10UB,name="age",value=struct {integer32=0x27}}
(poke) d.elements[2]
BSON_Elem {tag=0x12UB,name="big",value=struct {integer64=0x3d31c3f9e3eb9L}}

Note how unions decode into structs featuring different fields.  What
field is available depends on the alternative chosen while decoding.

In the session above, d.elements[1].value contains an integer32 field,
whereas d.elements[2].value contains an integer64 field.  Let's see what
happens if we try to access the wrong field:

(poke) d.elements[1].value.integer64
unhandled invalid element exception

We get a run-time exception.  This kind of errors cannot be catched at
compile time, since both integer32 and integer64> are potentially valid
fields in the union value.

                          [Unions are Tagged]

Unlike in C, Poke unions are tagged.  Unions have their own lexical
closures, and it is the capured values that determine what field is
chosen at every time.  Wherever the union goes, its tag accompanies it.

To see this more clearly, consider the following alternative
Poke description of the BSON elements:

deftype BSON_Elem =
  union
  {
    struct
    {
      byte tag : tag == 0x10;
      string name;
      int32 value;
    } integer32;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      int64 value;
    } integer64;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      BSON_String value;
    } str;
  };

This description is way more verbose, but it shows a few important
properties of Poke unions.

First, the constraints guiding the decoding are not required to appear
in the union itself: it is a recursive process.  In this example,
BSON_String could have constraints on it's own, and these constraints
will also impact the decoding of the union.

Second, there are generally many different ways to express the same
plain binary using different type structures.  This is no different than
getting different parse trees from the same sequence of tokens using
different grammars denoting the same language.

See how different a BSON element looks (and feels) using this
alternative description:

(poke) d.elements[0]
BSON_Elem {str=struct {tag=0x2UB,name="name",value=BSON_String 
{size=0x11,value="Jose E. 
Marchesi",chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem {integer32=struct {tag=0x10UB,name="age",value=0x27}}
(poke) d.elements[2]
BSON_Elem {integer64=struct {tag=0x12UB,name="big",value=0x3d31c3f9e3eb9L}}

What is the best way? It certainly depends on the kind of data you want
to manipulate, and the level of abstraction you want to achieve.
Ultimately, it is up to you.

Generally speaking, the best structuring is the one that allows you to
manipulate the data in terms of the structured abstractions as naturally
as possible.  That's the art and craft of writing good pickles.

GNU poke is evolving to facilitate this task as much as possible.
Features like struct flattening and unification of union fields (i.e. to
be able to have different alternatives with the same name but
potentially different types) are being worked on, so stay tuned.

Happy poking! :)
    
Footnotes:

[1] To generate the BSON from the JSON I used libbson and its little
    example program json2bson.
[Prev in Thread]
Current Thread
[Next in Thread]
[POKOLOGY] Dealing with alternatives - Unions in Poke, Jose E. Marchesi <=
Prev by Date: Re: grrr, GC disabled
Next by Date: Styled output in Poke programs!
Previous by thread: grrr, GC disabled
Next by thread: Styled output in Poke programs!
Index(es):
- Date
- Thread