[POKOLOGY] Values of the world, unite!

poke-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[POKOLOGY] Values of the world, unite! - Offsets in Poke

From:	Jose E. Marchesi
Subject:	[POKOLOGY] Values of the world, unite! - Offsets in Poke
Date:	Sun, 06 Oct 2019 03:33:32 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
[This is also published as an article in the Applied Pokology blog
 http://www.jemarch.net/pokology]

Early in the design of what is becoming GNU poke I was struck by a
problem that, to my surprise, would prove not easy to fix in a
satisfactory way: would I make a byte-oriented program, or a
bit-oriented program?  Considering that the program in question was
nothing less than an editor for binary data, this was no petty dilemma.

Since the very beginning I had a pretty clear idea of what I wanted to
achieve: a binary editor that would be capable of editing user defined
data structures, besides bytes and bits.  I also knew I needed some sort
of domain specific language to describe these structures and operate on
them.  How that language would look like, and what kind of abstractions
it would provide, however, was not clear to me.  Not at all.

So once I sketched an initial language design, barely something very
similar to C structs, I was determined to not continue with the poke
implementation until I had described as many as binary formats in my
language as possible.  That, I reckoned, was the only way to make sure
the implemented language would be expressive, complete and useful enough
to fulfill my requirements.

The first formats I implemented using my immature little language
included ELF, FLV, MP3, BSON... all of them describing structures based
on whole bytes.  Even when they try to be compact, it is always by
packing bit-fields in elements that are, invariably, sized as a multiple
of bytes.  Consequently, the language I was evolving became byte
oriented as well.  No doubt also influenced by my C inheritance, I would
think of bitfields either as a sort of second class citizen, or as mere
results of shifting and masking.

This worked well.  The language evolved to be able to express many
different aspects of these formats in a very nice way, like
variable-length data and holes in structures.  Consider the following
definition in what actually is _not_ valid Poke:

      deftype Data =
      struct
        {
          byte magic;
          byte count;
          byte dstart;

          byte[count] data @ dstart;
        };

The data starts with a byte that is a magic number.  Then the size of
the data stored, in bytes, and then the data itself.  This data,
however, doesn't start right after `dstart': it starts at `dstart',
which is expressed as an offset, in bytes, since the beginning of the
Data.  I conceived struct field labels to be any expression evaluating
to an integer, which would be... bytes, obviously.

Very nice but...

                             [Deflate!]

Then, one day, it was the turn for IETF RFC1951, which is the
specification of the DEFLATE algorithm and associated file format.  Oh
dear.  Near the beginning of the spec document it can be read:

   This document does not address the issue of the order in which bits
   of a byte are transmitted on a bit-sequential medium, since the final
   data format described here is byte- rather than bit-oriented.
   However, we describe the compressed block format in below, as a
   sequence of data elements of various bit lengths, not a sequence of
   bytes.

Then it goes on describing rules to pack the DEFLATE elements into
bytes.  I was appalled, and certainly sort of deflated as well.  The
purpose of my program was precisely to edit binary in terms of the data
elements described by a format.  And in this case, these data elements
came in all sort of bit lengths and alignments.  This can be seen in the
following RFC1951 excerpt, that describes the header of a compressed
block:

   Each block of compressed data begins with 3 header bits
   containing the following data:

       first bit       BFINAL
       next 2 bits     BTYPE
      
   Note that the header bits do not necessarily begin on a byte
   boundary, since a block does not necessarily occupy an integral
   number of bytes.

At this point I understood that my little language on the works would be
never capable to describe the DEFLATE structures naturally: C-like
bit-fields, masking and shifting, all based on byte-oriented containers
and boundaries, would never provide the slickness I wanted for my
editor.  I mean, just use C and get done with it.

This pissed me off.  Undoubtely other formats and protocols would be
impacted in a similar way.  Even when most formats are byte oriented,
what am I supposed to tell to the people hacking bit-oriented stuff?
"Sorry pal, this is not supported, this program is not for you"?  No
way, I thought, not on my watch.

                         [General Failure]

The obvious solution for the problem, is to be general.  In this case,
to express every offset and every memory size in bits instead of bytes.
While this obviously gives the language maximum expressiveness power,
and is ideal for expressing the few bit-oriented formats, it has the
disadvantage of being very inconvenient for most situations.

To see how annoying this is, let's revisit the little Data element we
saw above.  In a bit-oriented description language, we would need to
write something like:

      deftype Data =
      struct
        {
          byte magic;
          byte count;
          byte dstart;

          byte[count] data @ dstart * 8;
        };

Yeah... exactly.  The `*' and `8' keys in the keyboards of the poke
users would wear out very fast, not to mention their patience as well.
Also, should I provide both `sizeof' and `bitsizeof' operators?  Nasty.

I am very fond of the maxim "Never write a program you would never use
yourself[1], so I resigned myself to make GNU poke byte oriented, and to
provide as many facilities for operating on bit-fields as possible.

Fortunately, I have smart friends :)) 

                         [United Rabbits]

During one of the Rabbit Herd's Hacking Weekends
(http://jemarch.net/rhhw.html) I shared my frustration and struggle with
the other rabbits, and we came to realize that offsets and data sizes in
Poke should not be pure magnitudes or mere integer values: they should
be united.  They should have units.

It makes full sense when you come to think about it.  For a program like
poke, it is only natural to talk about different memory units, like
bits, bytes, kilobytes, megabits, and so on.  Bits and bytes are just
too common units.  Apart from allowing me to express values in different
units, it also has other benefits as we will see shortly.

I'm really grateful to Bruno Haible, Luca Saiu and Nacho Gonzalez for
putting me on the right track.

So, armed with this insight, I developed the idea fast.  The first step
was to come with a convenient syntax to denote these united values,
which I called "offsets" (because in a binary editor you mostly use them
to denote offsets in the file you are editing).  After a not very
successful attempt (where [12 b] would denote twelve bits) I settled in
what poke supports today:

      12#B
      7#b
      1024#Kb

The offsets above denote twelve bytes, seven bits and one thousand
twenty four kilobytes, respectively.  The unit can be separated from the
magnitude by blank characters, so you can write the following instead if
you are so inclined:

      12 #B
      7 #b
      (1024 * 1024) #Kb

Then came a little algebra for offsets:

      OFFSET + OFFSET -> OFFSET
      OFFSET - OFFSET -> OFFSET
      OFFSET * MAGNITUDE -> OFFSET
      OFFSET / OFFSET -> MAGNITUDE
      OFFSET % OFFSET -> OFFSET

The addition or subtraction of two offsets results in another offset:
`1#B + 1#b -> 7#b'.  Multiplying an offset by a magnitude gives you
another offset[2]: `8#b * 2 -> 2#B'.  Dividing two offsets gives you a
magnitude: `16#b / 1#B -> 2'.  Finally, the modulus of two offsets is
another offset: `17#b / 1#B -> 1#b'.

These offsets soon proved a very comfortable and smooth way to convert
memory magnitudes between different units: just use units like you do
when doing physics or working with units in other contexts.  To promote
this idea, I also added a little syntactic trick: if the magnitude of an
offset is 1, you can omit it.  This allows you to write things like:

      24 #Kb/#B

Which gives you the number of bytes in 24 kilobytes.

Introducing offsets as first class citizen values in the Poke language
gave me exactly what I was aiming for.  Typical byte oriented constructs
can still be expressed naturally:

      deftype Data =
        struct
        {
          byte magic;
          byte count;
          offset&lt;byte,B&gt; dstart;

          byte[count] data @ dstart;
        };

But also the odd bit-oriented format:

      deftype BitData =
        struct
        {
          byte magic;
          byte count;
          offset&lt;byte,b&gt; dstart;

          byte[count] data @ dstart;
        };

                 [General Satisfaction]

Once I had a first implementation of the offsets and their operators
working, I asked myself what would be the right set of units to offer to
users.  In the offset syntax, units are specified as #FOO, where FOO is
the name of the unit.  I had a list of hardcoded units like `b' for
bits, `B' for bytes, `Kb' and so on.  But what if the unit that _you_
want is not among them?

So I also allowed to express units in multiples of the base unit, which
is the bit.  Using this syntax, you can express offsets in any arbitrary
unit, as disparate as it may seem:

      17#3
      0#12
      8#1

Thats it, 17 units of 3 bits each, zero units of 12 bits each, and eight
units of 1 bit each.  Sweet!

But then, why stopping there?  Pxoking is all about defining data
structures and operating on them... so why not using these structures as
units as well?  Consider the following struct:

      deftype Packet = struct { int i; long j; };

Unlike the Data struct above, the size of a Packet is known at compile
time.  Wouldn't it be nice to use it as a unit in offsets?  Sure it is:

      23#Packet

The above is the size occupied by 23 packets.  And how many
kilobytes are in 23 packets?  Easy:

      23 #Packet/#Kb

Expressing offsets as united values also relieves the programmer from
doing many explicit unit conversions: poke can do them for you.
Consider for example an ELF section header.  One of it's fields is the
size of the described section, in bytes:

      deftype Elf64_Shdr =
        struct
        {
         ...
         offset&lt;Elf64_Xword,B&gt; sh_size;
         ...
        };

If a given section is to contain, say, relocations with addends, we can
set it's size doing something like this:

      shdr.sh_size = 10#Elf64_Rela;

Instead of doing the conversion to bytes explicitly.

Finally, having fields in structs as offsets also makes poke aware of
what is a reference to other parts of the file.  This will be used in
the near future when we add support for resizing structures
automagically... yeah that's coming.

Happy poking! :)
    
Footnotes:

[1] Actually it is Lord Vetinari's "Never build a dungeon you can't get
    out of." but the point is the same.

[2] Considering memory is linear, in principle multiplying two offsets
      woudn't make a lot of sense... then again, something like `2#B *
      8#B -> 16#B could maybe be useful when working with tables... hmmm
      tempting ;)
[Prev in Thread]
Current Thread
[Next in Thread]
[POKOLOGY] Values of the world, unite! - Offsets in Poke, Jose E. Marchesi <=
- Re: [POKOLOGY] Values of the world, unite! - Offsets in Poke, Jose E. Marchesi, 2019/10/05
Prev by Date: TODO file
Next by Date: Re: [POKOLOGY] Values of the world, unite! - Offsets in Poke
Previous by thread: TODO file
Next by thread: Re: [POKOLOGY] Values of the world, unite! - Offsets in Poke
Index(es):
- Date
- Thread