chicken-hackers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-hackers] Serialization of tinyclos objects in Chicken Scheme


From: Tony Sidaway
Subject: [Chicken-hackers] Serialization of tinyclos objects in Chicken Scheme
Date: Thu, 3 Dec 2009 00:40:24 +0000

The s11n (serialization) egg is very concise and simple, but
efficiently renders much Chicken Scheme data and even some executable
code into a portable form that can be stored and transmitted and then
restored through a deserialization process that may optionally account
for major differences in machine architectures such as word size and
byte order.

Alternatives exist, such as the PHP serialization egg, enabling
Chicken Scheme code to interoperate with PHP code that uses that
language's native serialization and deserialization methods.  Since my
interest is the use of serialization to implement persistent tinyclos
objects in Chicken Scheme, I have concentrated on the native s11n egg,
which provides a much closer fit to the Scheme language.

The s11n serializer is a recursive-descent parser that "walks" the
structured data, and encodes each leaf of the walked parser tree as a
single-byte data type tag followed by an encoding appropriate to the
basic data type being encoded.  The deserializer is a LR parser that
decodes the data according to the date type tags in the input stream.

A naive approach to serializing complex objects such as tinyclos
instance structures (classes and objects) using the s11n egg fails
because s11n as implemented is only able to serialize procedures if it
can resolve them using C_lookup_procedure.  This does not work, for
instance, with the field initializers, getters and setters and so on
that are embedded in every tinyclos object.  Even if this naive
approach worked it would produce large deserialized output that, on
deserialization, would duplicate the class hierarchy of every single
deserialized object.

In practice the serialization of class definitions and even class
instances per se is undesirable as well as complex.  For most
practical purposes, limiting the scope of serialization to identifying
the class and the slot contents is the most economical and most useful
strategy.  While the notion of class definitions that can be stored on
external media and transmitted as data is most easily envisioned in
Lisp and especially in Scheme, that is beyond the scope of my current
project.  Chicken Scheme's dynamic module load facility provides a
very usable and familiar means of storing class definitions in a form
that has most of the desirable properties of serialization.

One of the limitations of s11n as it stands at present is that,
although there is a provision for a fallback serializer to be used
when the built-in algorithm fails, this is only invoked at the leaf
level.  The problem here is that the serializer might already have
embarked on the task of serializing a complex structure, only to find
unserializable components.  The only thing a fallback could do here is
to insert a substitute for the unserializable component (for instance,
unserializable procedures could be replaced by references to the
procedure "void").  But this doesn't really help because, on
deserialization, one ends up with a complex structure that would not
make semantic sense within the context of a Scheme program.

The root of the problem is that this is a bottom-up approach to
exception handling.  Exceptions are only discovered at the leaf level,
ignoring readily available intelligence provided at higher levels of
organization.

It makes more sense to adopt a top-down approach, and since s11n is a
recursive-descent parser that's easy to do.  Most complex structures
are represented as vectors, so an optional argument providing an
alternative processor for vector structures would enable the caller to
provide alternative codings for selected structures which would make
more sense than the built-in code which naively stores the vector
header and then walks all of the slots by recursive descent.

In the case of non-class tinyclos objects, as discussed above, it
makes perfect sense to serialize only class identification data and a
list of slot values.  A slot serializer and deserializer are trivial
to write: produce a vector of consecutive slot values and serialize
it, and to deserialize take a vector of slot values and place them
into the target object by consecutive ordering of slot.

The requirements of class identification depend very much on the
context of the application.  A small private application using, say,
64 or fewer distinct classes would benefit from a compact strategy in
which each distinct class has its own data type tag. Thus I suggest
that, of the 256 possible byte-sized data type tags (only 20 of which
are in use), 64 should be set aside for user-implemented compact
encoding of tinyclos objects.  Typically a small application might
provide a means of associating data type tags with classes, and vice
versa.  Serialization would comprise emitting the appropriate data
type tag (a single byte) followed by a vector containing the
serialized slot values.  Deserialization would comprise looking up the
data type tag, creating an object of the associated class, and
populating the object slots from the serialized slot value vector that
follows.  This is quite compact and easy for a small application to
implement and control.

A further tag should be reserved for classes identified by a Scheme
object.  When that tag appears in serialized data, it should be
followed by a scheme object (preceded by its appropriate data type
tag) identifying the class.

The serialization and deserialization strategies in this case involve
maintaining an association between class identifiers and classes.  The
technique is essentially identical to the compact technique except
that, instead of relying on the value of the data type tag, the
distinguisher for class identity is the data object that follows it.
After that a slot value vector follows the class identifier.

Finally, a tag should be reserved for classes to be identified by
strings, and another for classed to be identified by Universally
Unique Identifiers (UUIDs).

A UUID is a 128-bit hash used as a unique identifier for an object
(the most widely known implementation is Microsofts, where they are
known as GUIDs, where the G stands for Global).  While there are
different strategies for generating UUIDs (and we have two Chicken
eggs that implement practically all of the common strategies), the
basic principle is that the 128-bit namespace is so vast that
unintentional collision of identifiers is vanishingly unlikely.

The tag would be followed by 16 bytes containing the UUID.  UUID-based
class identification would enable the creation of a central class
registry so that, in principle, a deserialization procedure could, by
consulting registry data, dynamically load a module containing a
suitable class definition and execute code to create an object of the
appropriate class into which to insert the deserialized slot data that
follows in the deserialization stream.

String-identified classes would be identified by C strings (sequences
of bytes terminated by a zero byte).  The string class data type
identifier would be followed by a C string identifying the class.

In principle the C string and UUID techniques could be subsumed into
the general data identifier technique, but having a separate data type
tag for these techniques saves some overhead associated with that more
general technique.  The length and format of a UUID, and of a C
string, are known in advance, so further length and data type
information can be omitted.

Typically the C string technique would use the tinyclos "class-name"
procedure for encoding.  Since tinyclos provides no direct means of
associating class names back to classes, the application would have to
provide such an association.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]