[Gzz] ``canon3_file_format``: A canonical, N3-based file format


From: Benja Fallenstein
Subject: [Gzz] ``canon3_file_format``: A canonical, N3-based file format
Date: Tue, 01 Apr 2003 21:45:50 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030327 Debian/1.3-4

=========================================================
``canon3_file_format``: A canonical, N3-based file format
=========================================================

:Author:        Benja Fallenstein
:Date:          2003-04-01
:Revision:      $Revision: 1.1 $
:Last-Modified: $Date: 2003/03/31 09:37:41 $
:Type:          Architecture
:Scope:         Major
:Status:        Current


We need a canonical file format for storing data in CVS.
The format must be canonical so that diffs show only
differences in structure, not changes that arise because
one RDF writer ordered triples differently from another.
This format is also a potential candidate
for storing versions of RDF graphs in Storm.

This PEG specifies such a format.


Specification
=============

The name of the format is *Canon3*. This version is identified
by the URI <http://fenfire.org/2003/Canon3/1.0>. It is related to
both `Notation 3`_ and `NTriples`_. Canon3 files
are encoded as UTF-8, normalized to Unicode `Normalization Form C`_.
They obey the following grammar::

    document ::= header (triple)*
    header ::= "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>" NEWLINE
    triple ::= subject " " property " " object "." NEWLINE
    subject ::= URItoken | anonNode
    property ::= URItoken
    object ::= URItoken | anonNode | literal
    URItoken ::= "<" URIref ">"
    anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
    literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers
    qualifiers ::= ("@" language)? ("^^" URItoken)?

The ``NEWLINE`` token may be any of CR, LF, or CRLF.
(This is necessary for CVS to be useful across platforms.)
In contexts where the specific form used matters,
the newline character is LF (in particular, when computing
a content hash, e.g., when creating a Canon3 Storm block).
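
As a minimal sketch of the LF rule above, the following Python
normalizes line endings before hashing. The function names and the
use of SHA-1 are assumptions made for illustration; this PEG does not
name a hash function. The sketch rewrites every CR and CRLF in the
input, including any that occur inside literals.

```python
import hashlib


def normalize_newlines(data: bytes) -> bytes:
    """Rewrite CRLF and lone CR line endings as LF."""
    # Replace CRLF first so that it does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def content_hash(data: bytes) -> str:
    """Hash the LF-normalized bytes (SHA-1 is an assumed choice)."""
    return hashlib.sha1(normalize_newlines(data)).hexdigest()
```

With this, a file checked out with CRLF line endings and the same
file with LF line endings yield the same hash.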

The triples must be ordered. Two triples are compared
by comparing their subjects, properties, and objects
in this order. Each of these parts is compared
as follows:

- Literals are lower than (go before) URIrefs,
  which go before anonymous nodes.
- URIrefs are compared character-by-character,
  in the form as defined in [RFC 2396]
  (i.e., *after* Unicode characters outside
  the ASCII range have been escaped).
  Characters are compared by Unicode code point
  value.
- Literals are compared character-by-character
  in their unescaped form (i.e., before the
  backslash escaping defined below). If the
  strings of two literals are equal, first
  the language tag and then the data type,
  if any, are compared in the same manner.
  Literals without language tags/data types
  go before literals with them (if the
  contents of the literals are equal).
- Anonymous nodes are compared by their
  internal identifiers (the stuff following
  the ``_:``), also character-by-character.

Each triple may be listed only once; if the graph to be
serialized contains two equal triples, that triple must
occur only once in the serialization.
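
The ordering and uniqueness rules above can be sketched as a sort
key in Python. The classes ``URI``, ``Anon`` and ``Literal`` and the
function names are hypothetical, introduced only for this example.
The sketch assumes URIrefs are already in their escaped RFC 2396 form
and relies on Python comparing strings by Unicode code point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class URI:
    ref: str  # assumed to be already escaped per RFC 2396


@dataclass(frozen=True)
class Anon:
    ident: str  # the part following "_:"


@dataclass(frozen=True)
class Literal:
    string: str  # unescaped contents
    language: Optional[str] = None
    datatype: Optional[str] = None


def node_key(node):
    """Sort key: literals first, then URIrefs, then anonymous nodes."""
    if isinstance(node, Literal):
        # A missing tag maps to "", which sorts before any non-empty
        # tag, so untagged literals go before tagged ones.
        return (0, node.string, node.language or "", node.datatype or "")
    if isinstance(node, URI):
        return (1, node.ref, "", "")
    return (2, node.ident, "", "")


def canonical_order(triples):
    """Drop duplicate triples and sort by subject, property, object."""
    return sorted(set(triples), key=lambda t: tuple(node_key(n) for n in t))
```

Frozen dataclasses are used so that a URIref and an anonymous node
with the same text never compare equal when duplicates are removed.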

``URIref`` is a URI reference as defined in [RFC 2396].
Percent escapes (e.g. ``%2f``) should preferably
be encoded in lower case. URIref may be one of the following:

1. An absolute URI (e.g., ``http://example.org/``).
2. An absolute URI plus a fragment identifier
   (e.g., ``http://example.org/#foo``).
3. The empty URI reference (which is a relative URI
   referring to the current document).
4. A standalone fragment identifier (e.g., ``#foo``),
   referring to a fragment of the current document.
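
A rough Python sketch of checking which of the four forms a URIref
takes, and of lower-casing percent escapes. The function names are
hypothetical, and the check only inspects the scheme prefix and the
presence of ``#``; it is not a full RFC 2396 validator.

```python
import re

# RFC 2396 scheme: a letter followed by letters, digits, "+", "-" or ".".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")
_PERCENT = re.compile(r"%[0-9A-Fa-f]{2}")


def uriref_form(ref: str) -> int:
    """Return the number (1-4) of the listed form that ref takes."""
    if ref == "":
        return 3  # empty URI reference
    if ref.startswith("#"):
        return 4  # standalone fragment identifier
    if _SCHEME.match(ref):
        # absolute URI, with or without a fragment identifier
        return 2 if "#" in ref else 1
    raise ValueError("not one of the four allowed URIref forms: %r" % ref)


def lowercase_escapes(ref: str) -> str:
    """Rewrite percent escapes such as %2F in lower case (%2f)."""
    return _PERCENT.sub(lambda m: m.group(0).lower(), ref)
```

Other relative references, such as ``foo/bar``, are rejected by this
sketch because they match none of the four forms.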

``language`` is a Language-Tag as defined by [RFC 3066].

A ``string`` is any UTF-8 character sequence
encoded in the following way:

- Double any backslash in the string.
- Insert a backslash before the first of any three
  consecutive double quotes (#x22) in the string.
  (This means: in a sequence of three or more
  double quote characters, insert a backslash
  before all but the last two double quotes.)

For example, the string ``f\oo"""""ba"r`` becomes
``f\\oo\"\"\"""ba"r``.
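
The two escaping steps can be sketched in Python as follows. The
function names are hypothetical; ``unescape_string`` is an assumed
inverse, since this PEG describes only the encoding direction.

```python
import re


def escape_string(s: str) -> str:
    """Apply the two Canon3 string escaping steps."""
    # Step 1: double every backslash.
    s = s.replace("\\", "\\\\")
    # Step 2: in each run of three or more double quotes, put a
    # backslash before all but the last two.
    return re.sub(
        r'"{3,}',
        lambda m: '\\"' * (len(m.group(0)) - 2) + '""',
        s,
    )


def unescape_string(s: str) -> str:
    """Assumed inverse: drop the backslash from each \\\\ or \\" pair."""
    return re.sub(r'\\(["\\])', r"\1", s)


print(escape_string(r'f\oo"""""ba"r'))  # prints f\\oo\"\"\"""ba"r
```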

Strings may contain newlines. Like all of Canon3,
they are encoded in Normalization Form C.
They are enclosed in triple double quotes
(see production ``literal``).
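
A small Python sketch of the encoding requirement, using the
standard ``unicodedata`` module; the function name is hypothetical.

```python
import unicodedata


def to_canon3_bytes(text: str) -> bytes:
    """Normalize text to NFC, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

For example, the letter "e" followed by a combining acute accent
(U+0301) and the precomposed character U+00E9 both come out as the
same two bytes, so they cannot cause spurious diffs.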

We will register a MIME type for Canon3.

\- Benja


.. _Normalization Form C: http://www.unicode.org/unicode/reports/tr15/
.. _NTriples: http://www.w3.org/TR/rdf-testcases/#ntriples
.. _Notation 3: http://www.w3.org/DesignIssues/Notation3.html





