Re: [Gnu-arch-users] Encoding handling proposal
From: Tom Lord
Subject: Re: [Gnu-arch-users] Encoding handling proposal
Date: Mon, 30 Aug 2004 13:13:47 -0700 (PDT)
> From: Marcus Sundman <address@hidden>
> A) There should be support for both mandatory and optional metadata
> attributes associated with each file in the repository.
Agreed.
> B) "Content-Type" should be a mandatory metadata string attribute.
Quite possibly. Other alternatives should be explored. For
example, instead of a content-type, perhaps the name of some region of
the arch namespace?
The arch namespace purports to be a good system for naming
human-constructed artifacts that may evolve over time and relate to
one another (in roughly "branching and merging" type ways). The set
of valid content-types is one example of such a class of artifacts.
The question is, so it goes, who is to be master? Who is to "own"
these standard namespaces, such as that of content-types?
If the answer is "no one", then what is the alternative to a tower of
babble?
Perhaps the answer is, in part, the arch namespace. When used
cooperatively, it allows anyone to declare themselves a unique
"authority" about some mapping of name to value. For whatever
community of reference honors that particular archive registration,
that person therefore *is* an authority, in charge of a shared region
of a cooperatively constructed global namespace.
(And: if the arch namespace is used as the space for "cooperative
standards" -- then its (still young and emerging) quasi-algebra for
branching and merging enables the possibility of "dual citizenship"
between otherwise unlinked communities of cooperation.)
Therefore, the arch namespace is an interesting alternative to
IETF-governed namespaces. It's a political question: which approach
is better? Or, better still, can they be usefully combined?
-t
[I'm in a rush with lots to do, so I'll just say I haven't read the
rest (sorry -- dropped packet) but liked those first couple of points
and wanted to get my licks in on this topic.]
> C) "Auto-Filter" should be a mandatory metadata boolean attribute.
>
> D) There should be a filter/plugin architecture to enable a transcoding of
> files on input and output based on their content-types and user settings
> and user-provided parameters.
>
> E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be
> provided by plugins mapped to content-types.
>
> F) Commit comments and other string attributes should use UTF-8.
>
> G) Filenames and paths should use UTF-8 in the repository, and be transcoded
> to the proper encoding when a client accesses the local file system.
>
>
> Notes:
>
> A) There are already some mandatory metadata associated with each file. One
> such attribute is the name of the file.
>
> B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
> All text/* types may include the "charset" parameter (MIME defines "charset"
> as "character encoding" and not just as a simple character set), and if
> absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986 as 8 bits/char
> with the most significant bit set to 0 (zero)"), as per RFC 2046.
> This is a very common and established standard used in many different
> systems including, but not limited to, file managers, http and email.
>
> C) If Auto-Filter is set to "true" then content transcoding will occur
> between the repository and the local system. If it is set to "false" then
> no transcoding is done.
> Each project may have its own default Auto-Filter values for different file
> types.
>
> D) Since editors and other programmers' tools tend to use whatever the local
> system encoding happens to be and a project might include people with
> different systems there needs to be some transcoding of most text files.
> The contents of files whose "Auto-Filter" attribute is set to "true" will be
> stored UTF-8 encoded with U+2028 newlines in the repository and transcoded
> from/to the local encoding and local newlines on input/output. The contents
> of files whose "Auto-Filter" attribute is set to "false" will not be
> transcoded on input/output.
> Often the proper local encoding and line breaks can be detected
> automatically, but the user should be able to override the auto-detection
> in his settings and/or by a parameter to the cm client.
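The round trip described in note D can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `to_repository` and `to_local` are hypothetical, and the local encoding and newline convention are passed in explicitly rather than auto-detected.

```python
# Hypothetical sketch of note D: files with Auto-Filter "true" are stored
# as UTF-8 with U+2028 line separators and transcoded on input/output.
LINE_SEPARATOR = "\u2028"


def to_repository(data: bytes, local_encoding: str) -> bytes:
    """Transcode local file contents to the proposed repository form."""
    text = data.decode(local_encoding)
    # Normalise CRLF and bare CR to LF first, then map LF to U+2028.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", LINE_SEPARATOR).encode("utf-8")


def to_local(data: bytes, local_encoding: str, newline: str) -> bytes:
    """Transcode repository contents back to the local encoding and newlines."""
    text = data.decode("utf-8").replace(LINE_SEPARATOR, newline)
    return text.encode(local_encoding)
```

A file whose Auto-Filter attribute is "false" would simply bypass both functions and be stored byte for byte.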
>
> E) E.g. if two files with the content-type "application/vnd.sun.xml.writer"
> are diffed the system should use a diff plugin that knows how to interpret
> OpenOffice.org Writer documents. If no such plugin is found it defaults to
> the standard diff which regards the files as byte blobs.
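The lookup-with-fallback in note E might look something like the sketch below. The registry `DIFF_PLUGINS` and the functions `text_diff`, `byte_diff` and `diff_for` are hypothetical names invented for illustration; nothing here is an existing arch interface.

```python
import difflib


def byte_diff(a: bytes, b: bytes) -> str:
    """Fallback diff: treat both files as opaque byte blobs."""
    return "identical" if a == b else "binary files differ"


def text_diff(a: bytes, b: bytes) -> str:
    """Example plugin: line-based unified diff of UTF-8 text."""
    lines = difflib.unified_diff(
        a.decode("utf-8").splitlines(),
        b.decode("utf-8").splitlines(),
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical registry mapping a content-type to its diff plugin.
DIFF_PLUGINS = {"text/plain": text_diff}


def diff_for(content_type: str):
    """Pick the plugin for a content-type, ignoring parameters such as charset."""
    base = content_type.split(";", 1)[0].strip().lower()
    return DIFF_PLUGINS.get(base, byte_diff)
```

With no plugin registered for "application/vnd.sun.xml.writer", `diff_for` returns the byte-blob fallback, which is the default behaviour the note asks for.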
>
> F) UTF-8 should be used for communication between the client and the server.
> Internally the server might store the strings in any encoding it wants in
> the repository, but I'd recommend keeping them UTF-8 encoded for simplicity
> and consistency.
>
> G) Each character in a file name/path not possible to transcode to the
> target file system encoding should be replaced with the character sequence
> "{uN}" where N is the hexadecimal Unicode code point (e.g. a file named
> "hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This
> results in the limitation that filenames must not contain a character
> sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}".
> Whenever a filename or path is used in a URI the UTF-8 bytes should be
> properly URI-encoded.
> Often the proper local encoding can be detected automatically, but the user
> should be able to override the auto-detection in his settings and/or by a
> parameter to the cm client.
> Internally the server might store the strings in any encoding it wants in
> the repository, but I'd recommend keeping them UTF-8 encoded for simplicity
> and consistency.
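A minimal sketch of the "{uN}" escaping in note G, assuming a hypothetical `forbidden` parameter to cover characters (such as "<" and ">" on Windows) that the target encoding can represent but the file system rejects. The function names are made up for illustration.

```python
import re
from urllib.parse import quote

# The pattern from note G; repository names must not already match it.
ESCAPE_RE = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def escape_name(name: str, encoding: str, forbidden: str = "") -> str:
    """Replace each character the target cannot hold with "{uN}" (N in hex)."""
    out = []
    for ch in name:
        try:
            ch.encode(encoding)
            usable = ch not in forbidden
        except UnicodeEncodeError:
            usable = False
        out.append(ch if usable else "{u%X}" % ord(ch))
    return "".join(out)


def unescape_name(name: str) -> str:
    """Reverse escape_name, turning "{uN}" back into the original character."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), name)


def uri_path(name: str) -> str:
    """Percent-encode the UTF-8 bytes of a repository path for use in a URI."""
    return quote(name, safe="/")
```

For example, `escape_name("hello<>world", "cp1252", forbidden='<>:"/\\|?*')` gives the "hello{u3C}{u3E}world" name from the note, and `unescape_name` maps it back.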
>
>
> Notice that there is no distinction between "text files" and "binary files".
> The same system that converts between different text encodings might just
> as well be used to convert between different "raw" audio formats. Just add
> the appropriate plugin/filter and you're set.
>
>
> - Marcus Sundman