Re: [Gzz] Re: the Storm article
From: Benja Fallenstein
Subject: Re: [Gzz] Re: the Storm article
Date: Fri, 07 Mar 2003 18:08:38 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021226 Debian/1.2.1-9
Alatalo Toni wrote:
On Thu, 6 Mar 2003, Eric Armstrong wrote:
First, I liked the article. A lot. I'm looking
forward to using Storm, and so is Eugene (only
took a 30 second presentation to get him
interested).
Glad to hear it, and thank you very much for the detailed comments!
Me too :)
I even started a review article "Taking the World
by Storm", for publication in some broad-interest
journal. (Not sure which one, though.)
Cool!
Next, specifics.
Thanks for the comments. I'll also answer questions here, so that you
don't need to wait for the next version of the article to get them
answered :)
Missing Ingredients
-------------------
These are things that need to be addressed in the
article, however briefly, but are not currently
mentioned:
* How are collisions handled?
(Surely some small blocks must produce the
same cryptographic hash as other small blocks,
sometimes.)
AFAIK it should not happen, but it is theoretically possible, so I guess
it's a good question :)
We assume that it doesn't happen.
a) It's extremely unlikely (AFAIK you'd need about 2^80 blocks in a
single lookup system to find a collision by chance).
b) A supercomputer or distributed.net effort dedicated to finding a
collision by brute force seems more likely to find one, because it would
dedicate all its computation time to this, instead of only producing a
few blocks per hour(?) per system.
But once a supercomputer or distributed.net effort able to find a
collision by brute force becomes feasible, the hash function isn't
secure any more anyway.
I trust the experts (i.e. the cryptographers, e.g. those designing these
hash functions) to evaluate whether such attacks are feasible yet.
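To make the content-addressing concrete, here is a minimal sketch of
deriving a block ID from a block's bytes. SHA-1 and all names here are
illustrative assumptions, not Storm's actual URN format; the point is
only that equal content yields an equal ID, so a "collision" would mean
two different blocks hashing to the same value:

```java
import java.security.MessageDigest;

public class BlockId {
    // A block's ID is a cryptographic hash of its content: equal
    // content always yields an equal ID, so IDs are
    // location-independent. (Illustrative sketch only.)
    public static String id(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```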
* How are docs hashed? I didn't see a discussion of that.
Versions of docs are blocks and blocks are hashed by content. Does that
answer the question?
* What is the project storage impact?
(Maybe only "publish" material goes into the system,
or maybe storage is cheap and growing cheaper so
we don't really care, but it needs to be mentioned.)
Again not sure what you mean, sorry. How would it be different from
classical file system storage? (Except that we need to store media files
like images only once even when they're used in different documents on
the same computer...)
If you refer to storing past versions, I understand. This is a general
problem with versioned storage. We use the diff scheme to limit the
storage needed there. Also we allow deleting past versions :-)
* What language is it written in?
(Or do I care? If it really is like a "file system",
maybe I really don't?)
Mostly Java; otherwise ex-Gzz (now Fenfire) has also been written in
Python (Jython, at least for tests, demos and clients) and C++ (the
OpenGL graphics API) .. but all Storm code I've seen is Java.
Currently Storm is in Java. If others want to adopt it, we'd welcome the
contribution of implementations in other languages, obviously.
* If there really is a "file system" gui, that's still
going to be different from a shell, because I won't
be able to launch any existing editors, will I? They'll
need to write new files, not rewrite old ones -- and
they'll need to understand blocks and transclusions.
You could also implement something like CVS on top of Storm: 'check out'
files into a normal tree, edit them there, 'commit' into a 'repository'
built from Storm blocks. Other than that, yeah.
(I also wouldn't want to be limited to hierarchy...)
* Short description of "Structured overlay networks".
What they do, what they accomplish. (paragraph or two)
They are a type of peer-to-peer network; 'overlay' refers to how e.g.
Gnutella and Freenet are layered over the Internet.
Hermanni, do you have a good reference?
* Short description of gzz and its relationship to Storm
this all must be updated to the current Fenfire status
Yep.
Sequenced Comments
------------------
Thoughts and questions that occurred to me as I read.
Abstract
* Very cool. location-independent globally unique identifiers,
append-and-delete only storage, and peer-to-peer networking.
very, very cool.
:-)
* The two major issues addressed are mentioned here: dangling
(unresolved) links and keeping track of alternative versions.
These deserve to be mentioned in the abstract.
Right...
Related Work
* It's not totally clear what the relationship of the related
work is to the current project. Do the systems described
represent old work you've moved beyond, old work that
provided useful lessons (what lessons?), a foundation for
the current work (what parts?), predecessors or clients of
the current work.
Good point. The hypertext part represents old work we've moved beyond,
benefiting from p2p research-- because those systems assume that
location-independent identifiers cannot be resolved globally, they have
to use complicated/limited schemes to guarantee backlink integrity.
The p2p part is what we build upon-- we'll use distributed hashtables to
implement Storm block lookup on the public internet.
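As a rough illustration of what block lookup over a distributed
hashtable provides, here is a toy, single-process stand-in. The API is
hypothetical; a real DHT distributes this map across the participating
nodes, but the contract is the same: peers publish which block IDs they
hold, and anyone can ask who holds a given ID:

```java
import java.util.*;

// Toy stand-in for a distributed hashtable mapping
// block ID -> set of peers advertising that block.
// (Hypothetical API; real DHTs like Chord or CAN spread
// this map over many nodes.)
public class BlockLookup {
    private final Map<String, Set<String>> dht = new HashMap<>();

    // A peer announces that it can serve the given block.
    public void publish(String blockId, String peer) {
        dht.computeIfAbsent(blockId, k -> new HashSet<>()).add(peer);
    }

    // Find the peers that claim to hold the block.
    public Set<String> locate(String blockId) {
        return dht.getOrDefault(blockId, Collections.emptySet());
    }
}
```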
The p2p hypermedia part is similar work-- not really an alternative, or
superseded by us, or anything, just somewhat similar ideas.
* Mention gzz here, and its relationship to Storm (i.e. gzz
refactored to create Storm as an independent module.)
Ok.
Peer-to-Peer Systems
* Mentions a proposal for a common API usable by DHT systems,
but it's not clear if you plan to build on that, or if it
is a rival, or a predecessor.
We hope to build on it, when implementations become available for Java.
* Hmmm. Probabilistic access seems reasonable for "contact"
scenarios (bunch of people together at a meeting), but not
for "publishing" scenarios (publish document on the web).
May be worth drawing the distinction here.
Yep.
Overview of Xanalogical Storage
* This threw me. A minute ago we were talking about blocks,
now we're talking about characters. Needs a transition to
make the relationship apparent. (Later, you talk about
spans. Those may be precursors to blocks or they really are
blocks. I'm not sure which. Need to anticipate that thought
somehow, and tell how we're building up to it, if that's
what's going on.)
* Yeah. There's the paragraphs on spans. That threw me, too.
Suddenly I had gone from blocks to characters and now to
spans, and I was pretty confused about how they related.
* "Our current implementation" has me wondering what we're
talking about. At this point, I thought this more "Related
work", like "peer to peer systems". But now it seems it's
all one system? Or was this a previous system, before you
started working on Storm? (Need to make the relationships
apparent.)
All this makes me think we should give "Xanadu" a section in "Related
Work," and then later explain how Storm implements xanalogical storage
in a different way than Project Xanadu did.
Storm Block Storage
* Now we're back to blocks. Why did that last section exist,
anyway? (make the relationship apparent)
:-)
* "caching becomes trivial, because it is never necessary to
check for new versions of blocks". Hmm. This sounds like
versioning isn't supported, which seems like a weakness.
I know that telling a reviewer "but we said this" is a no-no, since if
you have to say so, you apparently didn't say it well enough :) but in
this case I must ask: The first paragraph of that section ends with,
"Mutable data structures are built on top of the immutable blocks (see
Section 6)." Any ideas on how to make explicit that we'll get to
versioning later on?
(Talking about versioning first wouldn't work, since we can only explain
our approach there after having explained block storage first...)
* Interesting. There is a need for "anonymous caching". That
allows replication, while resolving the privacy concern.
Yep.
* A block is hashed. Ok. And a doc contains pointers to blocks.
Ok. But is a doc a block? How is it hashed? How do links
contribute to the hash?
Each version of a doc is a block... Links: Depends on how you make them
(i.e., the format of the document): If they are inline, as in HTML, they
contribute to the hash. If they are external-- anybody can contribute
links by putting them in another block-- they do not.
In both XLink and Xanadu, links can be either inside a document (which
gives them additional credibility-- e.g. the user should be able to
select 'view only links contained in the document') or external.
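A small sketch of that distinction (hypothetical names; the hash helper
stands in for Storm's block IDs): an inline link is part of the
document's bytes and so changes its content hash, while an external
link lives in a separate block and leaves the document's hash untouched:

```java
import java.security.MessageDigest;

public class LinkHashing {
    // Content hash standing in for a block ID (illustrative only).
    static String hash(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes("UTF-8")))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "Some document text.";
        // Inline link: part of the document's content, so the
        // document's hash changes.
        String docWithInlineLink = doc + " <a href='target'>link</a>";
        System.out.println(hash(doc).equals(hash(docWithInlineLink))); // false
        // External link: a *new* block refers to the document by its
        // hash; the document block itself is untouched.
        String externalLinkBlock = "link: " + hash(doc) + " -> target";
        System.out.println(hash(doc)); // unchanged by the external link
    }
}
```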
* Gzz is first mentioned here. It needs to be described earlier
in the Xanalogical addressing section.
Probably we should move the xu section after the block storage section,
actually... reducing the back-and-forth.
* "Storm was first developed for the Gzz application, a platform
explicitly developed to overcome the limitations of traditional
file-based applications" -- a *very* intriguing statement.
When Gzz is introduced, this statement needs to be expanded to
provide a short list of those limitations, and what Gzz did to
solve them. (It has to be very short, of course -- no mean feat.)
Challenging. :-) But you're right.
* "UI conventions for listing, moving, and deleting blocks"
I don't know. That sounds wrong to me. Blocks should be
under the covers, and I should be dealing with docs. Ok,
so I have an outline-browser (for example, ADM's) or a
similar editor. Internally, blocks are moved around when I
edit. But my access is always thru a "Doc" -- otherwise I'll
be looking at blocks that are divorced from any context whatever.
You're right.
Application-Specific Reverse-Indexing
* This lost me pretty quickly. I wasn't sure what the purpose
of this section was. I needed a use case or two to keep me
oriented. Later, it becomes clear that this is
a part of the versioning solution. Mention that fact here.
If possible, also give one or more examples of the other
indexing systems you created, to show what this section is
for.
Maybe this section should switch places with the versioning one.
* keyword searching
--it seemed to me that a keyword index would return every
*version* of a block that contained the word, which would
be a real weakness.
--(maybe versioning needs to be described first, so you can
discuss the indexing process in context, and mention the
resolutions for such issues?)
Ok, another reason. :-)
BTW, my take would be that the indexing would indeed return every
version of a document ('version of a block' doesn't exist since blocks
are immutable :) ). The UI would then sort out which versions are
'current' and show only those. This would also allow searching in past
versions, when desired.
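That filtering step could be sketched like this (illustrative names,
not Storm's actual index code): the index returns every version block
containing the word, and the caller intersects the result with the set
of versions it considers 'current':

```java
import java.util.*;

// Sketch: a keyword index over immutable version blocks. The index
// returns every version containing the word; the UI then intersects
// the hits with the set of 'current' versions. (Illustrative only.)
public class KeywordIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Index one version block's text under each word it contains.
    public void add(String blockId, String text) {
        for (String word : text.toLowerCase().split("\\s+"))
            index.computeIfAbsent(word, k -> new HashSet<>()).add(blockId);
    }

    // Return matching versions, restricted to the 'current' set.
    // Passing the set of all versions instead would search history.
    public Set<String> search(String word, Set<String> currentVersions) {
        Set<String> hits = new HashSet<>(
            index.getOrDefault(word.toLowerCase(), Collections.emptySet()));
        hits.retainAll(currentVersions);
        return hits;
    }
}
```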
Versioning
* Aha! I read the paper over several days, and so much water
went under the dam that I had forgotten this was mentioned
at the beginning of the paper.
* "if, on the other hand, two people collaborate..."
VERY nice. Multiple "current version"s are allowed to exist.
That's the only possible way to handle the situation.
* Note 6:
It wasn't clear to me how it knows which pointer blocks are
obsolete.
A pointer block is obsolete if it is on the 'obsoleted' list of any of
the other pointer blocks.
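In code, that rule might look like this sketch (names hypothetical, not
Storm's actual data structures): collect everything any pointer block
declares obsolete, and the remainder are the current candidates:

```java
import java.util.*;

// Sketch of the rule above: a pointer block is obsolete iff some
// other pointer block lists it as obsoleted. (Hypothetical names.)
public class PointerBlocks {
    // obsoletedBy: pointer block ID -> IDs it declares obsolete.
    public static Set<String> obsolete(Map<String, Set<String>> obsoletedBy) {
        Set<String> result = new HashSet<>();
        for (Set<String> ids : obsoletedBy.values())
            result.addAll(ids);
        return result;
    }

    // The current versions are those not obsoleted by anyone; note
    // that several current versions may coexist, as with two
    // collaborators editing independently.
    public static Set<String> current(Map<String, Set<String>> obsoletedBy) {
        Set<String> cur = new HashSet<>(obsoletedBy.keySet());
        cur.removeAll(obsolete(obsoletedBy));
        return cur;
    }
}
```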
* Beautiful statement of points for further research
(authenticating pointer blocks, UI for choosing alternative
versions, suitability for web-like publishing). But the
system looks strong enough to make me *want* to do such
experimentation.
Great!
Diffs
* It wasn't clear if the most recent version was "intact" and
previous versions were stored as diffs. I would hope so,
in general. At least, if there was only one option, that's
the one I'd want. Or can you do it either way?
You can do it either way (unfortunately, we do not have an
implementation keeping the most recent version yet). You would store the
same diffs in both cases, I think, keeping the most recent version only
as a kind of cache.
Not keeping the most recent version has reliability benefits: If your
chain of diffs is broken, you notice when you try to load the current
version (and you've only lost the current version, since you haven't
changed the version before that). If you keep the most recent version,
there may be a problem with one diff you do not notice because you never
look at the diffs, you just load the current version. Now imagine you
save your data again, the old "intact" version is deleted, but creating
the new "intact" version goes awry. You have now lost all data from the
broken diff till the broken full version.
Of course, our code will check whether it can reconstruct the full
version before trusting a diff... but if the requirement for reliability
is especially high, you may want to take the safer route of not storing
an "intact" version.
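A toy version of the reconstruct-and-check idea (names hypothetical,
and appended text stands in for real diffs): the current version is
rebuilt by walking the chain back to a full version, so a broken link
in the chain is noticed at load time rather than silently later:

```java
import java.util.*;

// Toy reconstruction sketch. Each version is stored as a reference to
// its parent plus appended text, instead of in full; a real system
// would use proper text diffs. Loading walks the chain, so a missing
// entry is detected immediately.
public class DiffChain {
    static class Entry {
        final String parent, appended;
        Entry(String parent, String appended) {
            this.parent = parent;
            this.appended = appended;
        }
    }

    private final Map<String, String> fullVersions = new HashMap<>();
    private final Map<String, Entry> diffs = new HashMap<>();

    public void putFull(String id, String content) {
        fullVersions.put(id, content);
    }

    public void putDiff(String id, String parent, String appended) {
        diffs.put(id, new Entry(parent, appended));
    }

    // Rebuild a version by following the diff chain back to a full
    // version; a broken chain fails loudly here.
    public String reconstruct(String id) {
        if (fullVersions.containsKey(id))
            return fullVersions.get(id);
        Entry e = diffs.get(id);
        if (e == null)
            throw new IllegalStateException("broken chain at " + id);
        return reconstruct(e.parent) + e.appended;
    }
}
```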
* Yes. This is the point of the article. Dangling links and
version handling. Definitely belongs in the abstract.
Yep.
* Impact of immutable blocks on media use needs a mention
here. (Maybe just hand-waving, but some mention of the
fact that it's going to cost disk space, in return for
improved ability to do xyz, is needed.)
You mean storage media? Yes. Hey, I actually think we should make the
point here that when you copy an image to another document, or keep
differently edited versions of a movie, Storm stores the content only
once-- and can thus *save* disk space :-)
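A sketch of why content-addressing deduplicates (illustrative, not
Storm's actual store): because the storage key is the content's hash,
putting identical bytes twice, e.g. the same image used in two
documents, occupies only one slot:

```java
import java.security.MessageDigest;
import java.util.*;

// Sketch: a content-addressed block store. Identical content always
// maps to the same key, so it is stored only once. (Illustrative.)
public class BlockStore {
    private final Map<String, byte[]> blocks = new HashMap<>();

    // Store a block; returns its content-derived ID.
    public String put(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content))
                sb.append(String.format("%02x", b));
            String id = sb.toString();
            blocks.put(id, content); // same content, same key: no duplicate
            return id;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public int size() {
        return blocks.size();
    }
}
```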
Conclusions
* Wild. A Zope based on Storm. Or an OHP.
--what's an OHP, anyway? (needs a one-line definition)
--come to think of it, I recognize Zope, but not everyone
will. That needs a one-line explanation, as well.
Right.
* "structured overlay networks such as DHTs"
--I need another paper describing these things, so I can
find what they heck they are and how they work!
We really need a good reference about this. Hermanni?
Bottom Line
-----------
An excellent read, and a most promising technology.
Thanks for sending it to me.
Thank you very much for your comments.
- Benja