gnunet-developers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[GNUnet-developers] DistribNet and GNUNet Update


From: Kevin Atkinson
Subject: [GNUnet-developers] DistribNet and GNUNet Update
Date: Wed, 15 May 2002 00:14:06 -0400 (EDT)

I have read all the GNUNet papers and you document everything very nicely 
except how the hell you are suppose to find anything on the GNUNet!

A few comments:

The idea of micro payments is a interesting one however, I think using 
some sort of function with a limiting behavior rather than a constantly 
increasing function would be better.  As you have it now I fear that it may 
be imposable for new nodes to ever compete with nodes that have been on 
the network for a long time.

The assumption that a node can fulfill a request for free is not always 
valid, because it may be that the bandwidth is metered rather than being a 
fixed pipe so any network traffic will cost the node.

Now your turn to comment, here is the latest copy of the DistribNet 
design document.  I have worked some of the details of how keys are stored 
and requests for data are made.


DistribNet

Kevin Atkinson (kevin at atkinson dhs org)

Project Page: [http://distribnet.sourceforge.net/]
Mailing list: [http://lists.sourceforge.net/lists/listinfo/distribnet-devel]

Abstract

A global peer-to-peer Internet file system in which anyone
can tap into or add content to.

1 Overview

1.1 Meta Goals

* To allow anyone, possibly anonymously, to publish web sites
  with out having to pay for the bandwidth for a commercial
  provider or having to put up with the increasingly ad
  ridden free web sites. The only thing the author of the
  web site should have to worry about is the contents of
  the web site itself.

* Bring back the sense of community on the Internet that
  was once present before the Internet become so commercialized.

* Serve as an efficient replacement for current file sharing
  networks such as Morpheus and Gnutella.

* To have the network stable and working before some Commercial
  company designs a propitiatory network similar to what
  I envision that can only be accesses via freely available
  but not FSF approved free license.

1.2 (Possibly Impossible) Goals

* Really fast lookup to find data. The worst case should
  be O(log(n)) and the average case should be O(1) or very
  close to it.

* Actually retrieving the data should also be really fast.
  Popular data should be sitting on the same subnet. On
  average it should be as fast or faster than a typical
  web site (such as slashdot, Google, etc.). It should make
  effective use of the topology of the Internet to to minimize
  network load and maximize performance.

* General searching based on keywords will be build into
  the protocol from the beginning. The searching faculty
  will be designed in such a way to make message boards
  trivial to implement.

* Ability to update data while keeping old revisions around
  so data never disappears until it is truly unwanted. No
  one person will have the power to delete data once it
  spreads throughout the network.

* Will try very hard to keep all but the most unpopular content
  from falling off the network. Basically before deleting
  a locally unpopular key it will first check if other nodes
  are storing the key and how popular they find the key.
  If not enough nodes are storing the key and there is any
  indication that the data may be useful at a latter date
  it will not delete it unless it absolutely has to. And
  if it does delete it it will first try uploading it to
  other nodes with more disk space available.

* Ability to store data indefinitely if someone is willing
  to provide the space for it (and being able to find that
  data in log(n) time).

* Extremely robust so that the only way to kill the network
  is to disable almost all of the nodes. The network should
  still function even if say 90% of it goes down.

* Extremely effect CPU-wise so that a fully functional node
  can run in the background and only take 1-2% of the CPU.

1.3 Applications

I would like the protocol to be able to effectually support
(ie with out any ugly hacks that many of the application
for Freenet use)

* Efficient Web like sites (with HTTP gateway to make browsing
  easy)

* Efficient sharing of files large and small.

* Public message forms (with IMAP gateway to make reading
  easy)

* Private Email (the message will off cource be encrypted
  so only the intended recipient can read it, again with
  IMAP gateway)

And maybe:

* Streaming Media

* Online Chat (with possible IRC or similar gateway)

1.4 Anti-Goals

Also see philosophy for why I don't find these issues that
important

* Complete anonymity for the browser. I want to focus first
  on performance than on anonymity. In fact I plan to use
  extensive logging in the development versions so that
  I track network performance and quickly cache performance
  bugs. As DistribNet stabilizes anonymity will be improved
  at the expense of logging.

  The initial version will only use cryptology when absolutely
  necessary (for example key signing). Most communications
  will be done in the clear. After DistribNet stabilizes
  encryption will slowly be added. When I add encryption
  I will carefully monitor the effect it has on CPU load
  and if proves to be expensive I will allow it to be optional. 

  Please note that I still wish to allow for anonymous posting
  of content. However, without encryption, it probably won't
  be as anonymous as Freenet or your GNUNet.

* Data in the cache will be stored in a straight forward
  manner. No attempt will be made to prevent the node operate
  from knowing what is in his own cache. Also, by default,
  very little attempt will be made to prevent others from
  knowing what is a particular node cache.

1.5 Philosophy

* I have nothing against complete anonymity, it is just that
  I am afraid that both Freenet and GNUNet or more designed
  around the anonymity and privacy issues then they are
  around the performance and scalability issues.

* For most type of things the level of anonymity that Freenet
  and GNUNet offers is simply not needed. Even for copyrighted
  and censored material there is, in general, little risk
  in actually viewing the information because it is simply
  impractical to go after every single person who access
  forbidden information. Most all of the time the lawsuits
  and such are after the original distributors of the information
  and not the viewers. There for DistribNet will aim to
  provide anonymity for distributing information, but not
  for actually viewing it. However, since there is some
  information where even viewing it is extremely risky,
  DistribNet will eventually be able to provide the same
  level of anonymity that Freenet or GNUNet offers, but
  it will be completely optional.

* I also believe that knowing what is in one owns datastore
  and being able to block certain type of material from
  one owns node is not that big of a deal. Unless almost
  everyone blocks a certain type of information the availability
  of blocked information will not be harmed. This is because
  even if 90% of the nodes block say, kiddie porn, the information
  will still be available on the other 10% of the nodes
  which, if the network is designed correctly, should be
  more than enough for anyone to get at blocked information.
  Furthermore, since the source code for DistribNet will
  be protected under the GPL or similar license, it will
  be completely impractical for other to force a significant
  number of nodes to block information. Due to the dynamic
  nature of the cache I find it legally difficult to hold
  anyone responsible for the contents of there cache as
  it is constantly changing.

2 DistribNet Architecture

Two types of keys: Map and Data keys. Maps keys are hashed
based on there identification and can be updated, Data keys
are hashed based on there content and consequently can not
be updated.

There will be three type of storage of keys, Permanent, Cache,
and Pointers. Permanent keys will be used to ensure the
available of content, the cache will be used exactly like
a typical cache will be used, and pointers will be used
to be be able to find content.

2.1 Key Types

There will essentially be two types of keys. Map keys and
data keys. Map keys will be uniquely identified in a similar
manner as freenet SSK keys. Data keys will be identified
in a similar manner as freenet's CHK keys.

Map keys will contain the following information:

* Short Description

* Public Namespace Key 

* Timestamped Index pointers

* Timestamped Data pointers

At any given point in time each map key will only be associated
with one index pointer and one data pointer. Map keys can
be updated by appending a new index or data pointer to the
existing list. By default, when a map key is queried only
the most recent pointer will be returned. However, older
pointers are still there and may be retrieved by specifying
a specific date. Thus, map keys may be updated, but information
is never lost or overwritten.

Data keys will be very much like freenet's CHK keys except
that they will not be encrypted. Since they are not encrypted
delta compression may be used to save space.

There will not be anything like freenet's KSK keys as those
proved to be completely insure. Instead Map keys may be
requested with out a signature. If there is more than one
map key by that name than a list of keys is presented sorted
by popularity. To make such a list meaning full every public
key in freenet will have a descriptive string associated
with it.

2.1.1 Data Key Details

Data keys will be stored in maximum size blocks of just under
32K. If an object is larger than 32K it will be broken down
into smaller size chunks and an index block, also with a
maximum size of about 32K, will be created so that the final
object can be reassembled. If an object is too big to be
indexed by one index block the index blocks themselves will
be split up. This can be done as many times as necessary
therefore providing the ability to store files of arbitrary
size. DistribNet will use 64 bit integers to store the file
size therefore supporting file sizes up to 2^64-1 bytes.

Data keys will be retrieved by blocks rather than all at
once. When a client first requests a data key that is too
large to fit in a block an index block will be returned.
It is then up the client to figure out how to retrieve the
individual blocks. 

Please note that even though that blocks are retrieved individually
they are not treated as truly independent keys by the nodes.
For example a node can be asked which blocks it has based
on a given index block rather than having to ask for each
and every data block. Also, nodes maintain persistent connections
so that blocks can be retrieved one after another without
having to re-establish to connection each time.

Data and index blocks will be indexed based on the SHA-1
hash of there contents. The exact numbers of as follows:

+------------------------------------------+--------------------+
| Data Block Size:                         | 2^15 - 128 = 32640 |
+------------------------------------------+--------------------+
| Index block header size:                 | 40                 |
+------------------------------------------+--------------------+
| Maximum number of keys per index block:  | 1630               |
+------------------------------------------+--------------------+
| Key Size:                                | 20                 |
+------------------------------------------+--------------------+


Maximum object sizes:

direct   => 2^14.99 bytes , about 31.9 kilo

1 level  => 2^25.66 bytes , about 50.7 megs

2 levels => 2^36.34 bytes , about 80.8 gigs

3 levels => 2^47.01 bytes , about 129 tera

4 levels => 2^57.68 bytes

5 levels => 2^68.35 bytes (but limited to 2^64 - 1)

Why 32640?

A block size of just under 32K was chosen because I wanted
a size which will allow most text files to fit in one block,
most other files with one level of indexing, and just about
anything anybody would think of transferring on a public
network in two levels and 32K worked out perfectly. Also,
files around 32K are rather rare therefor preventing a lot
of of unnecessary splitting of files that don't quite make
it. 32640 rather than exactly 32K was chosen to allow some
additional information to be transfered with the block without
pushing the total size over 32K. 32640 can also be stored
nicely in a 16 bit integer without having to worry if its
signed or unsigned.

2.2 Storage

Permanent keys will be distributed essentially randomly.
However, to insure availability the network will insure
at least N nodes contain the data. Nodes which are responsible
for maintaining a permanent key will know about all the
other nodes on the network with are also responsible for
that key. From time to time it will check up on the other
nodes to make sure they are still live and if less than
N-1 other nodes are live it will pick another node to to
ask to maintain a copy of the key. It will first try nodes
which already have the key in its cache and if they all
refuse or none of them do. It will chose a random node to
ask and will keep trying until some node accepts or one
the original nodes becomes live again.

Cached keys will be DistribNet based on where it will do
the most good performance wise. How cache keys will be managed
is still undecided. For the first implementation it will
likely be stored on the nodes which have previously requested
the key.

Pointer keys will basically be distributed based on the numeric
distance of the hash of the key from the hash of the node's
identification. Since pointer keys contain very little data
they will be an extremely large amount of redundancy. Pointer
keys will contain two types of pointers. Pointers to permanent
keys and pointers to permanent keys. Pointer keys on different
nodes will all contain the same permanent pointers but will
only contain pointers to cached keys to nodes which or near
by. There will be an upper limit to the number of pointers
within an pointer key any one node will have.

3 Retrieval of Data

When a node A wants to retrieve a key K it first has figure
out where it is. To do this it will contact node B which
will in tern contact C etc, until an answer is found which
for the sake of argument will be node E. Node E will then
send a list of possible nodes L which contain key K directly
to node A. Node E will then send the result to node D, which
will send it to C, etc. Node E will also add node A to list
L with probability of say 10%, Node D will do the same but
with a probability of say 25%, etc. This will avoid the
problem having the list L becomes extremely large for popular
data but allow nodes close to A to discover that A has the
data since nodes close to A will likely contact the same
nodes that A tried. Since A requested the location of key
K it is assumed that K will will likely download the data.
If this assumption is false than node A will simply be removed
at the list latter on.

Once A retrieves the list it will pick a node from the list
L based on some evaluation function, lets say it picks node
X. Node X will then return the key to node A. The evaluation
function will take several factors, into account, including
distance, download speed, past reputation, and if node A
even knows anything about the new node.

If node X does not send the key back to node A for what ever
reason it will remove node X from the list and try again.
It will also send this information to node B so it can consider
removing node X from its list, it will then in term notify
node C of the information, etc. If the key is an index block
it will also send information of what parts of the complete
key node X has. If the key is not an index block than node
a is done.

If the key is an index block than node A will start downloading
the sub-blocks of key K that node X has. At the same time,
if the key is large or node X does not contain all the sub-blocks
of K, node X will chose another node from the list to contact,
and possible other nodes depending on the size of the file.
It will then download other parts of the file from the other
nodes. Which blocks are download from which nodes will chance
based on the download speed of the nodes so that more blocks
are download from faster nodes and less from slower, thus
allowing the data to be transfered in the least amount of
time. If after contacting a certain number of nodes there
are still parts of the key that are not available on any
of those nodes, node A will perform a separate query for
the individual blocks. However, I image, in practice this
will rarely be necessary.

3.1 Distance determination

One very course estimate for node distance would be to take
the numerical distance between two nodes ip address since
nodes closer to easy other numerically are likely to be
share the same gateways and nodes really close are likely
to be on the same subnet. 

Another way to estimate node distance releases on the the
fact that node distance, for the most part, obeys the triangle
inequality. For each node in the list of candidate nodes
some information about the estimated distance between that
node, node E, in the list and the node storing the list
is maintained by some means. For node A to estimate the
distance between a node on the list, node X, and itself
all it has to do is combine the distance between it and
E with the distance between E and X. The combination function
will depend on the aspect of distance that is being measured.
For the number of hops it will simply add them, for download
speed it will take the maximum, etc.

4 Limitations

Because they is no indirection when retrieving data most
of the data on any particular node would be data that a
local node user requested at some point in time. This means
that it is fairly easy to tell what which keys a particular
user requested. Although complete anonymity for the browser
is one of my anti-goals this is going a bit to far. One
solution for this is to do something similar that GNUNet
does which is described in .

It is also blatantly obvious which nodes have which keys.
Although I do not see this as a major problem, especially
if a solution for the first problem is found, it is something
to consider. I will be more than happy to entertain solutions
to this problem, provided that it doesn't effect effectually
that much.

5 Implementation Details

An implementation for DistribNet is available at 
[http://distribnet.sourceforge.net/]. 

5.1 Physical Storage

Blocks are currently stored in one of three ways

1. block smaller than a fixed threshold (currently 1k) are
  stored using Berkeley DB (version 3.3 or better).

2. blocks larger than the threshold are stored as files. The
  primary reason for doing this is to avoid limiting the
  size of data store by the maximum size of a file which
  is often 2 or 4 GB on most 32-bit systems.

3. blocks are not stored at all, instead they are linked to
  an external file out side of the data store much like
  a symbolic link links to file out side of the current
  directory. However since blocks often only represent part
  of the file the offset is also stored as part of the link.
  These links are stored in the same database that small
  blocks are stored in. Since the external file can easily
  be changed by the user, the SHA-1 hashes will be recomputed
  when the file modification data changes. If the SHA-1
  hash of the block differs all the links to the file will
  be thrown out and the file will be relinked. (This part
  is not implemented yet).

Most of the code for the data keys can be found in data_key.cpp

5.2 Language

DistribNet is/will be written in fairly modern C++. It will
use several external libraries however it will not use any
C++ specific libraries. In particular I have no plan to
use any sort of Abstraction library for POSIX functionally.
Instead thin wrapper classes will be used which I have complete
control over and will serve mainly to make the process of
using POSIX functions less tedious rather than abstract
away the details of using them.

References

Krista Bennett and Christian Grothoff. "GNUnet
-- anonymity for free". [http://gecko.cs.purdue.edu/GNUnet/papers.php3]

-- 
http://kevin.atkinson.dhs.org





reply via email to

[Prev in Thread] Current Thread [Next in Thread]