[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[GNUnet-developers] Distributed Keyword Searching

From: Steven Barker
Subject: [GNUnet-developers] Distributed Keyword Searching
Date: Mon, 14 Jul 2003 19:55:53 -0400
User-agent: Mutt/1.5.4i

Well, in my previous bunch of ideas I gave up on distributed keyword
searching because of the problem of identifying RBlocks encrypted with
different keywords that point to the same data.  However, I've thought
of how a way to solve that problem, which is detailed below.

Anyway, here's the newly complete writeup about distributed searching:

* Distributed boolean keyword searches.

It should be possible to search AFS in a distributed manner for any
combination of keywords using AND, OR and NOT boolean operators.  This
would let you do a searches like "GNUnet AND 0.5.4a AND NOT rpm" or
"anime OR (Japan AND animation)".  Of course some searches that would
obviously return infinate or empty sets should be excluded.

I don't think think this would be hard to implement.  A boolean Query
message would simply describe the boolean expression in a compact
format. One way would be to rearrange the expression so that all the
NOT operators occur directly on keywords (i.e. "NOT (foo AND bar)"
would become "NOT foo OR NOT bar").  The expression could then be seen
as a binary tree, where the leaf nodes are the (possibly negated)
keywords and all the other nodes are ANDs and ORs.  The representation
of the expression in the boolean Query would be a pre-order traversal
of this tree (you could also describe this as putting the expression
in "reverse-polish" notation: "(a AND b) OR (c AND d)" would become
"OR AND a b AND c d").  An expression with N (possibly negated)
keywords would have N leaf nodes (the keywords) and N-1 branch nodes
(the ANDs and ORs).  If we use a single byte to identify the type of a
node in our pre-order traversal, we can write an expression with N
keywords in 22N-1 bytes (2N-1 1-byte node identifiers and N 20-byte
keyword hashes).  A boolean Query could even use the normal Query
message type (with upto 47 keywords per query, if my math is right),
and rely on recievers to notice the unusual (odd number) size to know
it's a boolean Query.

Replies to boolean Queries would simply be one 3HASH reply per
non-negated keyword, per data match.  A node that recieves a Query
could decide to distribute the work, rather than trying to find the
results entirely on it's own.  It would simply split off one or more
subtrees of the boolean expression and send them to neighboring
nodes.  When results come back, it would check them against the part
of the tree that it kept and finally send any results that matched
the whole expression back to the originator.

The issue that made me skip posting this plan with my previous list
was how to check if RBlocks encrypted with different keywords pointed
to the same data.  This can be solved if we allow some RBlocks to have
an unencrypted "FileIdentifier" field.  This could either be done by
rearranging the current RBlock header so as to make encrypting just
part of it possible, or the FileIdentifier could simply be repeated in
the unencrypted (currently padding) portion of the block.

I would like to claim that allowing nodes to see what data is pointed
to by some RBlocks will not make censorship easier.  This is because
the node still does not know what keyword that RBlock is indexed by.
A dictionary attack could find RBlocks for certain keys, but that can
be done to fully encrypted RBlocks already.  In fact, the only
intersting thing I can imagine an AFS node doing with a semi-encrypted
RBlock is deciding to drop a reference if the data is points to cannot
be found for an extended period of time.  I hope someone will point
out any attacks I'm not seeing.

Steven Barker                                      address@hidden
  Excess on occasion is exhilarating.  It prevents moderation from
  acquiring the deadening effect of a habit.
                -- W. Somerset Maugham
Get my GnuPG public key at:
Fingerprint: 272A 3EC8 52CE F22B F745  775E 5292 F743 EBD5 936B

Attachment: pgpnHiJZgcaAD.pgp
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]