[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [GNUnet-developers] Distributed Keyword Searching

From: Steven Barker
Subject: Re: [GNUnet-developers] Distributed Keyword Searching
Date: Tue, 15 Jul 2003 04:39:18 -0400
User-agent: Mutt/1.5.4i

On Tue, Jul 15, 2003 at 04:45:50AM +0200, Christian Grothoff wrote:
> Having a non-encrypted identifier portion in the RBlock introduces security 
> problems since malicious peers can tamper with them.

Hmm, that's true.  However, if the identifier is duplicated (included
in both the encrypted and unencrypted parts), the originating node
could detect the tampering.  We could also allow signatures on RBlocks
(do we already?) that would include the cleartext identifier. All
intervening nodes would then able to notice if the signature becomes
invalid due to tampering.  A malicious node could still put a new,
bogus signature on a tampered RBlock, but the originating node would
not only detect the tampering, it could also decide to distrust
anything else signed by that key (and maybe publish its distrust to
other nodes).

It's might also be possible to do a distributed search on SBlock
keywords, even though there's no unused space in the SBlock to add a
plaintext copy of the FileIdentifier (so the only copy of the
FileIdentifier would have to be plaintext).  It would only work for
SBlocks that point to directories: valid directories would have to
contain complete plaintext copies of all the SBlocks that point to
them.  That would make tampering evident, since a malicious node can't
decrypt the SBlock it's tampering with, and so any directory it
changes the FileIdentifier to will not have the decrypted SBlock.  The
node originating the search can download the directory and verify that
the SBlock that directed it there was decryptable by the creator of
the directory.

>                                              Worse, you do not 
> specify how you would union the RBlocks over the network (I may have file X 
> under keyword A and you have file X under keyword B. How do you put these two 
> results together such that X does not match "(A OR B) AND NOT (A AND B)"?

Well, you made me think about this a bit and I think that it would
only ever be worth distributing query's rooted with ANDs.  My idea of
using DeMorgan's law to move all the NOTs out to the leaf nodes won't
work.  Instead NOT's can only come immediately after an AND, and must
apply to either OR or a single keyword (it might make sense to flatten
nested ORs too, and make the single keyword case just an OR with only
one value).  So your example is not fully distributedable.  In fact, a
smart impementation should notice that it already has to request A and
B seperately to solve the undestributable "A OR B" and so it can solve
"A AND B" without any additional queries.  Given a fully distributable
Query like "(A AND -B) OR (-C AND D)", a node could create two new
sub-Query's: "A AND -B" and "-C AND D".

To implement the logic of an AND, a node would simply cache the result
sets of the two parts, and after a short time scan through them to see
where one RBlock from each set points to the same file.  Where there
is a match (or, for AND NOT, where there's not a match) the blocks
would be sent in a reply.

> Also, the current approach of searching for A and B individually and putting 
> the boolean expressions back together at the client side seem to work nicely. 

Well, for queries with lots of ANDs you'll get a lot of results that
the client will have to drop.  Distributing the boolean nature of the
query will get those bad results dropped much earlier, meaning the
network will have to do less work.

> "OR" is trivially implemented by doing multiple searches, and "NOT" could be 
> implemented theoretically but was not in the current code since we would need 
> the ability to "revoke" results (X matches when we get back A, but does no 
> longer match when we get back B _later_).

Well, other than the trivial OR, any boolean logic search will requre
caching of results at some point.  If the logic of the search is
distributed, the caching will be too.  When the logic is done entirely
by the client current way the caching must be done by the client as

Hmm, you know I've just thought of another way to think about the
distributed searches: Finite sets.  A search for "A" results in a
finite set of all the RBlocks and SBlocks with keyword "A"; "A"'s
result set.  Boolean searches simply do set arithmatic on the result
sets of the keywords (i.e. the result set of "A OR B" is the union of
the result sets for "A" and "B", the result set "A AND B" is the
intersection of their result sets and "A AND NOT B" is the difference
between the result sets).

>                               I felt that it was more reasonable 
> for the user to see all search results instantly and that it would be 
> counter-intuitive (and difficult to code) if we had to remove them from the 
> screen again.  But "NOT" (and "OR") can certainly be done without touching 
> anything but the AFS CLI/GUI code.   I'm just not sure if that would not 
> complicate the interface more than what the additional expressiveness would 
> be worth.

Well, I don't think you'd ever have to revoke results you'd already
sent. After all, searching is expected to be imperfect.  There will
never be any way to be sure that none of the results of your "A AND
NOT B" search are stored somewhere under keyword "B".  But you also
can't know that you have found all the possible results that are
stored under keyword "A".  If a a node (or the client) finds out about
a result for "B" after it's already sent on the data found under "A",
*that's just too bad*.  It will have to drop that reply on the floor,
just like it will during a normal search when a reply after the
query's indirection info has been squeezed out of the routing table.

In fact, a node handling a boolean query should cache recieved
responses for as long as possible, so as to get the greatest accuracy.
Only when the TTL runs down and the routing information is about to be
dropped would the logic be checked on the cached data.

Steven Barker                                      address@hidden
  Tact in audacity is knowing how far you can go without going too far.
                -- Jean Cocteau
Get my GnuPG public key at:
Fingerprint: 272A 3EC8 52CE F22B F745  775E 5292 F743 EBD5 936B

Attachment: pgpLu2Hs62JZg.pgp
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]