Re: [GNUnet-developers] Distributed Keyword Searching
Tue, 15 Jul 2003 16:49:39 +0200
On Tuesday, 15 July 2003 at 10:39, Steven Barker wrote:
> On Tue, Jul 15, 2003 at 04:45:50AM +0200, Christian Grothoff wrote:
> > Having a non-encrypted identifier portion in the RBlock introduces
> > security problems since malicious peers can tamper with them.
> Hmm, that's true. However, if the identifier is duplicated (included
> in both the encrypted and unencrypted parts), the originating node
> could detect the tampering. We could also allow signatures on RBlocks
> (do we already?) that would include the cleartext identifier.
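A minimal sketch of the duplicated-identifier check being proposed here, with a toy XOR cipher standing in for the real encryption (the field names and block layout are hypothetical, not GNUnet's actual RBlock format):

```python
# Hypothetical sketch of tamper detection via a duplicated identifier.
# The XOR "cipher" is illustrative only; real RBlocks use proper
# symmetric encryption keyed on the keyword.

def make_rblock(identifier: bytes, key: bytes) -> dict:
    # Store the identifier both in cleartext and inside the
    # encrypted payload.
    encrypted = bytes(b ^ k for b, k in zip(identifier, key))
    return {"cleartext_id": identifier, "encrypted_id": encrypted}

def verify_rblock(block: dict, key: bytes) -> bool:
    # The originating node, which knows the key, detects tampering
    # with the cleartext portion by comparing the two copies.
    decrypted = bytes(b ^ k for b, k in zip(block["encrypted_id"], key))
    return decrypted == block["cleartext_id"]
```

Note that this check only helps the originating node: intermediaries, lacking the key, still cannot verify the block.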
Well, the SBlocks are the ones that will have a signature, but note that
signature verification comes at a quite high cost for all intermediaries (it
is likely to cost more CPU time than all other operations combined!).
Furthermore, even if we fix tampering that way, what about censorship? The
reason for the encryption is to make it difficult or impossible for
intermediaries to censor specific content. If we put the content
identifier in plaintext in the block, censorship of a specific document
becomes possible for the intermediaries. I would not want to reduce
censorship resistance for a possibly more efficient search (note that with
the signature verification, you're already doing much, much worse CPU-wise).
> It might also be possible to do a distributed search on SBlock
> keywords, even though there's no unused space in the SBlock to add a
> plaintext copy of the FileIdentifier (so the only copy of the
> FileIdentifier would have to be plaintext). It would only work for
> SBlocks that point to directories: valid directories would have to
> contain complete plaintext copies of all the SBlocks that point to
> them.
Again, what a wonderful situation for the censor: he gets a list of all the
SBlocks in plaintext.
> That would make tampering evident, since a malicious node can't
> decrypt the SBlock it's tampering with, and so any directory it
> changes the FileIdentifier to will not have the decrypted SBlock. The
> node originating the search can download the directory and verify that
> the SBlock that directed it there was decryptable by the creator of
> the directory.
> > Worse, you do not
> > specify how you would union the RBlocks over the network (I may have file
> > X under keyword A and you have file X under keyword B. How do you put
> > these two results together such that X does not match "(A OR B) AND NOT
> > (A AND B)"?)
> Well, you made me think about this a bit and I think that it would
> only ever be worth distributing queries rooted with ANDs. My idea of
> using De Morgan's law to move all the NOTs out to the leaf nodes won't
> work. Instead NOTs can only come immediately after an AND, and must
> apply to either an OR or a single keyword (it might make sense to
> flatten nested ORs too, and make the single-keyword case just an OR
> with only one value). So your example is not fully distributable. In
> fact, a smart implementation should notice that it already has to
> request A and B separately to solve the undistributable "A OR B" and
> so it can solve "A AND B" without any additional queries. Given a
> fully distributable query like "(A AND -B) OR (-C AND D)", a node
> could create two new sub-queries: "A AND -B" and "-C AND D".
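A tiny sketch of the decomposition being described, using an illustrative representation (a query as a top-level OR of AND-clauses, each clause a pair of positive and negated keyword sets; this is not GNUnet's actual query format):

```python
# Illustrative only: in the normal form described above, a query is a
# top-level OR of AND-clauses, each pairing positive keywords with
# negated ones. "(A AND -B) OR (-C AND D)" in this representation:
query = [({"A"}, {"B"}),   # clause 1: A AND NOT B
         ({"D"}, {"C"})]   # clause 2: D AND NOT C

def split_into_subqueries(query):
    # The top-level OR is not distributable as a whole, but each
    # AND-clause can be forwarded as an independent sub-query.
    return list(query)

subqueries = split_into_subqueries(query)
```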
> To implement the logic of an AND, a node would simply cache the result
> sets of the two parts, and after a short time scan through them to see
> where one RBlock from each set points to the same file. Where there
> is a match (or, for AND NOT, where there's not a match) the blocks
> would be sent in a reply.
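A rough sketch of the AND logic just described: an intermediary caches the result sets for each keyword and, after a short window, replies only with blocks whose file identifier appears in both sets (or, for AND NOT, in the first but not the second). All names here are illustrative:

```python
# Hypothetical intermediary-side AND matching. Each result set is a
# list of (file_id, block) pairs cached for one keyword.

def and_match(results_a, results_b, negate_b=False):
    """Return the blocks to send in a reply for "A AND B", or for
    "A AND NOT B" when negate_b is set."""
    ids_b = {fid for fid, _ in results_b}
    if negate_b:
        # "A AND NOT B": keep A-results with no matching B-result.
        return [blk for fid, blk in results_a if fid not in ids_b]
    # "A AND B": keep A-results whose file also appeared under B.
    return [blk for fid, blk in results_a if fid in ids_b]
```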
But how would that save bandwidth? Instead of the originating node searching
separately for A and B, you make some intermediary search for A and B and
only respond once it has obtained both. It is not clear how that would be
more efficient. Even if it were, consider that a search is very cheap at
the moment: if you download a 3 MB file, you effectively do over 3000
searches. Now, if you need 2 searches for A and B and then issue 3000
searches for the actual download, is it really worth saving 1 search
initially while driving up the CPU cost, adding security concerns
(tampering, censorship), and with no clear bandwidth benefit from the
distributed search (I, at least, don't see one yet)?
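The 3000-searches figure follows from issuing one query per content block; assuming 1 KB blocks (my assumption, the block size is not stated in this thread), the arithmetic is:

```python
# Assuming 1 KB content blocks (an assumption), a 3 MB download means
# one query per block:
file_size = 3 * 1024 * 1024   # 3 MB in bytes
block_size = 1024             # 1 KB per block
queries = file_size // block_size   # 3072 download queries
# ...versus a single search saved by distributing "A AND B".
```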
> > Also, the current approach of searching for A and B individually and
> > putting the boolean expressions back together at the client side seem to
> > work nicely.
> Well, for queries with lots of ANDs you'll get a lot of results that
> the client will have to drop. Distributing the boolean nature of the
> query will get those bad results dropped much earlier, meaning the
> network will have to do less work.
> > "OR" is trivially implemented by doing multiple searches, and "NOT" could
> > be implemented theoretically but was not in the current code since we
> > would need the ability to "revoke" results (X matches when we get back A,
> > but does no longer match when we get back B _later_).
> Well, other than the trivial OR, any boolean logic search will require
> caching of results at some point. If the logic of the search is
> distributed, the caching will be too. When the logic is done entirely
> by the client (the current way), the caching must be done by the
> client as well.
Yes, even the current "AND" logic requires caching, but only of things
that so far satisfy the query only partially. It can never happen that we
discover
additional search results and that the set of replies shrinks. Thus we can
add a search result to the display as soon as the information retrieved from
the network matches the query. With NOT, we may have to remove a result
later. And note that there is no fixed TTL for a search; in fact, currently
the search is repeated multiple times with longer TTLs, so except for a
time-out, there is never a definitive point where we stop searching.
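The asymmetry described here can be sketched as follows (the routine and names are hypothetical, not gnunet-gtk code): with pure AND, display is monotone, so results are only ever added; once NOT is involved, a later reply can force a revocation.

```python
# Hypothetical sketch: why AND-only display is monotone but NOT is not.
displayed = set()

def on_reply(file_id, keyword, required, forbidden, seen):
    # Record that this file was seen under this keyword.
    seen.setdefault(file_id, set()).add(keyword)
    kws = seen[file_id]
    if required <= kws and not (forbidden & kws):
        # Pure AND: once all required keywords matched, display.
        # A later reply can only add matches, never remove one.
        displayed.add(file_id)
    elif file_id in displayed and (forbidden & kws):
        # NOT: a later reply under a forbidden keyword forces us to
        # remove an already-displayed result.
        displayed.discard(file_id)
```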
> Hmm, you know I've just thought of another way to think about the
> distributed searches: Finite sets. A search for "A" results in a
> finite set of all the RBlocks and SBlocks with keyword "A"; "A"'s
> result set. Boolean searches simply do set arithmetic on the result
> sets of the keywords (i.e. the result set of "A OR B" is the union of
> the result sets for "A" and "B", the result set "A AND B" is the
> intersection of their result sets and "A AND NOT B" is the difference
> between the result sets).
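The correspondence being described maps one-to-one onto set operations; a quick illustration with Python sets standing in for result sets (the block identifiers are made up):

```python
# Keyword -> set of matching block identifiers (hypothetical data):
results = {
    "A": {"x", "y", "z"},
    "B": {"y", "z", "w"},
}
a, b = results["A"], results["B"]
assert a | b == {"x", "y", "z", "w"}   # "A OR B"      -> union
assert a & b == {"y", "z"}             # "A AND B"     -> intersection
assert a - b == {"x"}                  # "A AND NOT B" -> difference
```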
Eh, in which sense are the sets finite? Well, yes, there's a finite number of
atoms in the universe and we only have 2^160 possible identifiers in GNUnet,
but I still think this is nonsense and doesn't really help us with anything.
> > I felt that it was more reasonable
> > for the user to see all search results instantly and that it would be
> > counter-intuitive (and difficult to code) if we had to remove them from
> > the screen again. But "NOT" (and "OR") can certainly be done without
> > touching anything but the AFS CLI/GUI code. I'm just not sure if that
> > would not complicate the interface more than what the additional
> > expressiveness would be worth.
> Well, I don't think you'd ever have to revoke results you'd already
> sent. After all, searching is expected to be imperfect. There will
> never be any way to be sure that none of the results of your "A AND
> NOT B" search are stored somewhere under keyword "B". But you also
> can't know that you have found all the possible results that are
> stored under keyword "A". If a node (or the client) finds out about
> a result for "B" after it's already sent on the data found under "A",
> *that's just too bad*. It will have to drop that reply on the floor,
> just like it will during a normal search when a reply after the
> query's indirection info has been squeezed out of the routing table.
True, you can always implement NOT by specifying that NOT may be ignored
by the implementation. Then we don't even have to write code to actually
do it.
> In fact, a node handling a boolean query should cache received
> responses for as long as possible, so as to get the greatest accuracy.
> Only when the TTL runs down and the routing information is about to be
> dropped would the logic be checked on the cached data.
Well, as I said, I think that users may prefer the open-ended search where
they get results displayed as they arrive. Why should I not be able to say
"timeout infinity" (which is, btw, how the gnunet-gtk code works) and abort
the search once I'm satisfied with the result set? I don't see waiting for
some timeout to be practical. The timeout is there for the CLI to avoid
searching forever because the user didn't know about CTRL-C or forgot
about the process, and because for most practical searches at the moment,
5 minutes is "infinity" (as in, longer makes no difference). If I search
for "GNU AND GPL", I can abort once I see the GPL result displayed; why
should I keep searching until I hit the timeout?