Re: [Monotone-devel] Scalability question


From: Nathaniel Smith
Subject: Re: [Monotone-devel] Scalability question
Date: Fri, 4 Aug 2006 19:09:34 -0700
User-agent: Mutt/1.5.12-2006-07-14

On Fri, Aug 04, 2006 at 02:34:11PM -0400, Jonathan S. Shapiro wrote:
> If I understand the documents correctly, there are a whole lot of places
> in the monotone schema that are very similar to things we did in OpenCM.
> One of these bit us badly on scalability. I want to identify the issue,
> explain how it bit us, and ask whether it has been a problem in
> monotone. If not, why not?
> 
> The Monotone "Manifest" is directly equivalent to the OpenCM "Change"
> object. We went through various iterations on our Change objects, and we
> hit two scalability issues. The first arises with very large projects.
> The second impacts initial checkout (in monotone, it would probably
> arise in push/pull rather than checkout).
> 
> Like monotone, OpenCM does not store entries for directories; they are
> implicit in the file paths. In contrast to Monotone, OpenCM adds a level

As mentioned elsewhere in this thread, we record directories
explicitly now.  By the time we got lifecycle sanity checking and
full rename support straightened out, it turned out to be more work to
leave them out than put them in.
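
(For concreteness -- with the caveat that the exact syntax here is from
memory and the path/hash are invented for illustration -- a manifest
with explicit directory entries looks roughly like:

    dir "src"

    file "src/main.c"
     content 66a2eb8f5e9f7c1c3c0d3b9c4a1f2e3d4b5c6d7e

i.e., directories get entries of their own instead of being inferred
from the file paths.)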

> of indirection between our Change records and our Content objects. The
> intermediate object is called an Entity. It stores the (file-name,
> content-sha1) pair and a couple of other things that aren't important
> for this question.

Don't quite follow this.

> Consider a mid-sized project such as EROS, which has ~20,000 source
> files. [For calibration, OpenBSD is *much* larger]. This means 20,000
> sha-1's in the Manifest/Change. In OpenCM, these are stored in binary
> form, so each sha-1 occupies 20 bytes, and the resulting Change object
> is about 400 kilobytes.

Yeah.  We use text, actually, though it's compressed on disk.
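
(Back-of-envelope, counting only the hashes -- paths, other metadata,
and the on-disk compression are all ignored, so this is purely
illustrative; the 20,000 figure is yours from above:

    # rough size of the hash portion of a 20,000-entry tree description
    n_files = 20_000
    print(n_files * 20)   # 400,000 bytes: binary 20-byte SHA-1s
    print(n_files * 40)   # 800,000 bytes: 40-char hex SHA-1s,
                          # before compression

so the text form roughly doubles the raw hash bytes before
compression.)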

> This particular object sees a lot of delta computations, and simply
> reading and writing it takes a noticeable amount of time. Also, the need

We haven't noticed this become a problem yet, though it's certainly
possible it will.  Reading/writing a few hundred kb on local disk,
especially with a hot cache, isn't _too_ bad; and with a cold cache,
directory scanning will almost certainly swamp any dealings with
trees.

(Recall that for monotone, all operations except synchronization are
against local disk only.)

Unfortunately, the places where you have to read this object are the
places that are most speed-sensitive -- 'diff', 'status', and 'commit'
should all ideally be sub-second operations, and the only expensive
parts of them are reading the old tree, directory scanning, and
possibly writing the new tree.

> to sync a 400 kbyte object in order to begin a checkout is very
> disconcerting to users -- especially when you are doing it over a slow
> link at (e.g.) a hotel or over a PPP link [Yes, a lot of people really
> still use dial-up].

I take it that "checkout" for OpenCM is like CVS checkout -- it
creates a local workspace from some revision on a remote server?

Surely the 400 kbyte object is much smaller than the actual contents
of the 20,000 files that must also be transferred?  From your
description, this sounds less like a scalability problem and more like
a problem of providing appropriate feedback to the user.

(Interestingly, rsync has exactly the same problem -- it starts with
the potentially _very_ lengthy "transferring file list" part, and
gives no feedback during this.)

-- Nathaniel

-- 
.i dei jitfa fanmo xatra



