Re: [Monotone-devel] cvs import


From: Michael Haggerty
Subject: Re: [Monotone-devel] cvs import
Date: Thu, 14 Sep 2006 11:48:04 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.5) Gecko/20060728 Thunderbird/1.5.0.5 Mnenhy/0.7.4.666

Nathaniel Smith writes:
> I just read over the thread on the cvs2svn list about this -- I have a
> few random thoughts.  Take them with a grain of salt, since I haven't
> actually tried writing a CVS importer myself...
> 
> Regarding the basic dependency-based algorithm, the approach of
> throwing everything into blobs and then trying to tease them apart
> again seems backwards.  What I'm thinking is, first we go through and
> build the history graph for each file.  Now, advance a frontier across
> all of these graphs simultaneously.  Your frontier is basically a
> map <filename -> CVS revision>, that represents a tree snapshot.  The
> basic loop is:
>   1) pick some subset of files to advance to their next revision
>   2) slide the frontier one CVS revision forward on each of those
>      files
>   3) snapshot the new frontier (write it to the target VCS as a new
>      tree commit)
>   4) go to step 1
> Obviously, this will produce a target VCS history that respects the
> CVS dependency graph, so that's good; it puts a strict limit on how
> badly whatever heuristics we use can screw us over if they guess wrong
> about things.  Also, it makes the problem much simpler -- all the
> heuristics are now in step 1, where we are given a bunch of possible
> edits, and we have to pick some subset of them to accept next.
> 
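
To make the loop concrete, here is a rough Python sketch of what I
understand you to mean.  The data layout and the pick_subset/snapshot
hooks are made up (this is not monotone or cvs2svn code); branches are
ignored and each file's history is taken as a flat, ordered list of
revisions:

  def convert(file_revisions, pick_subset, snapshot):
      # file_revisions: dict mapping filename -> ordered list of revisions
      frontier = {name: -1 for name in file_revisions}   # -1 = pre-birth
      def unfinished():
          return any(frontier[n] + 1 < len(revs)
                     for n, revs in file_revisions.items())
      while unfinished():
          # 1) pick some subset of files to advance to their next revision
          #    (pick_subset must only return files that still have one)
          chosen = pick_subset(frontier, file_revisions)
          assert chosen, "must advance at least one file per iteration"
          # 2) slide the frontier one CVS revision forward on those files
          for name in chosen:
              frontier[name] += 1
          # 3) snapshot the new frontier as a tree commit in the target VCS
          snapshot(dict(frontier))
          # 4) go to step 1
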
> This isn't a trivial problem.  I think the main thing you want to avoid
> is:
>     1  2  3  4
>     |  |  |  |
>   --o--o--o--o----- <-- current frontier
>     |  |  |  |
>     A  B  A  C
>        |
>        A
> Say you have four files named "1", "2", "3", and "4".  We want to
> slide the frontier down, and the next edits were originally created by
> one of three commits, A, B, or C.  In this situation, we can take
> commit B, or we can take commit C, but we don't want to take commit A
> until _after_ we have taken commit B -- because otherwise we will end
> up splitting A up into two different commits, A1, B, A2.

The main problem with converting CVS repositories is their unreliable
timestamps.  Sometimes they are off by a few minutes; that would be no
problem for your algorithm.  But they might be off by hours (maybe a
timezone was set incorrectly), and it is not unusual for a server with a
bad battery to reset its clock to Jan 1 1970 after each reboot and to
run that way for a while before somebody notices.  Timestamps that are
too far in the future are probably rarer, but they also occur.  CVS
timestamps are simply not to be trusted.

The best hope for correcting timestamp problems is to pool information
across files.  For example, you might have the following case:

  1   2
  |   |
  A   Z
  |
  B
  :
  Y
  |
  Z

where A..Y have correct timestamps but Z has an incorrect timestamp far
in the past.  It is clear from the dependency graph that Z was committed
after Y, and by implication that revision Z of file 2 was committed at the
same time.  But your algorithm would grab revision Z of file 2 first,
even before revision A of file 1.
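
One way to pool that information (a rough sketch with a made-up data
layout, not what cvs2svn actually does): take each revision's per-file
predecessor as a lower bound on its true commit time, and let revisions
that share an author and log message -- and so presumably belong to the
same commit -- share the best bound.  In the example above, Z of file 2
would then be pushed forward to at least the time of Y:

  def pooled_timestamps(file_revisions):
      # file_revisions: dict filename -> ordered list of (ts, author, log)
      bound = {}      # (filename, index) -> lower bound on commit time
      by_change = {}  # (author, log) -> best lower bound seen so far
      for name, revs in file_revisions.items():
          prev = float("-inf")
          for i, (ts, author, log) in enumerate(revs):
              lower = max(ts, prev)   # a revision cannot precede its parent
              bound[(name, i)] = lower
              prev = lower
              key = (author, log)
              by_change[key] = max(by_change.get(key, float("-inf")), lower)
      # second pass: share the pooled bound across the presumed changeset
      for name, revs in file_revisions.items():
          for i, (ts, author, log) in enumerate(revs):
              bound[(name, i)] = max(bound[(name, i)],
                                     by_change[(author, log)])
      return bound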

The point of the blob method that I proposed is that timestamps are
secondary in deciding what constitutes a changeset.  Any changeset
consistent with the dependency graph (subject perhaps to some timestamp
heuristics [*]) is accepted.

[*] Typically, clock inaccuracies will affect all CVS revisions that
made up a changeset.  Hence the suggestion to split blobs that contain
a time gap of more than (say) five minutes.
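
A sketch of that splitting rule (hypothetical representation: sort by
timestamp and cut wherever two consecutive revisions are more than five
minutes apart):

  def split_blob(revisions, max_gap=5 * 60):
      # revisions: list of (timestamp, filename, revnum) tuples
      pieces, current = [], []
      for rev in sorted(revisions):          # order by timestamp
          if current and rev[0] - current[-1][0] > max_gap:
              pieces.append(current)
              current = []
          current.append(rev)
      if current:
          pieces.append(current)
      return pieces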

> There are a lot of approaches one could take here, on up to pulling
> out a full-on optimal constraint satisfaction system (if we can route
> chips, we should be able to pick a good ordering for accepting CVS
> edits, after all).  A really simple heuristic, though, would be to
> just pick the file whose next commit has the earliest timestamp, then
> group in all the other "next commits" with the same commit message,
> and (maybe) a similar timestamp.  I have a suspicion that this
> heuristic will work really, really well in practice.  Also, it's
> cheap to apply, and worst case you accidentally split up a commit that
> already had wacky timestamps, and we already know that we _have_ to do
> that in some cases.
> 
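
That heuristic would slot into the earlier frontier sketch as its
pick_subset hook; here is a rough version (the five-minute window and
the (ts, author, log) layout are assumptions, nothing tested against a
real repository):

  def pick_subset(frontier, file_revisions, window=5 * 60):
      # each file's next (ts, author, log) revision, if it has one
      candidates = {}
      for name, revs in file_revisions.items():
          i = frontier[name] + 1
          if i < len(revs):
              candidates[name] = revs[i]
      if not candidates:
          return set()
      # the file whose next commit has the earliest timestamp
      seed = min(candidates, key=lambda n: candidates[n][0])
      seed_ts, seed_author, seed_log = candidates[seed]
      # group in the other "next commits" with the same log message
      # (and author) and a nearby timestamp
      return {name for name, (ts, author, log) in candidates.items()
              if author == seed_author and log == seed_log
              and abs(ts - seed_ts) <= window}
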
> Handling file additions could potentially be slightly tricky in this
> model.  I guess it is not so bad, if you model added files as being
> present all along (so you never have to add whole new entries to
> the frontier), with each file starting out in a pre-birth state, and
> then addition of the file is the first edit performed on top of that,
> and you treat these edits like any other edits when considering how to
> advance the frontier.
> 
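
In the frontier sketch earlier in this message that is what the -1
("pre-birth") index was meant to capture; a snapshot helper would then
simply skip files that have not been born yet (again a hypothetical
sketch, not real importer code):

  def tree_snapshot(frontier, file_revisions):
      # only files past their pre-birth state appear in the tree
      return {name: file_revisions[name][i]
              for name, i in frontier.items() if i >= 0}
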
> I have no particular idea on how to handle tags and branches here;
> I've never actually wrapped my head around CVS's model for those.
> I'm not seeing any obvious problem with handling them, though.

Tags and branches do not have any timestamps at all in CVS.  (You can
sometimes put bounds on the timestamps: a branch must have been created
after the version from which it sprouts, and before the first commit on
the branch (if there ever was a commit on the branch).)  And it is not
possible to distinguish whether two branches/tags sprouted from the same
revision of a file or whether one sprouted from the other.  So a
date-based method has to work hard to get tags and branches correct.
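
Those bounds can at least be computed and then pooled across files, in
the same spirit as the timestamp pooling above.  A rough sketch, with a
made-up data layout (None meaning "unbounded"):

  def per_file_bounds(sprout_time, first_branch_commit_time=None):
      # created after the sprout revision, before the first branch commit
      return (sprout_time, first_branch_commit_time)

  def pooled_branch_bounds(bounds):
      # intersect per-file bounds for the same branch/tag across files
      lowers = [lo for lo, hi in bounds if lo is not None]
      uppers = [hi for lo, hi in bounds if hi is not None]
      return (max(lowers) if lowers else None,
              min(uppers) if uppers else None)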

> In this approach, incremental conversion is cheap, easy, and robust --
> simply remember what frontier corresponded to the final revision
> imported, and restart the process directly at that frontier.
> 
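
For what it's worth, the saved state could be as small as the frontier
itself; a minimal sketch with an assumed JSON on-disk format:

  import json

  def save_frontier(frontier, path):
      with open(path, "w") as f:
          json.dump(frontier, f)

  def load_frontier(path, file_revisions):
      try:
          with open(path) as f:
              return json.load(f)
      except FileNotFoundError:
          return {name: -1 for name in file_revisions}   # fresh conversion
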
> Regarding storing things on disk vs. in memory: we always used to
> stress-test monotone's cvs importer with the gcc history; just a few
> weeks ago someone did a test import of NetBSD's src repo (~180k
> commits) on a desktop with 2 gigs of RAM.  It takes a pretty big
> history to really require disk (and for that matter, people with
> histories that big likely have a big enough organization that they can
> get access to some big iron to run the conversion on -- and probably
> will want to anyway, to make it run in reasonable time).

Michael



