I just read over the thread on the cvs2svn list about this -- I have a
few random thoughts. Take them with a grain of salt, since I haven't
actually tried writing a CVS importer myself...

Regarding the basic dependency-based algorithm, the approach of
throwing everything into blobs and then trying to tease them apart
again seems backwards. What I'm thinking is, first we go through and
build the history graph for each file. Now, advance a frontier across
all of these graphs simultaneously. Your frontier is basically a
map <filename -> CVS revision> that represents a tree snapshot. The
basic loop is:

1) pick some subset of files to advance to their next revision
2) slide the frontier one CVS revision forward on each of those
   files
3) snapshot the new frontier (write it to the target VCS as a new
   tree commit)
4) go to step 1

Obviously, this will produce a target VCS history that respects the
CVS dependency graph, so that's good; it puts a strict limit on how
badly whatever heuristics we use can screw us over if they guess wrong
about things. Also, it makes the problem much simpler -- all the
heuristics are now in step 1, where we are given a bunch of possible
edits, and we have to pick some subset of them to accept next.
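
To make the loop concrete, here is a rough Python sketch of it. Everything in it is an assumption for illustration, not code from cvs2svn or any real importer: the names advance_frontier and pick are made up, each file's history is simplified to a linear list of revision ids (no branches), and "writing to the target VCS" is just appending the snapshot to a list. The heuristics of step 1 are passed in as the pick callback, which sees the possible next edits and returns the files to advance.

```python
from typing import Callable, Dict, List, Set

# Hypothetical, simplified model: each file's history is a linear list
# of revision ids, oldest first. Real CVS histories also have branches.
History = Dict[str, List[str]]
Frontier = Dict[str, str]  # filename -> CVS revision (one tree snapshot)
Picker = Callable[[Dict[str, str]], Set[str]]


def advance_frontier(history: History, pick: Picker) -> List[Frontier]:
    """Run the frontier loop; return the snapshots in commit order."""
    # filename -> index of the file's current revision in its history.
    # A file that is absent has not been created yet.
    position: Dict[str, int] = {}
    snapshots: List[Frontier] = []
    while True:
        # The possible edits: the next revision of every file that has one.
        candidates = {
            name: revs[position.get(name, -1) + 1]
            for name, revs in history.items()
            if position.get(name, -1) + 1 < len(revs)
        }
        if not candidates:
            return snapshots
        # Step 1: all the heuristics live in pick().
        chosen = set(pick(candidates))
        if not chosen or not chosen.issubset(candidates):
            raise ValueError("pick() must return a non-empty subset of candidates")
        # Step 2: slide the frontier one revision forward on each chosen file.
        for name in chosen:
            position[name] = position.get(name, -1) + 1
        # Step 3: snapshot the new frontier (stand-in for a tree commit).
        snapshots.append({name: history[name][i] for name, i in position.items()})
        # Step 4: loop.


history = {"a.c": ["1.1", "1.2"], "b.c": ["1.1"]}
everything = advance_frontier(history, lambda c: set(c))
one_at_a_time = advance_frontier(history, lambda c: {min(c)})
```

The two pick functions here are deliberately dumb (advance everything, or advance one file per commit). Whatever pick does, a file can only ever move to its own next revision, so a bad guess changes how edits are grouped into commits but cannot reorder any file's history.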