sourcesafe2subversion Migration tool

Created 23rd July, 2006 06:20 (UTC), last edited 29th October, 2008 16:07 (UTC)

As far as I can tell Brett Wooldridge started this off with a Perl tool to migrate his Visual SourceSafe databases to Subversion. His approach was then copied and improved on by Power Admin to give a C++ solution. Now I guess it's my turn… The tool's home page is on the FOST.3™ web site at http://fost.3.felspar.com/FOST.3/Application:/sourcesafe2subversion.

My starting point was Power Admin's migration tool. I haven't studied Brett Wooldridge's Perl version as I can't really read Perl, but I am assuming from his write up and Power Admin's write up that the two tools use the same migration strategy.

Using Power Admin's tool to port our older alpha development branch of FOST.3™ into the test repository took a few hours to work out how to use the software and then just over 2½ hours to run. Checking through this afterwards exposed the major flaw in the migration method: it was oriented towards the structure of the files in SourceSafe at the time of migration and was not able to deal with shared files. For the way that we develop here this was a real problem.

We use a lot of shared files and setting this up manually in Subversion was certainly going to be a major pain. It seemed to me that a better approach would be to replay the operations that had been done in the original database on the new Subversion repository. This would allow me to keep the shared files as logically shared within Subversion, something that I was sure would make the transition much smoother. It would also have the advantage of forcing me to learn all aspects of Subversion properly before I was trying to show others how we were going to use it.

I felt that it would be fairly simple to produce a slightly more complex version of Power Admin's software that could merge together file histories wherever possible. It turns out that this just isn't possible and it took me many days to work out why not.

  1. SourceSafe only gives us access to changes on files and projects (I'll call them directories from now on to avoid confusion) that are still extant at the time that you query it.
  2. There is no GUID for any file or directory and all the history has the names and paths that the file or directory had at that time and not what they have now.
  3. The entire history of a given file is not shown in its history reports. File/directory renames are shown in the parent directory history (the parent they had at the time no less), and directory sharing is shown in the parent directories.
  4. The time resolution is only to the nearest minute, which means that many events that must be replayed in the right order, but are split between project and file histories, have to be ordered by guesswork.
  5. Files and directories can get "lost" when a parent directory that their history has moved through has later been deleted.
  6. Although we get a hint to tell us the location where a file is checked in we don't get good information about where the file is first created.

Nobody wants to be bored with details of the wrong turns so I won't talk about them. I'll only discuss the obstacles that are left after all of my dead ends have been filtered out.

I expected the development to take a couple of days. In the end it took a couple of weeks: every time I thought I'd made a breakthrough, SourceSafe (and later Subversion) added an extra layer of complexity. It wasn't until I fully embraced the sort of complex data structures that are easy to build in C++ that I started to make real headway.

Multidimensional shenanigans

We can think of the revision control information in SourceSafe as stacks of versions for each file (that we are still using at the point of the migration) with the newest version at the top of the stack. The stacks are laid out by the directories that contain them. Of course some stacks are higher than others (there are more versions), some stacks are exact copies of each other (where the files are shared), and more interestingly sometimes parts of stacks are copies of each other (where a file was shared and then branched).

The simplest way to deal with this is to ignore the duplication between stacks and parts of stacks and to treat each stack of versions completely independently. This is the view that both Brett Wooldridge's and Power Admin's tools take of the SourceSafe database and the view that they leave the Subversion repository in. It guarantees that all of the file versions are present, but it also loses a lot of the information that is in the SourceSafe database.

A more accurate view is that we have a tree structure of directories and files that change over time. In order to work out where to put things at certain times we have to work out how this tree changes over time and this is what sourcesafe2subversion does.

This task is thankfully made easier because FOST.3™ already included a lot of support for handling time based changes as they are very common in business systems. Using these same data structures to model an ever changing directory tree was fairly straightforward, although also fiddly due to the fact that we have to construct the tree backwards in time and replay it forwards in time.
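
The FOST.3™ classes themselves aren't reproduced here, but a minimal sketch of the idea (the names are mine, not the library's) is an entry that records, for each moment of interest, where it lived at that time. Building it backwards in time just means inserting earlier and earlier keys; replaying forwards is an in-order walk.

#include <ctime>
#include <map>
#include <string>

// Illustrative only: a file or directory entry whose location changes over
// time. The map records, from each moment onwards, the path the entry had,
// so "where was this at time t?" is a lookup of the last change at or before t.
struct tracked_entry {
    std::map<std::time_t, std::string> path_from;

    std::string path_at(std::time_t when) const {
        std::map<std::time_t, std::string>::const_iterator
            pos = path_from.upper_bound(when);
        if (pos == path_from.begin())
            return std::string(); // the entry did not exist yet
        --pos;
        return pos->second;
    }
};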

Hopefully all of this will make more sense as I explain in detail what happens.

Reading the history

The first part of the migration is to read the file history for each of the files that we're going to migrate. The tool only considers files that are still visible in the SourceSafe database at the time of migration — this is in keeping with the earlier tools. Also in keeping with them it ignores labels. If you need deleted files or directories to be transferred then look for them in SourceSafe and undelete them first.

The history files that SourceSafe generates need to be parsed to fetch the required information¹ [1It actually takes SourceSafe quite a long time to generate these file histories and because of this there is an option to cache the file histories. This can massively speed up the start up time for the tool when running it multiple times to work out problems.].

For files we only look for create and check-in events, ignoring everything else. For directories we look at a wider range of tasks, including sub-directory creation and deletion; file deletion; file and directory recovery; and renames, amongst others.
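
As a purely illustrative sketch (the tool's real event set and naming will differ), the events that survive the parse can be modelled as something like this:

#include <ctime>
#include <string>
#include <vector>

// The kinds of history entry that are kept for replay. File histories only
// contribute the first two; everything else comes from directory histories.
enum event_kind {
    file_created, file_checked_in,
    directory_created, directory_deleted,
    file_deleted, item_recovered, item_renamed, item_shared
};

struct history_event {
    event_kind kind;
    std::time_t when;    // SourceSafe only gives this to the nearest minute
    std::string path;    // the path the item had at the time of the event
    std::string comment;
};

// A history is simply the events for one file or directory, oldest first.
typedef std::vector<history_event> history;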

We can largely ignore branches because Subversion does a natural “soft” branch at each copy command. What this means though is that where a file is shared but not branched in SourceSafe there will be separate Subversion check-ins for each directory that the shared file resides in.

The history from SourceSafe preserves the relative ordering for any given file or directory (I'll just call these histories), but gives us the history in reverse order (with some interesting gotchas). We reverse these to get the natural order.

Pinned files

Pinned files are evil; the need for them arises because SourceSafe doesn't have proper roll back support. They do of course complicate the file histories.

When you request the initial file listing from SourceSafe it will normally just put the files that are found in a project in this sort of format:

$/FOST.3/sourcesafe2subversion:
event.cpp
event.hpp

This is pretty easy to parse, but if you have a file pinned then you get something that looks like this:

$/FOST.3/sourcesafe2subversion:
event.cpp;2
event.hpp

This tells me that the SourceSafe file $/FOST.3/sourcesafe2subversion/event.cpp is using version two rather than the latest version. This tool throws away this extra information.
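
Discarding the pin marker is just a case of splitting the listing line at the final semi-colon. A hypothetical fragment (the real parser is more involved):

#include <cstdlib>
#include <string>
#include <utility>

// Split a listing line such as "event.cpp;2" into the file name and the
// pinned version number. A version of zero means the file is not pinned.
std::pair<std::string, int> parse_listing_line(const std::string &line) {
    std::string::size_type semi = line.rfind(';');
    if (semi == std::string::npos)
        return std::make_pair(line, 0);
    return std::make_pair(line.substr(0, semi),
                          std::atoi(line.c_str() + semi + 1));
}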

After using the tool you may need to analyse any pinned versions that are found² [2If you use the history caching option then you can just do a search for semi-colons in the file $.txt to discover them.]. For pinned files you have essentially three options:

  1. Use the latest version anyway. This is the do nothing option.
  2. Roll the file back in SourceSafe and lose the later versions altogether.
  3. Manually copy the earlier version in Subversion when the migration has completed.

We only had one pinned file in our master database so I just copied the required version after the migration completed.

Temporary files

Visual Studio .NET seems to have introduced another fun new feature when you add projects to SourceSafe. It creates some temporary files in the new project directories and then immediately deletes them. These delete instructions put other instructions out of order, meaning that some files would end up in the wrong place.

Because of this any delete command for a file whose name starts with a tilde (~) will not be added to the history. If you use files starting with tildes you will need to have a think about how to deal with this.
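
The filter itself amounts to nothing more than a name check on delete events, along these lines:

#include <string>

// Delete events for Visual Studio's temporary files are dropped. Only file
// names are checked; directories starting with a tilde are left alone.
bool ignore_delete_event(const std::string &file_name) {
    return !file_name.empty() && file_name[0] == '~';
}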

Sorting the events

Because the time resolution for the history in SourceSafe is poor, and we need to replay events in the right order, the first thing that we do is to calculate an order for the events. Two heuristics are layered together to get an ordering that should work.

Remember that the events that we need to re-run are stored in separate histories for different types of event and the SourceSafe time resolution is so poor that we can't just use its time stamps to work out the proper order.

Our starting point is the full list of events split into a structure that allows us to group them by any SourceSafe time stamp. We know the relative order for the events in a given history because SourceSafe records them in the right order (although it delivers the history in reverse order the first stage has already sorted this out³ [3This is of course not the whole story. Sometimes SourceSafe records the history in the right order and we have to find these too in order to put them in the reverse order before reversing them into the correct again — or something like that…]).

The explanation works best if the second heuristic is explained first. When we look at the events that happen in any given minute we want to list these in the order that we will play them. We know that the events within a given history are already correct so we will not touch the relative ordering in the histories. This means that if we can give a weight for each type of event then we simply go through the histories picking the lightest event from all the histories we are looking at.

The weights are assigned such that directory creation is lighter than getting a version, which in turn is lighter than directory deletion. This works fine except for project (directory) sharing in SourceSafe. SourceSafe shares the files in a directory before it creates the sub-directories. Because the sub-directory shares will appear first in the histories this means that when we replay the events we will try to share into directories that do not exist. This is where the first heuristic comes in.

What we would have liked is for SourceSafe to create the sub-directories before it does the share at a given directory level. As it doesn't do this we attempt to spot this behavior and reverse the events that happen at this time stamp. This is the place where the tool is most likely to get confused and it happens in one specific circumstance, described next.
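
Before looking at that, here is a sketch of the weight based merge (the second heuristic), re-using the illustrative event structures from earlier; the weights and names are mine, not the tool's actual values:

#include <cstddef>
#include <deque>
#include <vector>

// Illustrative weights: directory creation must replay before check-ins and
// shares, which must in turn replay before directory deletion.
int weight(event_kind kind) {
    switch (kind) {
        case directory_created: return 0;
        case directory_deleted: return 2;
        default:                return 1;
    }
}

// Merge the events that fall within one SourceSafe minute. Each history is a
// queue of its own events, already in the right relative order; we always
// take the lightest event from the front of any queue and never re-order
// within a single history.
std::vector<history_event> merge_minute(std::vector<std::deque<history_event> > &histories) {
    std::vector<history_event> ordered;
    for (;;) {
        std::deque<history_event> *lightest = 0;
        for (std::size_t h = 0; h < histories.size(); ++h)
            if (!histories[h].empty() && (!lightest ||
                    weight(histories[h].front().kind) < weight(lightest->front().kind)))
                lightest = &histories[h];
        if (!lightest) break;
        ordered.push_back(lightest->front());
        lightest->pop_front();
    }
    return ordered;
}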

Sharing

Here is the normal effect of sharing a project (directory) in SourceSafe. What we have done here is to share the WWW project into $/FOST.3/Project1.

                     Kirit      11/05/06  14:16    Added _debug                          
                     Kirit      11/05/06  14:16    Added _fslib                          
                     Kirit      11/05/06  14:16    Shared $/FOST.3/Project1/WWW/config.ini                                   
                     Kirit      11/05/06  14:16    Shared $/FOST.3/Project1/WWW/favicon.ico
                     Kirit      11/05/06  14:16    Created WWW

It looks to me that SourceSafe has recorded the events in the opposite order in the history for some reason, but it could be that it just does the sub-directories in some random order. In any case the first heuristic that we use spots this case and puts the sub-directory creation events first by reversing the events for that directory during that minute.

Here is one with a problem though:

                     Kirit      30/08/05  14:24    Shared $/FOST.3/Project2/WWW/config.ini                                   
                     Kirit      30/08/05  14:24    Added _fslib                          
                     Kirit      30/08/05  14:24    Added _debug                          
                     Kirit      30/08/05  14:24    Shared $/FOST.3/Project1/WWW/favicon.ico
                     Kirit      30/08/05  14:23    Created WWW

What happened here is that the WWW directory was added manually and then favicon.ico was shared. I then followed this by individually sharing the sub-directories and finally sharing the file config.ini. The heuristic can't spot this, and for this reason the events occur in the wrong order.

In order to overcome this the parser looks for a particular comment format for a file share. We can edit the comment for this event in the directory history and set it to:

sourcesafe2subversion timestamp: 2005-08-30 14:25:00.000

This tells the parser to not use the timestamp from the history, but to use the specified one instead. By setting this just one minute later the correct re-ordering of the events at 14:24 will occur.
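
Spotting the override is a comment prefix check followed by a date parse, roughly like the sketch below (the tool's real parsing may differ in detail):

#include <cstdio>
#include <ctime>
#include <string>

// If a history comment starts with the magic prefix, parse the timestamp that
// follows and use it instead of the minute-resolution SourceSafe one.
bool timestamp_override(const std::string &comment, std::tm &stamp) {
    static const std::string prefix("sourcesafe2subversion timestamp: ");
    if (comment.compare(0, prefix.size(), prefix) != 0)
        return false;
    int y, mo, d, h, mi; double s;
    if (std::sscanf(comment.c_str() + prefix.size(),
                    "%d-%d-%d %d:%d:%lf", &y, &mo, &d, &h, &mi, &s) != 6)
        return false;
    stamp = std::tm();
    stamp.tm_year = y - 1900; stamp.tm_mon = mo - 1; stamp.tm_mday = d;
    stamp.tm_hour = h; stamp.tm_min = mi; stamp.tm_sec = static_cast<int>(s);
    return true;
}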

In the migration of our SourceSafe database this had only happened twice, so it was fairly easy to work around by changing the time stamps for the two file shares.

Planning

We now have the file structures as they are currently and all of the file histories going back in time as if they were all in their current locations. The planning phase attempts to work out how this structure changed over time so that we can work out where files were at earlier times. In particular we are interested in merging shared files into a single history wherever possible, and as an added bonus we also get to replay delete and undelete commands in many circumstances.

There are a number of stumbling blocks though, the most important of which is that we don't have history for any directories that have subsequently been deleted. We can't get it by just undeleting them and taking a peek because the records may have been purged from the database.

We cannot even assume that the database was empty when it was first created, as there can be files and directories from earlier than this if they were restored into the database. This all makes the planning rather tricky, but it does its best.

In order to deal with situations where it doesn't have information it makes use of a holding directory called _vss_deleted, where file versions that can't be given a home anywhere else are put. This includes things like all file histories for structures that have been restored into the database, and also individual file versions for files where the trail has gone cold because a directory along their history has been deleted.

This leads to a situation where we get multiple check ins for a file that is actually the same file. If we consider a file that has been shared between projects A and B and then later from B to C we will end up with three files when planning starts as there are three files in the SourceSafe database now.

The planning stage will merge these together so that when it replays the SourceSafe events it will only play a single event into Subversion before the share to B, and then only two until the share from B to C. After this point it will have to write all three versions for each change.

Lost files

When the planning moves back it will spot and merge the files from C and B and it will then spot and merge the files from B and A. This means that it will knock out all versions of the file in C prior to the share from B and all versions of the files in C and B prior to the share from A. This is all well and good and helps to migrate the true history of the SourceSafe database.

The problem comes when B has subsequently been deleted. In this situation there are two files at the beginning of the planning stage, those in A and C. The share instruction is lost as we don't have any history for B, and hence we know that the file in C has history from before C was created, but we don't know where that history should be.

Under replay these orphaned files are put into _vss_deleted and for each early version of the file in C we will get a version written to both A and to _vss_deleted even though they are actually the same versions.

Although this is the safest approach to take there is an alternative way of doing it. We can get the directory that a particular check-in was done to (although not for the first version) in the file history. If we find a file in the specified directory with a check-in at the same time then this is probably the same file. I say probably because there is a circumstance under which they will be different.

If we take our same example of directory sharing from earlier we can imagine two files X and Y, both in A. These will get shared to B and then on to C. As described earlier, when we have all three directory histories available to us we know what happened and everything is fine. We can merge them backward in time during the planning phase. Of course if B is missing then we lose the connection between the versions in C and the versions in A. But we can still find the check-ins that the files in C had made in A, because the history will say something like:

*****************  Version 2  *****************
User: Kirit        Date:  8/10/02   Time: 18:36
Checked in $/A
Comment: Example checkin

And, what's better, is that we will see the same entry in our versions in C. This looks like a good enough basis to merge them on, but there is a problem. What if in B the file X was renamed to X.temp, Y was then renamed to X and X.temp was renamed to Y? We would now merge the files back to the wrong versions because we cannot see the renames as we cannot see B's history. Because of this potential problem the Merge orphaned files option in the configuration can be turned off if you are worried about it. However, to try to minimise the problem the following checks are done:

  1. All prior versions must have the same version number.
  2. All prior check-in time stamps must match.
  3. The check-in path on all prior versions must match.

With these extra checks in place it seemed to me that this potential renaming problem probably wasn't going to come up, and if it did happen then we'd live with it; for us it wouldn't be an issue. In order to mitigate against it completely there are two other options:

  1. Restore the deleted directory B if it is still available. This brings back the history that can be used to merge them properly. You can delete it again from Subversion after the migration is complete.
  2. Get the planning stage to do a get on the earlier versions and compare them. Only allow the merge if they actually are identical.

These could both be added in, but I felt it wasn't worth the extra coding for this pathological case.

Replay

Lest those of you reading this think that all of the problems come from using SourceSafe, let me disabuse you of that right now. The problems with using the command line client svn only really start with this stage, which takes the events we've worked out and replays them into the Subversion repository.

Apart from the cryptic error messages and the poorly described commands and options there is also one missing command: undelete. I'm sure the lack of this command has been gone over a zillion times in one place or another, and of course the response is that you just copy the file from the revision it was last in:

svn copy svn://angelo.felspar.net/deleted.txt svn://angelo.felspar.net/deleted.txt --revision 12345

That's great, but how do you find the revision number? Well, that's pretty easy too, you just do this:

svn log -v svn://angelo.felspar.net/

This is all well and good, but it's pretty hard to do programmatically. You have to issue the second command and pipe the content into a file and then read that back. Note though that the URIs you find in that file have been decoded from UTF-8 and it's anybody's guess which character encoding the file uses.
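
For what it's worth, the sort of code you end up writing looks something like the sketch below. It assumes a popen-style pipe and the changed-path layout the command line client prints (lines such as "   D /deleted.txt" under an "r12345 | …" header), and it takes the repository-relative path rather than the full URI:

#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>

// Scan "svn log -v" output for the newest revision in which the given
// repository-relative path appears with any action other than a delete.
// That is the revision to copy the file back from. Returns -1 on failure.
long last_revision_containing(const std::string &repository, const std::string &path) {
    std::string command = "svn log -v " + repository;
    FILE *pipe = _popen(command.c_str(), "r");  // popen() outside Windows
    if (!pipe) return -1;
    long revision = -1, found = -1;
    char buffer[1024];
    while (found < 0 && std::fgets(buffer, sizeof(buffer), pipe)) {
        std::string line(buffer);
        if (line.size() > 1 && line[0] == 'r' &&
                std::isdigit(static_cast<unsigned char>(line[1])))
            revision = std::atol(line.c_str() + 1);   // "r12345 | user | ..."
        else if (line.find(path) != std::string::npos
                 && line.compare(0, 5, "   D ") != 0)
            found = revision;
    }
    _pclose(pipe);
    return found;
}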

It would be really helpful if one of the following two would work:

svn copy svn://angelo.felspar.net/deleted.txt --revision PREV
svn recover svn://angelo.felspar.net/deleted.txt

Audit

It's virtually impossible to audit that the directory transformations have occurred correctly, as we don't have full information from SourceSafe, but we do need to ensure that we have the right end result. To this end the tool has an option to perform a full get using SourceSafe and an export from Subversion once the migration is complete, but before the final tidying up is done.

A tool like WinMerge can be used to determine that the two structures are the same. There may be some minor differences (for example with pinned files). You should be able to explain all differences and make a decision about what to do about them.

Most problems with the tool would normally result in it getting confused and not being able to complete the migration. This is the final stage that should give you confidence that the migration has worked as it should before you start to work in the new repository.

(other) Potential problems

There are a few assumptions that the current version of the software makes. You should check these through an import to a test repository for every SourceSafe database before you migrate to your production Subversion repositories.

Character encodings

The tool is Unicode, Subversion uses UTF-8 internally and SourceSafe uses… well who knows? If you have different people checking things in to SourceSafe who use different locale settings then either SourceSafe handles it properly or you're already having problems.

Once you've made the switch to Subversion then these character encoding problems should largely disappear, but I haven't managed to work out all of the encoding issues involved in the transition properly as I can only go from the concrete examples I've seen.

I can promise you that the tool's Unicode handling is good. If you can get the SourceSafe history files into a known encoding which you can then read in as Unicode, the tool should handle everything properly, giving Subversion proper UTF-8 encoded URIs and writing UTF-8 encoded comments. The other command line programs that are used should also work properly as they are all executed through the Windows Unicode API.
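
By way of illustration (this is a sketch of the approach, not the tool's actual code), running a helper through the wide character API looks roughly like this:

#include <string>
#include <vector>
#include <windows.h>

// Launch a command line such as L"svn.exe import ..." via the Unicode API so
// that file names are not mangled by the console code page on the way through.
bool run_unicode_command(const std::wstring &command_line) {
    std::vector<wchar_t> buffer(command_line.begin(), command_line.end());
    buffer.push_back(L'\0');               // CreateProcessW may modify this buffer
    STARTUPINFOW startup = { sizeof(startup) };
    PROCESS_INFORMATION process = { 0 };
    if (!CreateProcessW(NULL, &buffer[0], NULL, NULL, FALSE, 0,
                        NULL, NULL, &startup, &process))
        return false;
    WaitForSingleObject(process.hProcess, INFINITE);
    DWORD exit_code = 1;
    GetExitCodeProcess(process.hProcess, &exit_code);
    CloseHandle(process.hThread);
    CloseHandle(process.hProcess);
    return exit_code == 0;
}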

Fingers crossed…

Dates and times

There are a number of dependencies for the dates used to order the changes. Check the following items:

  • Dates are reported in European format, i.e. dd/mm/yy.
  • SourceSafe reports years as only the final two digits. If you have any activity in a SourceSafe database before 2000 CE you will need altered date parsing. The change should be fairly trivial (see the sketch after this list).
  • The locations of the date and time parts are hard coded for the normal report layouts. I don't know what happens if there is a long user name in a history report.
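
The date parsing assumption referred to above boils down to something like this sketch (dd/mm/yy ordering, two-digit years taken to mean 2000 onwards):

#include <cstdio>
#include <ctime>
#include <string>

// Parse a "11/05/06  14:16" style stamp from a history report, assuming
// dd/mm/yy ordering and that every two-digit year means 2000 or later.
bool parse_stamp(const std::string &text, std::tm &stamp) {
    int d, m, y, hh, mi;
    if (std::sscanf(text.c_str(), "%d/%d/%d %d:%d", &d, &m, &y, &hh, &mi) != 5)
        return false;
    stamp = std::tm();
    stamp.tm_mday = d;
    stamp.tm_mon  = m - 1;
    stamp.tm_year = (2000 + y) - 1900;   // this is the line to change for pre-2000 data
    stamp.tm_hour = hh;
    stamp.tm_min  = mi;
    return true;
}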

Parsing problems

There are a number of places where the tool needs to parse the output of other programs.

  • SourceSafe histories follow a pattern that the parser then looks for, but if you use lots of stars in history comments the parser may get confused.
  • Subversion histories are structured in a much better format for machine reading. If there are changes to this format then the parser may fail. It would probably be better to use the Subversion API directly to avoid all of this.

SourceSafe gotchas

These are mostly mentioned in the text, but just for quick reference here is a recap:

  • Files whose names start with tildes may cause problems. Visual Studio .NET does some odd things with temporary files so delete instructions for these files will not be replayed. If you have files starting with tildes and they have never been deleted you will be fine. Note that there is no filter for directories (projects) starting with a tilde.
  • Try to decide what to do with pinned files first, but don't worry too much about them as they will show up as discrepancies during the final audit.
  • You may need to change some time stamps for events in some circumstances. Check out the explanation under the Sharing heading above.
