1. How to use this manual

This is the long-form manual for the reposurgeon tool suite.

Everybody should read the Introduction (immediately after this section) to be sure reposurgeon is the tool you actually want.

Then read the Quick start section (immediately following the Introduction) to get a feeling for the syntax and style of reposurgeon commands.

Assuming you’re trying to do a repository lift, you should probably continue by reading A Guide to Repository Conversion. It is not necessary to read the main body of the manual that follows it (the command reference) straight through, but keep it handy for when you need to learn more about a particular command or group of commands.

If you’re doing something more unusual than a conversion, you’ll probably have to read through the command reference until you discover the commands you need.

Help is available within reposurgeon using the "help" command. Type "help" alone for a list of help topics.


2. Introduction

The purpose of reposurgeon is to enable risky operations that VCSes (version-control systems) don’t want to let you do, such as (a) editing past comments and metadata, (b) excising commits, (c) coalescing and splitting commits, (d) removing files and subtrees from repo history, (e) merging or grafting two or more repos, and (f) cutting a repo in two by cutting a parent-child link, preserving the branch structure of both child repos.

A major use of reposurgeon is to assist a human operator to perform higher-quality conversions among version-control systems than can be achieved with fully automated converters. Another application is when code needs to be removed from a repository for legal or policy reasons.

Fully supported systems (those for which reposurgeon can both read and write repositories and the support has been tested) include git, hg, bzr, brz, fossil, darcs, RCS, and SRC. There is tested read-side support for SCCS, CVS and svn. There is untested read-write support for BitKeeper and untested read support for Monotone. See Supported Version Control Systems for more details.

Writing to the file-oriented systems RCS and SRC has some serious limitations because those systems cannot represent all the metadata in a git-fast-export stream. They require rcs-fast-import as a back end; consult that tool’s documentation for details and partial workarounds.

SCCS, SVN, and CVS are supported for read only, not write. For CVS, reposurgeon must be run from within a repository directory (that is, a tree of CVS masters; a CVSROOT is not required). When reading from a CVS top-level directory each module becomes a subdirectory in the reposurgeon representation of the change history. It is also possible to read a repository from within a CVS module subdirectory and lift that individual module.

Perforce (p4) is not directly supported, but the auxiliary program repotool(1) can mirror a p4 repository as a local git repository.

Note that reposurgeon is a sharp enough tool to cut you. It never modifies a repository in place, and it takes care not to ever write a repository in an actually inconsistent state, and will terminate with an error message rather than proceed when its internal data structures are confused. However, there are lots of things you can do with it - like altering stored commit timestamps so they no longer match the commit sequence - that are likely to cause havoc after you’re done. Proceed with caution and check your work.

Also note that, if your DVCS does the usual thing of making commit IDs a cryptographic hash of content and parent links, editing a publicly-accessible repository with this tool would be a bad idea. All of the surgical operations in reposurgeon will modify the hash chains.

Please also see the notes on system-specific issues under Limitations and guarantees.

3. Quick start

reposurgeon accepts commands as command line arguments. For example:

$ reposurgeon "read projectdir" lint

This reads the repository in projectdir into memory and checks it for problems. The quotes are required so that the read command and its argument are presented to the command interpreter as a single string.

You can use "help" and "help <command>" to learn more about reposurgeon commands, or just read on.

Typically you put these commands in a script that you evolve by experimenting until you get a conversion that suits your tastes and needs. The "script" command reads such a command list from a file.

$ reposurgeon "do projectscript.lift"

As a motivating example, the script below does a Subversion-to-Git lift on a repository named project just under the current directory.

# Load the project into main memory
# Warning: this command is slow because Subversion is slow.
read project

# Transcode Latin-1 characters in comments and attributions
# to UTF-8.
=C transcode latin1

# Check for and report glitches such as timestamp collisions,
# ill-formed committer/author IDs, multiple roots, etc.
lint

# Map Subversion usernames to Git-style user IDs.  In a real
# conversion you'd probably have a lot more of these and you'd
# probably read them in from a separate file, not a heredoc.
authors read <<EOF
fred = Fred Foonly <fred@foonly.net> America/Chicago
jrh = J. Random Hacker <jrh@random.org> America/Los_Angeles
esr = Eric S. Raymond <esr@thyrsus.com> America/New_York
julien = Julien '_FrnchFrgg_' RIVAUD <julien@frnchfrgg.pw> Europe/Paris
db48x = Daniel Brooks <db48x@db48x.net> America/Los_Angeles
ecree = Edward Cree <ecree@solarflare.com> Europe/London
ian = Ian Bruene <ianbruene@gmail.com>
sunshine = Eric Sunshine <sunshine@sunshineco.com>
EOF

# Massage comments into Git-like form (with a topic sentence and a
# spacer line after it if there is following running text). Only
# done when the first line is syntactically recognizable as a whole
# sentence.
gitify

# Tags with the name prefix emptycommit were branch-creation commits
# in Subversion. Usually there's nothing interesting in the comment
# text, but you'll want to browse them and check.  These commands
# save one such tag and delete the rest
tag move emptycommit-23 noteworthy
delete tag /emptycommit/

# Delete remnant .cvsignore files from a past life as CVS.
delete path /.cvsignore$/

# Rename and translate the syntax of ignore files
ignores --defaults --translate

# Often, your Subversion repository was a CVS repository in a past
# life. CVS creates tags named with the suffix "-root" to mark branch
# points for its internal housekeeping, and cvs2svn blindly copied them
# even though the Subversion tools don't need that marker. This
# clutter can be tossed.
tag delete /-root$/

# This command illustrates how to use msgin to modify the comment
# text of a commit. In this test we're patching a Subversion revision
# reference because we're going to want to reference-lift it later.
# But this capability could also be used, for example, to add an
# update note to a commit comment that turned out to be incorrect.
msgin <<EOF
Legacy-ID: 23
Check-Text: Referring back to r2.

Referring back to [[SVN:2]].
EOF

# Ensure that all author dates are unique, bumping where necessary. Then
# change cookies like [[SVN:2]] into action stamps that are independent
# of the VCS in use. A typical action stamp looks like this:
# "jrh@random.org!2006-08-14T02:34:56Z".
timequake
stampify

# Sometimes it's useful to drop files from the repo that should
# never have been checked in. 1..$ - selects all commits.
1..$ delete path /documents/.*.pdf/

# Process GNU Changelogs to get better attributions for changesets.
# When a commit was derived from a patch and checked in by someone
# other than its author this can often correct the commit attribution.
changelogs

# It's good practice to add a tag marking the point of conversion.
create tag cutover-git
msgin <<EOF
Tag-Name: cutover-git
Tagger: J. Random Hacker <jrh@random.org> America/Los_Angeles

This tag marks the last Subversion commit before the move to Git.
EOF

# We want to write a Git repository
prefer git

# Do it
rebuild project-git

You can find more usage examples in Tips and Tricks.
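The action stamp format mentioned in the script comments above can be illustrated outside reposurgeon. The following is a minimal Python sketch, not reposurgeon's own code. It assumes only what the example stamp shows: an email address, a "!" separator, and a UTC timestamp in ISO 8601 form. The function names are hypothetical.

```python
from datetime import datetime, timezone


def make_action_stamp(email: str, when: datetime) -> str:
    """Build a stamp of the form email!YYYY-MM-DDTHH:MM:SSZ (UTC).

    'when' should be timezone-aware; it is converted to UTC first.
    """
    utc = when.astimezone(timezone.utc)
    return f"{email}!{utc.strftime('%Y-%m-%dT%H:%M:%SZ')}"


def parse_action_stamp(stamp: str):
    """Split a stamp into (email, timezone-aware UTC datetime)."""
    email, _, timestamp = stamp.partition("!")
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return email, when.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    moment = datetime(2006, 8, 14, 2, 34, 56, tzinfo=timezone.utc)
    print(make_action_stamp("jrh@random.org", moment))
```

Because a stamp of this shape names a committer and an instant rather than a VCS-specific revision number, it stays meaningful after the history moves to another system.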

4. Supported Version Control Systems

The design objective of reposurgeon is to support history interconversion and repository surgery on every VCS in use on Unix systems. This is a large, messy domain. Only a few systems have top-tier support with good coverage in the reposurgeon test suite; others are theoretically supported but only lightly tested and will probably require reposurgeon development work when they are first encountered in the wild.

4.1. Git

The internal representation reposurgeon uses for repository states losslessly captures Git histories. Both import and export are fully supported.

Git has top-tier support with excellent testing.

4.2. CVS

Reposurgeon can import, but not export, CVS repositories. You must have a copy of the repository master directory, and run reposurgeon pointed at that directory or one of its module subdirectories; a checkout directory won’t do.

Beware that the quality of CVS lifts may be poor, with individual lifts requiring serious hand-hacking. This is due to inherent problems with CVS’s file-oriented model and the fragility of its repository structures; reposurgeon cannot entirely compensate for these flaws. See Working with CVS (cvs) for discussion and workarounds.

CVS has top-tier support with good testing. It has been used for many CVS-to-Git conversions.

It is extremely unlikely that export to CVS will ever be implemented. (This means: somebody would have to pay the maintainer real money to make it happen.)

4.3. Subversion

Reposurgeon can import Subversion dumpfiles or repositories. If you want it to read a repository, you must run it within the top-level directory of the repository itself, not in a checkout directory made from the repository.

Unlike CVS, Subversion repositories have real changesets, and the work in them can effectively always be mapped to equivalent commits in a more modern system. The parent-child relationships among commits will also translate cleanly.

Operator errors, prior conversion with tools like cvs2svn, and gatewaying with git-svn can leave cruft in a Subversion history that needs to be cleaned out. See Working with Subversion (svn) for discussion and workarounds.

Subversion has top-tier support with excellent testing. It has been used for many, many Subversion lifts to Git; this is the best-tested of any of the conversion paths.

It is not likely that export to Subversion will ever be implemented. (This means: somebody would have to pay the maintainer real money to make it happen, but there is a plausible future in which that happens.)

4.4. Mercurial

Mercurial (hg): reposurgeon can import and export Mercurial repositories; see Working with Mercurial (hg) for details. Reposurgeon has been used for successful history conversions from Mercurial to Git.

Mercurial has top-tier support and good test coverage.


4.5. Bazaar/Breezy

Bazaar and Breezy (bzr, brz): These are workalikes. Import and export have only basic test coverage, but reposurgeon has been successfully used for a major Bazaar-to-Git conversion (the history of Emacs).

Some Bazaar extensions - per-commit properties and multiple authors - are directly supported in reposurgeon. On the other hand, nothing in Bazaar corresponds to a Git annotated tag, so those would be lost in a Git-to-Bazaar conversion. It is not believed that any such thing has ever been attempted.

Bazaar has top-tier support, and Breezy is expected to replicate its behavior closely enough to inherit that. Test coverage is only basic, however.

4.6. Second-tier systems

A VCS is "second tier" if import capability is checked in the reposurgeon test suite, but it hasn’t had known use for a conversion in the wild.

RCS: reposurgeon will read an RCS collection. It uses SRC to get and export a fast-import stream to reposurgeon’s importer. This importer does not attempt to make changesets out of cliques of per-file commits; you’ll have to do this yourself using reposurgeon’s coalesce command.

SRC: reposurgeon can read an SRC collection. The caveats for RCS apply.

Writing to the file-oriented SRC and RCS systems can be done via rcs-fast-import(1). It has some serious limitations because RCS cannot represent all the metadata in a git-fast-export stream. Consult that tool’s documentation for details and partial workarounds.

SCCS: reposurgeon will read an SCCS collection. It uses SRC to get and export a fast-import stream to reposurgeon’s importer. This importer does not attempt to make changesets out of cliques of per-file commits; you’ll have to do this yourself using reposurgeon’s coalesce command.

Fossil: reposurgeon declares an importer-exporter pair, and will read from a Fossil checkout directory. It uses the native Fossil exporter, which is pretty good but exports only versioned ignore patterns (in .fossil-settings/ignore-glob), not the unversioned ignore-globs set through the Web interface. The Fossil tag model has complexities that an export to Git or other systems will not fully capture. Export also leaves behind other data such as the wiki content, events, and tickets. Also be aware that Fossil does not have separate committer and author fields - that distinction will be lost if you export to it.

darcs: reposurgeon declares an importer-exporter pair, but the capability has only been lightly tested. There are almost certainly undiscovered data-model issues here. There has been no motivation to diagnose these problems, as darcs has seen little use since 2010 and may be extinct in the wild. The support in reposurgeon is maintained in case it is necessary to rescue a legacy darcs history.

4.7. Third-tier systems

These have no coverage in the test suite. Significant issues can be expected.

BitKeeper (bk): As of version 7.3 (and probably earlier versions under open-source licensing) BitKeeper has fast-import and fast-export subcommands, and reposurgeon now knows how to use these. Note that the BitKeeper support is untested - the exporter-importer pair may have unknown bugs.

At one point reposurgeon was used to convert a BitKeeper history (that of the NTP reference implementation), but that was via a custom third-party exporter that no longer works rather than the one that was later built in. The quality of the built-in exporter is unknown.

Monotone (mtn): reposurgeon declares an importer-exporter pair, but the capability has not been tested. Monotone never achieved wide usage and in 2023 appears to have been moribund for over a decade; support for it may be dropped in a future release.

4.8. Indirect support

Perforce (p4): repotool can be used to mirror a remote p4 repository as a local Git repository, and to incrementally resync the mirror; consult the repotool manual page. This support is experimental; it is unknown to the author what (if any) reposurgeon cleanup operations might be required, but a skim of Perforce documentation suggests that mapping Perforce user IDs to a Git-style name/address pair will be desirable.

AccuRev: There are a couple of tools for translating AccuRev repositories to live Git repositories. Of these the most developed appears to be called "ac2git"; you should be able to find it with a search engine. We recommend you use this tool to get a first-pass conversion to Git, then use reposurgeon to clean up the result. The ac2git converter’s goal is to produce an accurate representation of a collection of AccuRev streams, as multiple disconnected git branches; while there is functionality to identify branch and merge points, actually weaving the streams into a single DAG is something best done in reposurgeon.

4.9. Unsupported VCSes

Pijul, which is still new and in intensive development as of late 2023, will be supported when it grows an importer/exporter pair for Git fast-import streams.

Arch, ArX, DCVS: these systems were open-source but have been dead since 2006 (with Arch officially deprecated by its maintainer in favor of Bazaar) and no revival of any of them seems likely. There is no support for these and no plan to write any.

QVCS is open source but (as of late 2023) has been moribund since 2010. There is no support and no plan to write any.

Code Coop: open-source but undocumented. As of late 2023 it has been untouched since 2018. There is no support and no plan to write any.

AutoDesk Vault, Azure DevOps Server, CADES, Dimensions CM, ClearCase, PVCS, StarTeam, Vault, Visual SourceSafe, Unity Version Control: These proprietary systems are mentioned for completeness. There is no support for these and no plan to write any. This could change under the following circumstances: either the vendor supplies an exporter/importer pair, or the vendor documents a CLI sufficient for an extractor class. These will probably never have first-class support, unless somebody with a lot of money decides to pay for it.

5. A Guide to Repository Conversion

One of the main uses for reposurgeon is converting repositories between different version-control systems. Since around 2015 this has almost always meant converting from something else to Git, but reposurgeon keeps other possibilities open.

This section is a guide to up-converting your repository, and adopting practices that will reduce process friction to a minimum. It is meant to provide context for the description of reposurgeon’s features in later sections.

If you are aiming at something other than a repository conversion, you can safely skip this section.

In 90% of cases you’ll be converting from CVS or Subversion, and those are the cases we’ll discuss in detail.

5.1. Why convert with reposurgeon?

Reposurgeon is more difficult to use than any of the dozens of fully-automated conversion tools out there; you have to make choices and compose a recipe. This section explains why it’s worth the bother.

In brief, it’s because fully-automated converters don’t work very well. They are very poor at dealing with the ontological mismatches between the data models of different version-control systems. For detailed discussion of the technical flaws in many common converters, see Appendix A.

In particular: reposurgeon is the only conversion tool that handles multi-branch Subversion repositories in full generality. It can even correctly translate Subversion commits that alter multiple branches.

But even automated converters that are relatively good at bridging data-model differences tend to produce crude, jackleg, unidiomatic conversions that make the seam between the pre-conversion and post-conversion parts of the repository very obvious.

A central example of this is commit references in change comments. These references convey important information to anyone reading the comments, and it is correspondingly important to change them from using the reference format of the old system to one that is intelligible in whatever your new one is.

As another example, git has a convention about the form of change comments; they’re supposed to consist of a standalone summary line, followed optionally by a spacer blank line and running text. Git relies on this convention to produce log summaries that are easy to take in at a glance.
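The convention just described is simple enough to check mechanically. The sketch below is an illustration only and is not part of reposurgeon; the function names are hypothetical. It treats a comment as Git-like when it has a non-empty first line and any further text is separated from that line by a blank spacer line.

```python
def is_git_like(comment: str) -> bool:
    """True if the comment is a summary line, optionally followed by
    a blank spacer line and running text."""
    lines = comment.strip("\n").split("\n")
    if not lines[0].strip():
        return False          # no summary line at all
    if len(lines) == 1:
        return True           # a lone summary line is fine
    return lines[1].strip() == ""  # anything more needs the spacer


def summary_line(comment: str) -> str:
    """Return the first line, which log summaries would display."""
    return comment.strip("\n").split("\n")[0]


if __name__ == "__main__":
    good = "Fix crash on empty input\n\nThe tokenizer assumed one token.\n"
    bad = "Fix crash on empty input\nThe tokenizer assumed one token.\n"
    print(is_git_like(good), is_git_like(bad))
```

A check like this can flag candidates for editing, but deciding how to reword a comment that fails it still takes human judgment.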

Older version-control systems don’t have this convention. An ideal conversion changes as many comments as possible to be in Git-like form so that the Git summary tools see the data regularity they want. But this kind of editing can’t be fully automated. The best you can hope for, if you want to do it right, is that your tool automates as much of this fixup as it can and assists a human operator in applying the rest.

Neither reference-lifting nor patching comments for Git-friendliness is a process that can be fully automated. Both require human judgment; accordingly, fully-automated converters don’t even try to do the right thing. The result is often a history that is full of unpleasant little speedbumps and distractions. These induce wasted developer effort and, correspondingly, higher defect rates.

On the other hand, a skilled operator of reposurgeon can produce a conversion that is fully idiomatic in the target system, significantly lowering future friction costs for developers browsing the history.

One fully automated reposurgeon feature of some significance that no other importer supports is that it can parse ChangeLogs in histories which use that Free Software Foundation convention, and use the attributions in them to fill in Git author fields. This recovers better information about the provenance of changesets corresponding to patches committed by a project developer (who continues to be recorded as the committer of that changeset).

5.2. Commercial Note

If you are an organization that pays programmers and has a requirement to do a repository conversion, the author can be engaged to perform or assist with the transition. You are likely to find this is more efficient than paying someone in-house for the time required to learn the tools and procedures. I (the author) have been very open about my methods here, but nothing substitutes for experience when you need it done quickly and right.

If you are wondering why it’s worth spending any money at all for a real history conversion, as opposed to just starting a new repository with a snapshot of the old head revision, the answer comes down to two words: risk management.

Suppose you do a snapshot conversion, head revision only. Then you get a regression report with a way to reproduce the problem. What you want to do is bisect in the new history to identify the revision where the bug was introduced, because knowing what the breaking change was makes a fix far easier. Bzzzt! You can’t. That history is missing in the new system.

Yes, in theory you could run a manual bisection using bracketing builds in new and old repositories. Until you have tried this, you will have no comprehension of how easy it is to get that process slightly but fatally wrong, and (actually more importantly) how difficult it is to be sure you haven’t gotten it wrong. This is the kind of friction cost that sounds minor until the first time it blows up on you and eats man-weeks of NRE.

So congratulations, tracing the bug just got an order of magnitude more expensive in engineer time, and your expected time to fix changed proportionally. It typically only takes one of these incidents to justify the up-front cost of having had the conversion done right.

If you go the snapshot-conversion route, maybe you’ll get lucky and never need visibility further back. Or maybe you’ll have a disaster because you increased the friction costs of debugging just enough that you, say, miss a critical ship date. The more experienced with in-the-trenches software development you are, the more plausible that second scenario will sound.

A subtler issue is that by losing the old change comments you have thrown away a great deal of hard-won knowledge about why your code is written the way it is. Again, this may never matter – but if it does, it’s going to bite you on the butt, hard, probably when you least expect it.

And if you’re thinking "No problem, the old repository will still be around"…​heh. Repositories that have become seldom-accessed are like other kinds of dead storage in that they have a way of quietly disappearing because after a few job turnovers the knowledge of why they’re important is lost. Typically you don’t find out this has happened until you have an unanticipated urgent need, at which point whatever trouble you were in gets deeper.

Spending the relatively small amount it takes to have a proper full conversion done right is a way of bounding your downside risk. If you aren’t a software engineer and had trouble following the preceding argument, propose a snapshot conversion to the engineer you trust the most and watch that person reaching for a diplomatic way to tell you it’s a stupid, shortsighted idea.

5.3. Step Zero: Preparation

Make sure the tools in the reposurgeon suite (especially reposurgeon and repotool) are on your $PATH.
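One way to verify this before going further is a small check script. The following Python sketch is an illustration, not something shipped with reposurgeon; the function name is hypothetical, and the default tool list contains only the two programs named above.

```python
import shutil


def missing_tools(names=("reposurgeon", "repotool")):
    """Return the names that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on $PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```

You could extend the tuple with whatever front end your source VCS needs.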

Create a scratch directory for your conversion work.

Run "repotool initmake" in the scratch directory; it requires that you follow the initmake verb with a project name. This will create a Makefile designed to sequence your conversion, and an empty lift script. Then set the variables near the top appropriately for your project.

This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.

The most important variables to set in the Makefile are the ones that set up local mirroring of your repository. The repotool command has a mode that handles the details of making (and, when necessary, updating) a local mirror. To enable this you need to fill in either REMOTE_URL or CVS_HOST and CVS_MODULE; read the header comment of the conversion makefile for details.

If you’re lifting a Subversion repository you can specify the repository URL in either of two ways: as a standard Subversion repository URL (service prefix "svn:") or as an rsync URL pointing at the same repository master directory (service prefix "rsync:"). Usually rsync mirroring is faster, but it depends on sshd running at the server end and may bring in complications around its security features. Use a "svn:" URL, which uses mirroring by svnsync, when you can’t make rsync work.

Note by the way that repotool mirror’s "rsync:" URLs do not have the requirement for a remote rsyncd that an rsync URL fed directly to rsync itself does; internally, repotool turns them into a single-colon host plus path rsync source specification.
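The rewriting described above can be pictured with a short Python sketch. This is not repotool's code, and the exact string repotool produces is an assumption here; the sketch only shows what "single-colon host plus path" means for a well-formed "rsync:" URL. The function name and the example host are hypothetical.

```python
from urllib.parse import urlsplit


def rsync_url_to_source(url: str) -> str:
    """Turn rsync://host/path into the host:/path form that rsync
    accepts for transfers over a remote shell."""
    parts = urlsplit(url)
    if parts.scheme != "rsync" or not parts.netloc:
        raise ValueError(f"not an rsync URL: {url!r}")
    # netloc keeps any user@ prefix; path keeps its leading slash.
    return f"{parts.netloc}:{parts.path}"


if __name__ == "__main__":
    print(rsync_url_to_source("rsync://svn.example.org/srv/svn/project"))
```

The single-colon form makes rsync connect through a remote shell such as ssh, which is why no rsync daemon is needed at the far end.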

If what you have is a Subversion checkout directory, the following command will tell you the URL for its source repository:

svn info --show-item repos-root-url

If your first access attempt failed with some message about access control, you may need to uncomment the RUSERNAME and RPASSWORD variables and set them to a valid credential pair.

Later, you will put your custom commands in the lift script file. Doing this helps you not lose older steps as you experiment with newer ones, and it documents what you did.

Doing a high-quality repository conversion is not a simple job, and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.

In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) in these instructions with your project name.

You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.

5.4. Step One: The Author Map

You can skip this section if you’re converting from a DVCS.

SCCS, RCS, CVS, and Subversion identify users by a Unix login name local to the repository host; DVCSes use pairs of fullnames and email addresses. Before you can finish your conversion, you’ll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you’re converting. Each line should be in the following form:

foonly = Fred Foonly <foonly@foobar.com>

You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
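To make the line format concrete, here is a minimal Python sketch of a parser for it. It is an illustration under stated assumptions, not reposurgeon's parser: the regular expression, the AuthorEntry type, and the function name are all hypothetical, and the timezone field is returned as an uninterpreted string.

```python
import re
from typing import NamedTuple, Optional


class AuthorEntry(NamedTuple):
    login: str
    fullname: str
    email: str
    timezone: Optional[str]  # offset like "-0500" or a zone name, if given


# login = Full Name <email> [timezone]
_ENTRY = re.compile(
    r"^(?P<login>\S+)\s*=\s*(?P<name>.*?)\s*"
    r"<(?P<email>[^>]+)>(?:\s+(?P<tz>\S+))?\s*$"
)


def parse_author_line(line: str) -> AuthorEntry:
    """Parse one author-map line into its fields."""
    match = _ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"malformed author-map line: {line!r}")
    return AuthorEntry(match["login"], match["name"],
                       match["email"], match["tz"])


if __name__ == "__main__":
    print(parse_author_line("foonly = Fred Foonly <foonly@foobar.com>"))
    print(parse_author_line(
        "fred = Fred Foonly <fred@foonly.net> America/Chicago"))
```

Running a check like this over a draft map is a cheap way to catch a missing angle bracket or equals sign before starting a long lift.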

Using the generic Makefile for Subversion, "make stubmap" will generate a start on an author-map file as $(PROJECT).map. Edit in real names and addresses (and optionally offsets) to the right of the equals signs.

How best to get this information will vary depending on your situation.

  • If you can get shell access to the repository host, looking at /etc/passwd will give you the real name corresponding to each username of a developer still active: usually you can simply append @ and the repository hostname to each username to get a valid email address. You can do this automatically, and merge in real names from the password file, using the 'repomapper' tool from the reposurgeon distribution.

  • If the repository is owned by a project on a forge site, you can usually get the real name information through the Web interface; try looking for the project membership or developer’s list information.

  • If the project has a development mailing list, posting your incomplete map with a request for completions often gives good results.

  • If you can download the archives of the project’s development mailing list, grepping out all the From addresses may suggest some obvious matches with otherwise unknown usernames. You may also be able to get timezone offsets from the date stamps on the mail. The repomapper tool can mine matching addresses from mailbox files automatically, though it does not extract timezones.
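The first approach in the list above (real names from the password file, with "@" and the repository hostname appended to each username) can be sketched as follows. This is an illustration only and does not reproduce what repomapper does; the function name, the sample data, and the hostname are hypothetical. It assumes the usual passwd layout, in which the fifth colon-separated field holds the real name, possibly followed by comma-separated extras.

```python
def stub_author_map(usernames, passwd_text, hostname):
    """Return author-map lines, guessing name and address per user."""
    realnames = {}
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 5 and fields[4]:
            # Keep only the name part of the fifth field.
            realnames[fields[0]] = fields[4].split(",")[0]
    lines = []
    for user in usernames:
        name = realnames.get(user, user)  # fall back to the login name
        lines.append(f"{user} = {name} <{user}@{hostname}>")
    return lines


if __name__ == "__main__":
    passwd = "fred:x:1000:1000:Fred Foonly,,,:/home/fred:/bin/sh\n"
    for entry in stub_author_map(["fred", "ghost"], passwd, "foonly.net"):
        print(entry)
```

Entries that fall back to the login name are the ones to chase down by the other methods in the list.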

If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name, a preferred email address to be used in the mapping, and a timezone offset. The reason for this is that some sites, like OpenHub, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.

Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.

5.5. Step Two: Conversion

Install whatever front end reposurgeon needs to read your repository. That will usually be cvs-fast-export for CVS, or the VCS tool itself for Subversion and other systems. If you don’t have a tool you need, reposurgeon will bail out gracefully and inform you what’s missing.

The generic-workflow Makefile will call reposurgeon for you, interpreting your $(PROJECT).lift file, when you type "make". You may have to watch the baton spin for a few minutes. For very large repositories it could be more than a few minutes.

This will convert your repository to git. If you need to export to something else, reposurgeon has write support for several other modern VCSes.

If you are exporting from CVS, it may be a good idea to run some trial conversions with cvsconvert, a wrapper script shipped with cvs-fast-export. This script runs a conversion direct to git; the advantage is that it can do a comparison of the repository histories and identify problems for you to fix in your lift script. You probably don’t want to use this for final conversion, though, as it does not clean up CVS junk tags, perform reference lifting, or Gitify comments.

For more detailed discussion of CVS conversion and troubleshooting see Working with CVS (cvs).

On the other hand, while Subversion conversion has its complexities it is robust and well-tested. Normally reposurgeon will do a complete branch analysis for you. On most Subversion repositories, and in particular anything with a standard trunk/tags/branches layout, it will do the right thing. It will also cope with adventitious branches in the root directory of the repo, such as some projects use for website content.

There is, however, a minor problem around tags, and a slightly more significant problem around Subversion merges. Also, some Subversion repositories are multi-project with a nonstandard directory layout.

For more detailed discussion of Subversion conversion and troubleshooting see Working with Subversion (svn). For discussion of handling multiproject repositories, see Multiproject Subversion repositories.

5.6. Step Three: Sanity Checking

Before putting in further effort on polishing your conversion and putting it into production, you should check it for basic correctness.

Pay attention to error messages emitted during the lift. Most of these, and remedial actions to take, are described in this guide.

For Subversion lifts, use the "headcompare", "tagscompare" and "branchescompare" productions to compare the converted with the unconverted repository. If you didn’t use the cvsconvert wrapper for your CVS lift, these productions have a similar effect. Be aware that these operations may be extremely slow on large Subversion repositories.

The only differences you should see are those due to keyword expansion and ignore-file lifting. If this is not true, you may have found a serious bug in either reposurgeon or the front end it used, or you might just have a misassigned tag that can be manually fixed. Consult How to report bugs for information on how to usefully report bugs.

Use reposurgeon’s ‘lint’ command to find anomalies like detached branches that may need manual correction.

If you are converting from CVS, use reposurgeon’s ‘view’ command to examine the conversion, looking (in particular) for misplaced tags or missing branch joins. Often these can be manually repaired with little effort. These flaws do 'not' necessarily imply bugs in cvs-fast-export or reposurgeon; they may simply indicate previously undetected malformations in the history. However, reporting them may help improve cvs-fast-export.

5.7. Step Four: Cleanup

You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it.

Here’s a checklist of cleanup steps, which we’ll expand on later in this section. If you’re using the makefile generated by repotool, some of these will be done by commands in your lift script.

5.7.1. Apply your author map

This step applies to SCCS, RCS, CVS, and SVN. Map author IDs from local to DVCS form. The reposurgeon command for this is "authors read"; see ‘authors read’.

5.7.2. Mine changelogs

If you don’t know what a GNU ChangeLog is you can skip this step.

If you’re converting from a VCS that doesn’t have separate committer and author attributions (e.g. CVS, SVN, Fossil), and the project you’re converting used the GNU ChangeLog convention, include "changelogs" somewhere in your lift script to capture authorship information.

The command:

changelogs

5.7.3. Clean up after legacy character sets

You may have metadata in your repository in an encoding incompatible with UTF-8. The most common glitch of this kind is Latin-1 characters in committer names.

Unfortunately, this process can’t be automated; the fact that the encoding of a text in a legacy character set like Latin-1 or GB18030 can’t be deduced just by looking at the bits is exactly what led to the development of Unicode and UTF-8.

The selector =I will choose all commits with attributions and comments that don’t decode to UTF-8 in both the commit comment and attribution parts. You can use =I list inspect to view them.

It’s up to you to figure out what the encoding is and apply the ‘transcode’ command to re-encode in UTF-8. You can test the results of applying different encodings using the --decode option of the ‘list’ or ‘msgout’ commands.

Example commands:

# Try out Shift_JIS decoding to see if it throws errors
=I msgout --decode=Shift_JIS

# Transcode from latin1 to UTF-8
=I transcode latin1
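
The trial-and-error nature of this step can be illustrated outside reposurgeon. The following Python sketch is not part of reposurgeon; the helper name trial_decodings and the sample bytes are hypothetical. It shows why the encoding can’t be deduced from the bits: the same byte string is rejected as UTF-8, yet decodes without complaint under more than one legacy codec, yielding different text each time.

```python
def trial_decodings(raw, codecs=("utf-8", "latin-1", "cp437")):
    """Try each codec on raw bytes; map codec name to decoded text, or None on failure."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Hypothetical committer name as it might appear in a legacy repository.
sample = b"J\xf6rg M\xfcller"
results = trial_decodings(sample)
for codec, text in results.items():
    print(f"{codec:8} {text!r}")
```

Only a human who knows the project’s history can say which of the successful decodings is the right one.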

5.7.4. Convert ignore files

If you’re using reposurgeon for a CVS- or Subversion-to-git conversion, cvs-fast-export and reposurgeon will convert ignore patterns for you automatically, both expressing Subversion svn:ignore and svn:global-ignores properties as .gitignore files and lifting .cvsignore files to .gitignore files.

Any .gitignore files found in a Subversion repository before conversion were almost certainly created by git-svn users ad-hoc and will be discarded by the importer; it is up to the human doing the conversion to look through them and rescue any ignore patterns that should be merged into the converted repository. This behavior can be reversed with the --user-ignores option, which will retain that information and merge it with the ignore patterns generated from svn:ignore and svn:global-ignores properties.

Other conversions may need this command:

ignores --translate --defaults

5.7.5. Remove junk tags

Your conversion may contain mechanically-generated tags (both lightweight and annotated) that convey no actual information. These are clutter and should be removed.

You can examine the repository’s tags with

list tags

CVS makes tags that are branch root markers for internal purposes. These are not deleted automatically by cvs-fast-export because implicit magic that deletes data generally turns out to be a bad idea. They often persist through up-conversions to SVN. This command will clean these up:

delete tag /-root$/

SVN commits with no fileops are automatically transformed into annotated tags when reading a Subversion repository. These are especially likely to occur if you are converting a Subversion repository that had a shady past as a CVS tree and was converted using cvs2svn. This command will clean these up:

delete tag /emptycommit/

5.7.6. Remove junk commits

Importers may generate commits that are empty (have no fileops) for various reasons. You will probably want to delete these; they’re preserved just in case something about the metadata is interesting.

You can review empty commits with:

=Z list

You can save any that have interesting comments as tags attached to their parent commit with ‘tagify’.

The rest of this instruction step applies only if you are converting a Subversion repository.

Junk commits generated by cvs2svn to carry tag information lurk in the history of many Subversion projects. When these junk commits are empty, either the importer will already have tagified them (in which case you cleaned them up in the previous step) or you just caught them with "=Z list".

Less commonly, generated junk commits have long lists of spurious delete fileops. These can be listed with the =D selector:

=D list

The trickiest kind has actual file content duplicating parent file versions, or referring randomly to file versions far older than the junk commit. You can sometimes spot these because they have commit comments that are empty or consist of the string "*** empty log message ***".

Either view or graph can be a good way to spot junk commits. Use them to eyeball a picture of the commit DAG - junk commits tend to stand out visually as leaf nodes in odd places. Be aware that the graph command outputs DOT, the language interpreted by the graphviz suite; you will need a DOT rendering program and an image viewer. Unfortunately, for large repositories the sheer size of the graph image makes this impractical.

5.7.7. Fix up references in commit comments

Different version-control systems have different conventions for how a commit comment can refer to a previous commit. Modern DVCSes usually use opaque hexadecimal strings that express some hash function of the content of the commit and its ancestors. Subversion, which is centralized, uses integer revision numbers starting from 1. CVS occasionally uses per-file versions that look like numerals separated by dots.

What all these have in common is that they stop being useful when you move your history to a new version-control system. There are two different ways to deal with this, and you need to make a policy decision about which fits your needs.

One way is to use ‘append’ to add a line to each commit giving its identifier. This allows you to chase references by searching in the text of comments. This is a good choice if you have references to commits outside the repository (say, in mailing-list archives) that you want to stay valid.

The other way is to convert these references into action stamps - identifiers that are independent of the VCS you are in. This has the advantage of being future-proof and cluttering your conversion less.

You do the latter with ‘stampify’. But there is a preparation step you have to do first, which is to hack the existing references into reference cookies that ‘stampify’ can see.

There’s a detailed description of this process and its commands at Reference lifting.

5.7.8. Massage comments into summary-line-plus-continuation form.

git and hg both encourage comments to begin with a summary line that can stand alone as a short description of the change; this practice produces more readable output from git log and hg log.

For a really high-quality conversion, multi-line comments should be edited into this form. Some of this can be done with ‘gitify’, but you’ll need to go through and fix remaining cases by hand. The =L selector is good for finding these.

Here is a workflow for this cleanup step:

  1. Do "=L msgout --id >message.box" to dump comments that lack a spacer line after the first to a mailbox file. The actual name of the mailbox file doesn’t matter of course.

  2. Use your preferred text editor to fix up the comments.

  3. Do "msgin <message.box" to apply the fixes.

5.7.9. Patch in missing branch merges.

You can skip this step if you’re converting from a DVCS; it will already have branch merges expressed. But CVS has no concept of merging branches at all. Under Subversion, reposurgeon tries to interpret svn:mergeinfo properties to create merges automatically, but Subversion merging is tricky enough that these properties are often not set when they could have been.

You can review your branch tip commits with the =H selector. By examining commits near in time to each tip, you may be able to identify when there is an unexpressed branch merge. Under Subversion, look for svn:mergeinfo properties as clues (but see the warnings about this at Working with Subversion (svn)).

You can use ‘merge’ to fix these up. This is safe because adding a merge link doesn’t modify any later content; it’s only used for ancestry tracking when computing diffs.

5.7.10. Subversion only: Review branch tip deletes and deletealls

In Subversion it is common practice to delete a branch directory when that line of development is finished or merged to trunk; this makes sense because it reduces the checkout size of the repo in later revisions.

In a DVCS, deletes at a branch tip don’t save you any storage, so it makes more sense to leave the branch with all of its tip content live if you’re not going to delete it entirely.

It’s a judgment call whether to delete these. They could be considered clutter, or they could be considered documentation of the fact that development on a branch has closed.

5.7.11. CVS only: Resolve zombie files

CVS conversions occasionally have zombie files - that is, files which persist in conversion commits after they were deleted in the CVS repository.

The easiest way to detect these is to run a trial conversion with cvsconvert(1). It will report content mismatches, including this kind. You will be able to fix them by patching D fileops into the history.

There is a more detailed discussion of this issue at Working with CVS (cvs).

5.7.12. CVS only: remove ~/.cvsrc history (optional)

Optionally, do this to remove a CVS relic:

delete path .cvsrc

This file, if it existed, set options for CVS commands. It will not be relevant in your new VCS and is just clutter.

5.7.13. Subversion only: check for a root branch

After conversion of a branchy Subversion repository, look to see if there is a 'root' branch. If there are any commits sufficiently odd that reposurgeon can’t figure out what branch they belong to, they’ll wind up there.

This command will tell you if you have a root branch in your conversion:

list names

If this happens, it’s likely those commits will be junk. It’s up to you to delete or transplant them appropriately.

5.7.14. Fix up or remove $-keyword cookies in the latest revision.

One minor feature you lose in moving from SCCS, RCS, CVS, Subversion, or BitKeeper to a DVCS is keyword expansion. There was a practice of embedding magic cookies in master files that would be expanded on checkout with various metadata like the commit date and committer ID. These are useless clutter under modern VCSes and should be removed.

You should go through the last revision of the code and remove $Id$, $Date$, $Revision$, and other keyword cookies lest they become unhelpful relics. The full Subversion set is $Date$, $Revision$, $Author$, $HeadURL$ and $Id$. CVS uses $Author$, $Date$, $Header$, $Id$, $Log$, $Revision$, also (rarely) $Locker$, $Name$, $RCSfile$, $Source$, and $State$. A command like grep -R '$[A-Z]' . may be helpful.
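
If you want something more selective than the grep above, here is a minimal Python sketch that reports only the keywords named in this section, in both their unexpanded ($Id$) and expanded ($Id: ... $) forms. It is not part of reposurgeon; the names KEYWORDS, COOKIE and find_keyword_cookies are hypothetical, and the sample text is made up for illustration.

```python
import re

# Keyword names taken from the Subversion and CVS lists above.
KEYWORDS = ("Author", "Date", "Header", "HeadURL", "Id", "Locker", "Log",
            "Name", "RCSfile", "Revision", "Source", "State")

# Matches $Keyword$ as well as the expanded form $Keyword: some text $.
COOKIE = re.compile(r"\$(" + "|".join(KEYWORDS) + r")(?::[^$\n]*)?\$")

def find_keyword_cookies(text):
    """Return (line number, keyword) pairs for every cookie found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in COOKIE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

sample = (
    "/* $Id: foo.c,v 1.4 2001/01/01 fred Exp $ */\n"
    "int cost = 5; /* in $USD */\n"
    "# $Revision$\n"
)
hits = find_keyword_cookies(sample)
print(hits)  # [(1, 'Id'), (3, 'Revision')]
```

Unlike the grep pattern, this ignores dollar signs that merely precede a capital letter, such as the $USD in the sample.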

5.7.15. Run lint to detect remaining anomalies

Run lint to detect remaining anomalies that might need to be patched.

5.7.16. Record the conversion into the history

It’s good practice to leave an annotated tag at the conversion point noting the date and time of the repo lift. See the next section on conversion comments for discussion.

Here’s an example of how to make an informative conversion tag:

msgin --create <<EOF
Tag-Name: git-conversion

Marks the spot at which this repository was converted from Subversion to git.

Conversion notes are enclosed in double square brackets. Junk commits
generated by cvs2svn have been removed, commit references have been
mapped into a uniform VCS-independent syntax, and some comments edited
into summary-plus-continuation form.
EOF

Experiments with reposurgeon suggest that git import doesn’t try to pack or otherwise optimize for space when it populates a repo from a dump file; this produces large repositories. Running git repack and git gc --aggressive can slim them down quite a lot.

5.7.17. Garbage-collect the conversion

If your target was git, run git gc --aggressive. This can reduce the repository’s size considerably.

5.7.18. A note on conversion comments

Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact.

It’s helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. Enclosing translation notes in [[ ]] is recommended; this has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.

5.7.19. A note on recovering from errors

Occasionally you’ll discover problems with a conversion after you’ve pushed it to a project’s hosting site, typically to a bare repo that the hosting software created for you. Here’s how to cope:

  1. Do your surgery on a copy of the repo with its .git/config pointing to the public location.

  2. Warn the public repo’s users that it is briefly going out of service, and they will need to re-clone it afterwards!

  3. Ensure that it is possible to force-push to the repository. How you do this will vary depending on your hosting site.

  4. On gitlab.com, under Settings, there is a "Protected Branches" item you can use. If you unprotect a branch, you can force-push to it.

    Elsewhere, you may be able to re-initialize the public repo (this works, for example, on SourceForge). You’ll need ssh access to the bare repo directory on the host - let’s suppose it’s 'myproject'. Pop up to the enclosing directory and do this:

        mv myproject myproject-hidden
        rm -fr myproject-hidden/*
        git init --bare myproject-hidden
        mv myproject-hidden myproject

    The point of doing it this way is (a) so you never actually remove myproject (on many hosts you will not have create permissions in the enclosing directory), and (b) so no user can update the repo while you’re clearing it (mv is atomic).

    Here’s a script that will do the job on SourceForge:

    #!/usr/bin/expect -f
    #
    # nuke - nuke a SourceForge repo
    #
    # usage: nuke project [userid]
    #
    
    if {$argc < 1} {
        puts "nuke: project name argument is required"
        exit 1
    } else {
        set project [lindex $argv 0]
        set user $env(USER)
        if {$argc >= 2} {
            set user [lindex $argv 1]
        }
    }
    
    set remoteprompt "bash-4.1"
    
    set timeout -1
    spawn $env(SHELL)
    match_max 100000
    send -- "ssh -t $user@shell.sourceforge.net create"
    expect -exact "ssh -t $user@shell.sourceforge.net \r create"
    send -- "\r"
    expect -exact "$remoteprompt\$ "
    send -- "cd /home/git/p/$project\n"
    expect -exact "$remoteprompt\$ "
    send -- "cd git-main.git\n"
    expect -exact "$remoteprompt\$ "
    send -- "rm -fr *\n"
    expect -exact "$remoteprompt\$ "
    send -- "git init --bare .\n"
    expect -exact "$remoteprompt\$ "

    After re-initializing, you should be able to run git push to push the new history up to the public repo.

  5. From your modified local repo, try

         git push --mirror --force

    to push the new history up to the public repo.

  6. Inform the public repo’s users that it is available and remind them that they will need to re-clone it.

On GitLab, you can get a similar effect by unprotecting all branches and doing a git push --force to unconditionally overwrite the public history. It is good practice to re-protect the branches afterwards.

5.8. Step Five: Client Tools

Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.

Windows users accustomed to working through TortoiseSVN can move to TortoiseGit.

If you have developers attached to the CVS interface, it is possible (and in fact relatively easy) to set up a gateway interface that lets them continue using their CVS client tools. Consult the documentation for git-cvsserver.

5.9. Step Six: Good Practice

Educate your developers in the following good practices:

5.9.1. Commit references

The combination of a committer email address with an ISO 8601 timestamp is a good way to refer to a commit without being VCS-specific. Thus, instead of "commit 304a53c2", use "<2011-10-25T15:11:09Z!fred@foonly.com>". It is recommended that you not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T' or '!'. Making these cookies uniform and machine-parseable will have good consequences for future repository-browsing tools. The reference-lifting code in reposurgeon generates them.
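
To show how mechanical this format is, here is a minimal Python sketch that builds such a reference from a timestamp and an email address and recognizes one in running text. It is an illustration only, not reposurgeon’s own code; the names action_stamp and STAMP_RE are hypothetical.

```python
import re
from datetime import datetime, timezone

def action_stamp(when, email):
    """Format a commit reference as <ISO-8601-UTC-timestamp!email>."""
    utc = when.astimezone(timezone.utc)
    return "<" + utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email + ">"

# Recognizes exactly the form recommended above: 'T' separator, trailing 'Z', '!'.
STAMP_RE = re.compile(r"<(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)!([^>\s]+)>")

stamp = action_stamp(datetime(2011, 10, 25, 15, 11, 9, tzinfo=timezone.utc),
                     "fred@foonly.com")
print(stamp)  # <2011-10-25T15:11:09Z!fred@foonly.com>

found = STAMP_RE.search("Back out most of " + stamp + ".")
print(found.group(1), found.group(2))
```

Because the format never varies, a single regular expression is enough to find every such reference in a history.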

Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.

Sometimes it’s enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out Attempted divide-by-zero fix."

When appropriate, "my last commit" is simple and effective.

5.9.2. Comment format

As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.

Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "…" is conventional.

For best results, stay within 50 characters for summary lines and within 72 characters for detail lines. Don’t go over 80. The Linux kernel developers like to stay at 65 or under.

Good comment practice produces more readable output from git log and hg log, and makes it easy to take in whole sequences of changes at a glance.
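
These conventions are easy to check mechanically, for instance in a commit hook. The following Python sketch flags comments that break them; the function name check_comment and its default limits (50 and 72, as suggested above) are assumptions for illustration, not something git or hg provides.

```python
def check_comment(comment, summary_max=50, detail_max=72):
    """Return a list of complaints about a commit comment; empty if it conforms."""
    problems = []
    lines = comment.rstrip("\n").split("\n")
    if len(lines[0]) > summary_max:
        problems.append("summary line longer than %d characters" % summary_max)
    if len(lines) > 1 and lines[1] != "":
        problems.append("no blank separator line after the summary")
    for number, line in enumerate(lines[1:], start=2):
        if len(line) > detail_max:
            problems.append("line %d longer than %d characters" % (number, detail_max))
    return problems

good = "Fix divide-by-zero in parser.\n\nThe tokenizer could return an empty list.\n"
bad = "Fix divide-by-zero in parser.\nThe tokenizer could return an empty list.\n"
print(check_comment(good))  # []
print(check_comment(bad))   # ['no blank separator line after the summary']
```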

Consider adopting the guidelines at Conventional Commits.

6. Theory of Operation

6.1. The outside view

As the quick-start example shows, you’re typically going to do three steps when you use reposurgeon: (1) read in one (or more) repositories, (2) do surgical things on them, and (3) write out one (or more) repositories.

To keep reposurgeon simple and flexible, it normally does not do its own repository reading and writing. Instead, it relies on being able to parse and emit the command streams created by git-fast-export and read by git-fast-import. This means that it can be used on any version-control system that has both fast-export and fast-import utilities. The git-fast-import stream format also implicitly defines a common language of primitive operations for reposurgeon to speak.

In order to deal with version-control systems that do not have fast-export equivalents, reposurgeon can also host extractor code that reads repositories directly. For each version-control system supported through an extractor, reposurgeon uses a small amount of knowledge about the system’s command-line tools to (in effect) replay repository history into an input stream internally. Repositories under systems supported through extractors can be read by reposurgeon, but not modified by it. In particular, reposurgeon can be used to move a repository history from any VCS supported by an extractor to any VCS supported by a normal importer/exporter pair.

Mercurial repository reading is implemented with an extractor class; writing is handled with the "hg-git-fast-import" command. A test extractor exists for git, but is normally disabled in favor of the regular exporter.

Subversion is an important exception. Its exporter is ‘svnadmin dump’, which doesn’t ship a git-fast-import stream, but rather the unique dump format supported by Subversion. Reposurgeon contains an interpreter for this stream format.

As a matter of historical interest, some old versions of reposurgeon had the ability to build a Subversion repository on output by synthesizing a Subversion dump stream and feeding it to ‘svnadmin load’. This feature was a cute stunt, but was abandoned during translation to Go for a couple of reasons. Most importantly, there is zero demand for moving histories to Subversion - and supposing there were, moving content and metadata from git’s DAG representation to a Subversion stream is very lossy. Subversion to Git to Subversion wouldn’t even have round-tripped well.

6.2. The inside view

Between reads and writes, reposurgeon can usefully be thought of as a structure editor for directed acyclic graphs with a pre-defined set of attributes on their nodes.

To get a feel for what that graph is like, it’s helpful to have seen a git-fast-import stream file. Here is a trivial example from the reposurgeon test suite, describing a history with two commits to a single file:

blob
mark :1
data 20
1234567890123456789

commit refs/heads/master
mark :2
committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000
data 14
First commit.
M 100644 :1 README

blob
mark :3
data 20
0123456789012345678

commit refs/heads/master
mark :4
committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000
data 15
Second commit.
from :2
M 100644 :3 README

A git-fast-import stream consists of a sequence of commands which must be executed in the specified sequence to build the repo; to avoid confusion with reposurgeon commands we will refer to the stream commands as events in this documentation. These events are implicitly numbered from 1 upwards. Most commands require specifying a selection of event sequence numbers so reposurgeon will know which events to modify or delete.

For all the details of event types and semantics, see the git-fast-import(1) manual page; the rest of this paragraph is a quick start for the impatient. The most prominent events in a stream are commits describing revision states of the repository; these group together under a single change comment one or more fileops (file operations), which usually point to blobs that are revision states of individual files. A fileop may also be a delete operation indicating that a specified previously-existing file was deleted as part of the commit; there are a couple of other special fileop types of lesser importance.
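
To make the notion of numbered events concrete, here is a minimal Python sketch that splits the two-commit example stream above into events and numbers them from 1. It is a toy, not reposurgeon’s parser: the name read_events is hypothetical, it understands only blob and commit events with counted "data" payloads, and reposurgeon’s own numbering may count further event types.

```python
def read_events(stream):
    """Split a very simple fast-import stream (bytes) into a list of events."""
    events = []
    pos = 0
    while pos < len(stream):
        end = stream.find(b"\n", pos)
        if end == -1:
            end = len(stream)
        line = stream[pos:end]
        pos = end + 1
        if line == b"blob":
            events.append({"type": "blob", "fileops": []})
        elif line.startswith(b"commit "):
            events.append({"type": "commit", "branch": line[7:].decode(), "fileops": []})
        elif line.startswith(b"mark ") and events:
            events[-1]["mark"] = line[5:].decode()
        elif line.startswith(b"data ") and events:
            # The payload is counted in bytes, so skip it rather than parse it.
            count = int(line[5:])
            events[-1]["data"] = stream[pos:pos + count]
            pos += count
        elif line[:2] in (b"M ", b"D ") and events:
            events[-1]["fileops"].append(line.decode())
    return events

STREAM = b"""blob
mark :1
data 20
1234567890123456789

commit refs/heads/master
mark :2
committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000
data 14
First commit.
M 100644 :1 README

blob
mark :3
data 20
0123456789012345678

commit refs/heads/master
mark :4
committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000
data 15
Second commit.
from :2
M 100644 :3 README
"""

events = read_events(STREAM)
for number, event in enumerate(events, start=1):
    print(number, event["type"], event["mark"], event["fileops"])
```

The output lists four events: blob :1, commit :2, blob :3 and commit :4, each commit carrying one M fileop that points at the preceding blob.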

Reposurgeon’s internal representation of a repository history is basically a deserialized git fast-import stream. A few extra attributes are supported; most notably, commits and resets have a legacy-id attribute that carries over the object’s ID from whatever version-control system exported the stream, in particular a Subversion or CVS revision number.

6.3. The interpreter view

The program can be run in one of two modes, either as an interactive command interpreter or in batch mode to execute commands given as arguments on the reposurgeon invocation line.

The only differences between these modes are (1) the interactive one begins by turning on the ‘interactive’ option, (2) in batch mode all errors (including normally-recoverable errors in selection-set syntax) are fatal, and (3) each command-line argument beginning with ‘--’ has that stripped off (which in particular means that --help and --version will work as expected).

Also, in interactive mode, Ctrl-P and Ctrl-N will be available to scroll through your command history, and tab completion of command keywords, options, and arguments (wherever that makes semantic sense) is available. Entering a tab just after the command keyword will show you subcommands and options.

Commands that modify the history don’t normally generate summaries of what bits they have changed. Instead there is a general mechanism called "Q bits" that helps you find and examine recently-modified things. See Q bits for details.

It is expected that interactive mode will be used mainly for exploring repository metadata, while conversion experiments will be captured in a script that is gradually improved until the day final cutover can be performed and the old repository decommissioned.

Note that this means the old repository can be left in service while the conversion recipe is under development. Recipe development should be treated as a serious project with its own change tracking.

7. The Command Interpreter

7.1. Command syntax

To give you the flavor of the command language, here are some simple examples:

29..71 list             ;; list summary index of events 29..71.

236..$ list             ;; List events from 236 to the last.

<#523> list inspect     ;; Look for commit #523; they are numbered
                        ;; 1-origin from the beginning of the
                        ;; repository.

<2317> list inspect     ;; Look for a tag with the name 2317, a tip
                        ;; commit of a branch named 2317, or a commit
                        ;; with legacy ID 2317. Inspect what is found.
                        ;; A plain number is probably a legacy ID
                        ;; inherited from a Subversion revision
                        ;; number.

/regression/ list       ;; list all commits and tags with comments or
                        ;; committer headers or author headers
                        ;; containing the string "regression".

1..:97 & =T delete      ;; delete tags from event 1 to mark 97.

[Makefile] list inspect ;; Inspect all commits with a file op touching
                        ;; Makefile and all blobs referred to in a
                        ;; fileop touching Makefile.

@dsc(:55) list          ;; Display all commits with ancestry tracing
                        ;; to :55.

@min([.gitignore]&=C) remove .gitignore
                        ;; Remove the first .gitignore fileop in the
                        ;; repo.  In a Subversion lift this contains
                        ;; patterns corresponding to Subversion default
                        ;; ignores.

Each command description begins with a syntax summary. Mandatory parts are bare or in {}, optional in [], and … says the element just before it may be repeated. Parts separated by | are alternatives. Parts in ALL-CAPS are expected to be filled in by the user.

Commands are distinguished by a command keyword. Most take a selection set immediately before it; see "help selection" for details. Some commands have a following subcommand keyword.

Many commands take additional arguments after the command (and subcommand, if present). Arguments can be either bare tokens or string literals enclosed by double quotes; the latter is in case you need to embed whitespace in a pathname, regular expression, or text string.

Some commands support option flags. These are led with a --, so if there is an option flag named "foo" you would write it as "--foo". Option flags can be anywhere on the line. The order of option flags is never significant. When an option flag "foo" sets a value, the syntax is --foo=xxx with no spaces around the equal sign. The value part may be a double-quoted string containing whitespace.

The embedded help for some commands tells you that they interpret C/Go style backslash escapes like \n in arguments. Interpretation uses Go’s Quote/Unquote codec from the strconv library. In such arguments you can, for example, get around having to include a literal # in an argument by writing "\x23".

Some commands take following arguments that are regular expressions. In this context, they still require start and end delimiters as they do when used in a selection prefix, but if you need to have a / in the expression the delimiters can be any punctuation character other than an ASCII single quote. As a reminder, these are described in the embedded help as delimited regular expressions.

Following-argument regular expressions may not contain whitespace; if you need to specify whitespace or a non-printable character use one of the escapes that Go regular expression syntax allows, such as \s or \t.

A command argument with a name containing PATTERN may be either a delimited regular expression or a literal string; if it is not recognized as the former it will be treated as the latter. If the delimited regular expression starts and ends with ASCII single quotes, those will be stripped off and the result treated as a literal string.

Command lines beginning with "#" are treated as comments and ignored. If a command line has a trailing portion that begins with one or more whitespace characters followed by "#" and is not inside a string, that trailing portion is ignored.
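
The comment rule can be sketched in a few lines of Python. This is an illustration of the rule as stated, not reposurgeon’s implementation; the function name strip_comment is hypothetical, and the sketch assumes strings are delimited by plain double quotes with no backslash-escaped quotes inside them.

```python
def strip_comment(line):
    """Remove a whole-line or trailing '#' comment from a command line."""
    if line.startswith("#"):
        return ""
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string
        elif ch == "#" and not in_string and i > 0 and line[i - 1].isspace():
            # Whitespace followed by '#' outside a string starts a comment.
            return line[:i].rstrip()
    return line

print(strip_comment("# a whole-line comment"))         # prints an empty line
print(strip_comment("=Z list   # review empties"))     # =Z list
print(strip_comment("<#523> list inspect"))            # <#523> list inspect
```

Note that the "#" in a selection like <#523> survives, because it is not preceded by whitespace.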

When a command changes repository state, it will usually so indicate in a response.

7.2. Finding your way around

Help is always available.

help [COMMAND]

Follow with space and a command name to show help for the command.

Without an argument, list help topics.

"?" is a shortcut synonym for "help".

If required, and $PAGER is set, help items long enough to need it will be fed to that pager for display.

Command history is always available.

history

Dump your command list from this session so far.

You can do Ctrl-P or up-arrow to scroll back through the command history list, and Ctrl-N or down-arrow to scroll forward in it. Tab-completion on command keywords is available in combination with these commands.

You don’t need to exit the interpreter to run quick shell commands.

shell [COMMAND-TEXT]

Run a shell command. Honors the $SHELL environment variable.

"!" is a shortcut for this command.

Here’s how to bail out:

quit

Terminate reposurgeon cleanly.

Typing EOT (usually Ctrl-D) is a shortcut for this.

7.3. Regular expressions

Regular expressions are an important building block of the command language, used both in event selections and various command arguments.

The pattern expressions used in event selections and various commands (attribute, changelogs, delete, filter, list, move, msgout, rename) are either literal strings or use the regular-expression syntax of the Go language.

Patterns intended to be interpreted as regular expressions are normally wrapped in slashes (e.g. /foobar/ matches any text containing the string "foobar"), but any punctuation character other than single quote will work as a delimiter in place of the /; this makes it easier to use an actual / in patterns.

In this case matching is unanchored - any match to a substring of the search space succeeds. You can use ^ and $ to anchor a regular expression to the beginning or end of the search space.

Matched single quote delimiters mean the literal should be interpreted as plain text, suppressing interpretation of regexp special characters and requiring an anchored, entire match. The pattern is also interpreted as a literal string requiring an anchored, entire match if the start and end character are different.
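
The delimiter rules above can be sketched in a few lines of Python. This is only an illustration, not reposurgeon's parser: the function names classify_pattern and pattern_matches are hypothetical, Python's re module stands in for Go's regular-expression engine (the two syntaxes differ in places), and treating a token whose first and last characters are the same non-punctuation character as a literal is an assumption, since the text above does not cover that case.

```python
import re
import string


def classify_pattern(token):
    """Return ("regexp", body) or ("literal", body) for a pattern token.

    Hypothetical helper illustrating the documented delimiter rules.
    """
    if len(token) >= 2 and token[0] == token[-1]:
        delim = token[0]
        if delim == "'":
            # Matched single quotes: plain text, anchored entire match.
            return ("literal", token[1:-1])
        if delim in string.punctuation:
            # Any other matched punctuation character delimits a regexp.
            return ("regexp", token[1:-1])
    # Differing start and end characters (or, by assumption, matching
    # non-punctuation ones): the whole token is a literal.
    return ("literal", token)


def pattern_matches(token, text):
    """Apply a pattern token to text using the documented anchoring rules."""
    kind, body = classify_pattern(token)
    if kind == "regexp":
        # Regexp matching is unanchored: any substring match succeeds.
        return re.search(body, text) is not None
    # Literals require an anchored, entire match.
    return body == text
```

Under these assumptions /a.c/ matches "abc" because the dot is a wildcard, while 'a.c' does not because the single quotes suppress regexp interpretation and demand an exact match.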

When interpreting a pattern expression after the command verb, string double quotes are stripped off first and so do not affect whether it is interpreted as a regexp or as a literal string. However, such a double-quoted string may contain whitespace and still be interpreted as a single argument.

Pattern expressions following the command verb may not contain literal whitespace unless string-quoted; use \s or \t if you need to, or string-quote the expression. Event-selection regexps (before the command) may contain literal whitespace.

Some commands support regular expression flags, and some even add additional flags over the standard set. The documentation for each individual command will include these details.

7.4. Selection syntax

Commands to reposurgeon consist of a command keyword, usually preceded by a selection set, sometimes followed by whitespace-separated arguments. It is often possible to omit the selection-set argument and have it default to something reasonable. For commands that are considered safe (no side effects) the default is all events; for risky commands the default is no events.

A selection set is ordered; that is, any given element may occur only once, and the set is ordered by when its members were first added.

The selection-set specification syntax is an expression-oriented mini-language. The most basic term in this language is a location. The following sorts of primitive locations are supported:

event numbers

A plain numeric literal is interpreted as a 1-origin event-sequence number. It is not expected that you will have to use this feature often.

marks

A numeric literal preceded by a colon is interpreted as a mark; see the import stream format documentation for explanation of the semantics of marks.

tag and branch names

The basename of a branch (including branches in the refs/tags namespace) refers to its tip commit. The name of a tag is equivalent to its mark (that of the tag itself, not the commit it refers to). Tag and branch locations are bracketed with < > (angle brackets) to distinguish them from command keywords.

legacy IDs

If the content of name brackets (< >) does not match a tag or branch name, the interpreter next searches legacy IDs of commits. This is especially useful when you have imported a Subversion dump; it means that commits made from it can be referred to by their corresponding Subversion revision numbers.

commit numbers

A numeric literal within name brackets (< >) preceded by # is interpreted as a 1-origin commit-sequence number.

reset targets

If the previous ways of interpreting a name within brackets don’t resolve, the name is checked to see if it matches a reset. If so, the expression resolves to the commit the reset is attached to.

reset@ names

A name with the prefix ‘reset@’ refers to the latest reset with a basename matching the part after the @. Usually there is only one such reset.

$

Refers to the last event.

These may be grouped into sets in the following ways:

ranges

A range is two locations separated by ‘..’, and is the set of events beginning at the left-hand location and ending at the right-hand location (inclusive).

lists

Comma-separated lists of locations and ranges are accepted, with the obvious meaning.

There are some other ways to construct event sets:

visibility sets

A visibility set is an expression specifying a set of event types. It will consist of a leading equal sign, followed by type letters. These are the type letters:

B

blobs

Most default selection sets exclude blobs; they have to be manipulated through the commits they are attached to.

C

commits

D

all-delete commits

These are artifacts produced by some older repository-conversion tools.

E

earliest commit on each branch

F

all fork (multi-child) commits

H

head (branch tip) commits

I

all commits not decodable to UTF-8

J

all commits with non-ASCII (possible ISO 8859) characters

L

commits with unclean multi-line comments

For example, comments lacking a separating empty line after the first line.

M

merge (multi-parent) commits

N

Legacy IDs

Any comment matching a cookie (legacy-ID) format

O

orphaned (parentless) commits

P

passthroughs

All event types simply passed through, including comments, progress commands, and checkpoint commands

Q

Recently touched

Set/cleared by many commands

R

all resets

T

all tags

U

commits with callout parents

X

commits in a bisection triple: earliest, middle, and latest

Z

commits with no fileops

references

A reference name (bracketed by angle brackets) resolves to a single object, either a commit or tag.

type            interpretation
tag name        annotated tag with that name
branch name     the branch tip commit
legacy ID       commit with that legacy ID
assigned name   name equated to a selection by assign

Note that if an annotated tag and a branch have the same name foo, <foo> will resolve to the tag rather than the branch tip commit.
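
The lookup order for a bracketed name can be pictured as a chain of table lookups. The Python sketch below is an illustration only: resolve_reference and its four dictionary arguments are hypothetical, and it leaves out commit numbers (the # form) and assigned names, whose place in the order is not spelled out here.

```python
def resolve_reference(name, tags, branch_tips, legacy_ids, resets):
    """Return (kind, event) for the first table that knows name, else None.

    Each argument after name is a dict mapping a name to an event number.
    The order tag, branch, legacy ID, reset follows the description above;
    the function itself is a hypothetical illustration.
    """
    for kind, table in (
        ("tag", tags),
        ("branch", branch_tips),
        ("legacy ID", legacy_ids),
        ("reset", resets),
    ):
        if name in table:
            return (kind, table[name])
    return None
```

Because the tag table is consulted first, a name shared by an annotated tag and a branch resolves to the tag, as the note above says.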

dates and action stamps

A date or action stamp in angle brackets resolves to a selection set of all matching commits.

type                                   interpretation
RFC3339 timestamp                      commits or tags with that time/date
action stamp <timestamp!email>         commits or tags with that timestamp and author (or committer if no author); aliases of the author are also accepted
yyyy-mm-dd part of RFC3339 timestamp   all commits and tags with that date

To refine the match to a single commit, use a 1-origin index suffix separated by #. Thus <2000-02-06T09:35:10Z> can match multiple commits, but <2000-02-06T09:35:10Z#2> matches only the second in the set (by position in the stream file).
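
As a rough model of how these forms narrow a match, the Python sketch below resolves a bracketed date or action stamp against a list of (timestamp, email) pairs held in stream order. It is a simplification under stated assumptions: resolve_stamp is a hypothetical name, the email must match exactly (author aliases are ignored), and timestamps are compared as strings with no time-zone normalization.

```python
def resolve_stamp(ref, events):
    """Return 1-origin positions of events matching a bracketed stamp.

    events is a list of (timestamp, email) pairs in stream order.
    Hypothetical illustration; aliases and time zones are not handled.
    """
    body = ref.strip("<>")
    body, _, index = body.partition("#")   # optional 1-origin index suffix
    when, _, who = body.partition("!")     # optional email part
    matches = []
    for pos, (timestamp, email) in enumerate(events, start=1):
        if "T" in when:
            same_time = timestamp == when            # full timestamp
        else:
            same_time = timestamp.startswith(when)   # yyyy-mm-dd only
        if same_time and (not who or who == email):
            matches.append(pos)
    if index:
        n = int(index)
        return matches[n - 1:n]   # empty list if the index is out of range
    return matches
```

With two events sharing the timestamp 2000-02-06T09:35:10Z, the bare timestamp selects both and the #2 suffix selects only the later one in the stream.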

text search

A text search expression is a regular expression surrounded by forward slashes (to embed a forward slash in it, use a C-like string escape such as \x2f).

A text search normally matches against the comment fields of commits and annotated tags, or against their author/committer names, or against the names of tags; also the text of passthrough objects.

The scope of a text search can be changed with qualifier letters after the trailing slash. These are as follows:

letter   interpretation
a        author name in commit
b        branch name in commit; also matches blobs referenced by commits on matching branches, and tags which point to commits on matching branches
c        comment text of commit or tag
r        committish reference in tag or reset
p        text in passthrough
t        tagger in tag
n        name of tag
B        blob content

Multiple qualifier letters can add more search scopes.

(The "b" qualifier replaces the branch-set syntax in earlier versions of reposurgeon.)

paths

A "path expression" enclosed in square brackets resolves to the set of all commits and blobs related to a path matching the given expression. The path expression itself is either a path literal or a regular expression surrounded by slashes. Immediately after the trailing / of a path regexp you can put any number of the following characters which act as flags: ‘a’, ‘c’, ‘D’, ‘M’, ‘R’, ‘C’, ‘N’.

If the first character in a path expression is ‘~’, the path expression is negated; that is, it is evaluated for "not matching" rather than "matching".

By default, a path is related to a commit if the latter has a fileop that touches that file path - modifies that change it, deletes that remove it, renames and copies that have it as a source or target. When the ‘c’ flag is in use the meaning changes: the paths related to a commit become all paths that would be present in a checkout for that commit.

A path literal matches a commit if and only if the path literal is exactly one of the paths related to the commit (no prefix or suffix operation is done). In particular a path literal won’t match if it corresponds to a directory in the chosen repository.

A regular expression matches a commit if it matches any path related to the commit anywhere in the path. You can use ^ or $ if you want the expression to only match at the beginning or end of paths. When the ‘a’ flag is in use, the path expression selects commits whose every path matches the regular expression. This is not necessarily a subset of commits selected without the ‘a’ flag because it also selects commits with no related paths (e.g. empty commits, deletealls and commits with empty trees). If you want to avoid those, you can use e.g. ‘[/regexp/] & [/regexp/a]’.
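
The difference between the default "any path" test and the ‘a’ flag's "every path" test, including the empty-commit corner case, can be modeled in Python. This is a sketch, not reposurgeon code: select_paths and the commit dictionary are hypothetical, and Python's re module stands in for Go's regexp engine.

```python
import re


def select_paths(commits, regexp, all_flag=False):
    """Return names of commits whose related paths match regexp.

    commits maps a commit name to its list of related paths. Without
    all_flag one matching path is enough; with it every path must match,
    which is vacuously true for a commit with no paths at all.
    """
    pattern = re.compile(regexp)
    test = all if all_flag else any
    return [name for name, paths in commits.items()
            if test(pattern.search(p) for p in paths)]


commits = {
    "c1": ["src/main.c", "src/util.c"],
    "c2": ["src/main.c", "README"],
    "c3": [],                      # an empty commit
    "c4": ["README"],
}
any_match = select_paths(commits, r"^src/")
all_match = select_paths(commits, r"^src/", all_flag=True)
# Analogue of '[/regexp/] & [/regexp/a]': drop the vacuous matches.
both = [name for name in any_match if name in all_match]
```

Here the empty commit c3 appears in all_match but not in any_match, which is why intersecting the two selections is the way to exclude it.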

The flags ‘D’, ‘M’, ‘R’, ‘C’, ‘N’ restrict match checking to the corresponding fileop types. Note that this means an ‘a’ match is easier (not harder) to achieve. These are no-ops when used with ‘c’.

A path literal or regular expression matches a blob if it matches any path that appeared in a modification fileop that referred to that blob. To select purely matching blobs or matching commits, compose a path expression with =B or =C.

If you need to embed ‘[^/]’ into your regular expression (e.g. to express "all characters but a slash") you can use a C-like string escape such as \x2f.

A quick example-centered reference for selection-set syntax.

First, these ways of constructing singleton sets:

123        event numbered 123 (1-origin)
:345       event with mark 345
<456>      commit with legacy-ID 456 (probably a Subversion revision)
<foo>      the tag named 'foo', or failing that the tip commit of branch foo

You can select commits and tags by date, or by date and committer:

<2011-05-25>                  all commits and tags with this date
<2011-05-25!esr>              all with this date and committer
<2011-05-25T07:30:37Z>        all commits and tags with this date and time
<2011-05-25T07:30:37Z!esr>    all with this date and time and committer
<2011-05-25T07:30:37Z!esr#2>  event #2 (1-origin) in the above set

More ways to construct event sets:

/foo/      all commits and tags containing the string 'foo' in text or metadata
           suffix letters: a=author, b=branch, c=comment in commit or tag,
                           C=committer, r=committish, p=text, t=tagger, n=name,
                           B=blob content in blobs.
           A 'b' search also finds blobs and tags attached to commits on
           matching branches.
[foo]      all commits and blobs touching the file named 'foo'.
[~foo]     all commits and blobs other than those for the file named 'foo'.
[/bar/]    all commits and blobs touching a file matching the regexp 'bar'.
           Suffix flags: a=all fileops must match other selectors, not just
           any one; c=match against checkout paths, DMRCN=match only against
           given fileop types (no-op when used with 'c').
[~/bar/]   all commits and blobs touching any file not matching bar
=B         all blobs
=C         all commits
=D         all commits in which every fileop is a D or deleteall
=E         all branch root commits (earliest on branch)
=F         all fork (multiple-child) commits
=H         all head (childless branch tip) commits
=I         all commits with metadata not decodable to UTF-8
=J         all commits with non-ASCII (possible ISO 8859) characters
=L         all commits with unclean multi-line comments
=M         all merge commits
=N         all commits and tags matching a cookie (legacy-ID) format.
=O         all orphan (parentless) commits
=P         all passthroughs
=Q         all events marked with the "recently touched" bit.
=R         all resets
=T         all tags
=U         all commits with callouts as parents
=X         a bisection set: the earliest commit, the latest, and the middle
=Z         all commits with no fileops

@min()     create singleton set of the least element in the argument
@max()     create singleton set of the greatest element in the argument

Other special functions are available: do 'help functions' for more.

You can compose sets as follows:

:123,<foo>     the event marked 123 and the event referenced by 'foo'.
:123..<foo>    the range of events from mark 123 to the reference 'foo'

Selection sets are ordered: elements remain in the order they were added, unless sorted by the ? suffix.

Sets can be composed with | (union) and & (intersection). | has lower precedence than &, but set expressions can be grouped with ( ). Postfixing a ? to a selection expression widens it to include all immediate neighbors of the selection and sorts it; you can do this repeatedly for effect. Do set negation with prefix ~; it has higher precedence than & | but lower than ?.

The selection-expression language has named special functions. The syntax for a named function is "@" followed by a function name, followed by an argument in parentheses. Presently the following functions are defined:

@min()

create singleton set of the least element in the argument

@max()

create singleton set of the greatest element in the argument

@amp()

a nonempty selection set becomes all events; an empty set is returned unchanged

@par()

all parents of commits in the argument set

@chn()

all children of commits in the argument set

@dsc()

all commits descended from the argument set (argument set included)

@anc()

all commits ancestral to the argument set (argument set included)

@pre()

events before the argument set

@suc()

events after the argument set

@srt()

sort the argument set by event number.

@rev()

reverse the selection set

Set expressions may be combined with the operators "|" and "&" which are, respectively, set union and intersection. The "|" has lower precedence than intersection, but you may use parentheses "(" and ")" to group expressions in case there is ambiguity.

Any set operation may be followed by "?" to add the set members' neighbors and referents. This extends the set to include the parents and children of all commits in the set, and the referents of any tags and resets in the set. Each blob reference in the set is replaced by all commit events that refer to it. The "?" can be repeated to extend the neighborhood depth. The result of a "?" extension is sorted so the result is in ascending order.

Do set negation with prefix "~"; it has higher precedence than "&" and "|" but lower than "?".
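
To see how the precedence rules play out, here is a small Python model of ordered selection sets. It is an illustration under assumptions, not reposurgeon's evaluator: the Selection class is hypothetical, the element order chosen for union and intersection is one plausible reading of "ordered by when its members were first added", and the "?" suffix is not modeled. Conveniently, Python's own ~, & and | operators have the same relative precedence as the one described above.

```python
class Selection:
    """An ordered set of event numbers drawn from a fixed universe."""

    def __init__(self, items, universe):
        self.universe = list(universe)
        # Drop repeats while keeping first-seen order.
        self.items = list(dict.fromkeys(items))

    def __or__(self, other):
        # Union: left operand's members first, then new ones from the right.
        return Selection(self.items + other.items, self.universe)

    def __and__(self, other):
        # Intersection: members of the left operand also in the right.
        return Selection([e for e in self.items if e in other.items],
                         self.universe)

    def __invert__(self):
        # Negation: every event in the universe that is not selected.
        return Selection([e for e in self.universe if e not in self.items],
                         self.universe)


events = range(1, 7)
a = Selection([1, 2, 3], events)
b = Selection([3, 4], events)
c = Selection([2, 3, 5], events)

loose = a | b & c        # parsed as a | (b & c)
grouped = (a | b) & c    # parentheses override the default grouping
negated = ~a & c         # parsed as (~a) & c
```

The first two expressions differ only in grouping yet select different events, which is the situation where explicit parentheses are worth the keystrokes.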

7.5. Redirection and shell-like features

All commands that expect data to be presented on standard input support input redirection. You may write "<myfile" to take input from the file named "myfile". Input redirections can be anywhere on the line.

Most commands that normally ship data to standard output accept output redirection. As in the shell, you can write ">outfile" to send the command output to "outfile", and ">>outfile2" to append to outfile2. Output redirections can be anywhere on the line.

There must be whitespace before the "<"/">"/">>" so that the command parser won’t falsely match uses of these characters in regular expressions.

Commands that support output redirection can also be followed by a pipe bar and a normal Unix command. For example, "list | more" directs the output of a list command to more(1). Some whitespace around the pipe bar is required to distinguish it from uses of the same character as the alternation operator in regular expressions.

The command line following the first pipe bar, if present, is passed to a shell and may contain a general shell command line, including more pipe bars. The SHELL environment variable can set the shell, falling back to /bin/sh.

Beware that while the reposurgeon CLI mimics these simple shell features, many things you can do in a real shell won’t work except on the right-hand side of a pipe bar, if there is one.

You can’t redirect standard error (but see the "log" command for a rough equivalent). And you can’t pipe input from a shell command.

In general you should avoid trying to get cute with the redirection features. The command-line parser is primitive and easily confused.

7.6. Q bits

Each event has an associated "Q" bit that says whether it was touched by the last Q-aware command. Almost all commands that modify the history set Q bits in some interesting way; so do some report generators, such as the "lint" command.

Thus, "=Q count" and "=Q list" can provide you with a check on the scope of modifications while you are trying things out. Or you can use "=Q list" and "=Q list inspect" to zero in on anomalies reported by lint.

8. Import and Export

reposurgeon can hold multiple repository states in memory. Each has a name. At any given time, one may be selected for editing. Commands in this group import repositories, export them, and manipulate the in-core list and the selection.

If you are planning a conversion from Subversion, you should probably read Working with Subversion (svn) after this section.

If you are planning a conversion from Mercurial, you should probably read Working with Mercurial (hg) after this section.

8.1. Reading and writing repositories

read [--quiet] [<INFILE | - | DIRECTORY]

A read command with no arguments is treated as 'read .', operating on the current directory.

With a directory-name argument, this command attempts to read in the contents of a repository in any supported version-control system under that directory.

If input is redirected from a plain file, it will be read in as an import stream (fast-import or Subversion dump), whichever it is.

With an argument of '-', this command reads an import stream from standard input (this will be useful in filters constructed with command-line arguments).

If the content is a fast-import stream, any "cvs-revision" property on a commit is taken to be a newline-separated list of CVS revision cookies pointing to the commit, and used for reference lifting.

If the content is a fast-import stream, any "legacy-id" property on a commit is taken to be a legacy ID token pointing to the commit, and used for reference-lifting.

If the read location is a git repository and contains a .git/cvsauthors file (such as is left in place by "git cvsimport -A") that file will be read in as if it had been given to the "authors read" command.

If the read location is a directory, and its repository subdirectory has a file named legacy-map, that file will be read as though passed to a "legacy read" command.

The just-read-in repo is added to the list of loaded repositories and becomes the current one, selected for surgery. If it was read from a plain file and the file name ends with one of the extensions ".fi" or ".svn", that extension is removed from the load list name.

The "--quiet" option suppresses warnings from the front end used to read in the repository, notably the warning from the CVS reader about missing commit-ids. It’s best to not use this for early testing, adding it only when you’re sure you have a clean read.

This command has a few additional options specific to reading Subversion repositories and stream files; they are described in the manual section on working with Subversion.

[SELECTION] write [--legacy] [--noincremental] [--callout] [>OUTFILE|-|DIRECTORY]

Dump selected events as a fast-import stream representing the edited repository; the default selection set is all events. The dump goes to standard output if there is no argument or the argument is "-", or to the target of an output redirect.

Alternatively, if there is no redirect and the argument names a directory, the repository is rebuilt into that directory, with any selection set being ignored; if that target directory is nonempty its contents are backed up to a save directory.

With the "--legacy" option, the Legacy-ID of each commit is appended to its commit comment at write time. This option is mainly useful for debugging conversion edge cases.

If you specify a partial selection set such that some commits are included but their parents are not, the output will include incremental dump cookies for each branch with an origin outside the selection set, just before the first reference to that branch in a commit. An incremental dump cookie looks like "refs/heads/foo^0" and is a clue to export-stream loaders that the branch should be glued to the tip of a pre-existing branch of the same name. The "--noincremental" option suppresses this behavior.

Specifying a partial selection set, including a commit object, forces the inclusion of every blob to which it refers and every tag that refers to it.

Specifying a partial selection may cause a situation in which some parent marks in merges don’t correspond to commits present in the dump. When this happens and the "--callout" option was specified, the write code replaces the merge mark with a callout, the action stamp of the parent commit; otherwise the parent mark is omitted. Importers will fail when reading a stream dump with callouts; it is intended to be used by the "graft" command.

Specifying a write selection set with gaps in it is allowed but unlikely to lead to good results if it is loaded by an importer.

Property extensions will be omitted from the output if the importer for the preferred repository type cannot digest them.

Note: to examine small groups of commits without the progress meter, use "list inspect".

8.2. Repository type preference

prefer [VCS-NAME]

Report or set (with argument) the preferred type of repository. With no arguments, describe capabilities of all supported systems. With an argument (which must be the name of a supported version-control system, and tab-completes in that list) this has two effects:

First, if there are multiple repositories in a directory you do a read on, reposurgeon will read the preferred one (otherwise it will complain that it can’t choose among them).

Secondly, this will change reposurgeon’s preferred type for output. This means that when you do a write to a directory, it will build a repo of the preferred type rather than its original type (if it had one).

If no preferred type has been explicitly selected, reading in a repository (but not a fast-import stream) will implicitly set reposurgeon’s preference to the type of that repository.

select [VCS-NAME]

Report (with no arguments) or select (with one argument) the current repository’s source type. This type is normally set at repository-read time, but may remain unset if the source was a stream file. The argument tab-completes using the list of supported systems.

The source type affects the recognition of legacy IDs by the =N visibility selector by controlling the regular expressions used to recognize them. If no preferred output type has been set, it may also control the output format of stream files made from the repository.

The source type is reliably set whenever a live repository is read, or when a Subversion stream is interpreted - but not necessarily by other stream files. Here’s how reposurgeon gathers hints from stream files:

  1. Streams generated by cvs-fast-export(1) using the "--reposurgeon" option are detected as CVS (or perhaps RCS). This is considered a strong hint.

  2. Streams generated by src(1) are identified as src, SCCS, or RCS. This is considered a strong hint.

  3. File basenames that match those used by known version-control systems for storing ignore patterns - e.g. .gitignore indicating Git, .hgignore indicating Mercurial, etc. - are considered weak hints.

  4. Certain magic $-headers in content blobs are considered weak hints. These are associated with SCCS, RCS, and CVS.

    The source type in a stream not from a live repository is set by the first strong hint or the last weak hint. Reposurgeon will issue warnings in the event it sees multiple conflicting strong hints.
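
The "first strong hint or last weak hint" rule is easy to state as code. The Python function below is a minimal sketch of that rule only: source_type_from_hints and the (strength, vcs) pair representation are hypothetical, and the warning about conflicting strong hints is not modeled.

```python
def source_type_from_hints(hints):
    """Pick a source type from hints seen while reading a stream.

    hints is a list of (strength, vcs) pairs in the order encountered,
    where strength is "strong" or "weak". Returns None if there are no
    hints. Hypothetical illustration of the rule stated above.
    """
    last_weak = None
    for strength, vcs in hints:
        if strength == "strong":
            return vcs        # the first strong hint decides the matter
        last_weak = vcs       # otherwise remember the latest weak hint
    return last_weak
```

Note that a strong hint wins even when weak hints come both before and after it in the stream.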

8.3. Rebuilds in place

reposurgeon can rebuild an altered repository in place. Untracked files are normally saved and restored when the contents of the new repository are checked out (but see the documentation of the ‘preserve’ command for a caveat).

rebuild [DIRECTORY]

Rebuild a repository from the state held by reposurgeon. This command does not take a selection set.

The single argument, if present, specifies the target directory in which to do the rebuild; if the repository read was from a repo directory (and not a git-import stream), it defaults to that directory. If the target directory is nonempty its contents are backed up to a save directory. Files and directories on the repository’s preservation list are copied back from the backup directory after repo rebuild. The default preserve list depends on the repository type, and can be displayed with the "preserve" command.

If reposurgeon has a nonempty legacy map, it will be written to a file named "legacy-map" in the repository subdirectory as though by a "legacy write" command. (This will normally be the case for Subversion and CVS conversions.)

8.4. Crash recovery

This section will become relevant only if reposurgeon or something underneath it in the software and hardware stack crashes while in the middle of writing out a repository, in particular if the target directory of the rebuild is your current directory.

The tool has two conflicting objectives. On the one hand, we never want to risk clobbering a pre-existing repo. On the other hand, we want to be able to run this tool in a directory with a repo and modify it in place.

We resolve this dilemma by playing a game of three-directory monte.

  1. First, we build the repo in a freshly-created staging directory. If your target directory is named /path/to/foo, the staging directory will be a peer named /path/to/foo-stageNNNN, where NNNN is a cookie derived from reposurgeon’s process ID.

  2. We then make an empty backup directory. This directory will be named /path/to/foo.~N~, where N is incremented so as not to conflict with any existing backup directories. reposurgeon never, under any circumstances, ever deletes a backup directory.

    So far, all operations are safe; the worst that can happen up to this point if the process gets interrupted is that the staging and backup directories get left behind.

  3. The critical region begins. We first move everything in the target directory to the backup directory.

  4. Then we move everything in the staging directory to the target.

  5. We finish off by restoring untracked files in the target directory from the backup directory. That ends the critical region.

During the critical region, all signals that can be ignored are ignored.
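
The sequence of moves can be sketched in Python as follows. This is a simplified model rather than reposurgeon's implementation: the function names are hypothetical, starting the backup counter at 1 is an assumption, the staging directory is taken as already built, and signal handling during the critical region is omitted.

```python
import os
import shutil


def next_backup_dir(target):
    """Return the first unused name of the form target.~N~ (N from 1)."""
    n = 1
    while os.path.exists(f"{target}.~{n}~"):
        n += 1
    return f"{target}.~{n}~"


def swap_in(target, staging, preserve):
    """Replace the contents of target with the contents of staging.

    preserve lists names, relative to target, to copy back from the
    backup afterwards. Returns the path of the backup directory.
    """
    backup = next_backup_dir(target)
    os.mkdir(backup)              # safe so far: nothing has been moved yet

    # Critical region begins: empty the target into the backup...
    for name in os.listdir(target):
        shutil.move(os.path.join(target, name), os.path.join(backup, name))
    # ...then fill the target from the staging directory...
    for name in os.listdir(staging):
        shutil.move(os.path.join(staging, name), os.path.join(target, name))
    # ...and finally copy preserved files back from the backup.
    for name in preserve:
        src = os.path.join(backup, name)
        dst = os.path.join(target, name)
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        elif os.path.exists(src):
            shutil.copy2(src, dst)
    # Critical region ends. The backup directory is never deleted.
    return backup
```

If the process dies between the first and second loops, the old contents are still intact in the backup directory and the new ones in the staging directory, so recovery is a matter of moving files by hand.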

8.5. File preservation

When the repository type you are working with has a "lister" method, it can tell which files in a repository directory are not checked in and will copy them into the edited repository made by a rebuild.

The following commands are required only if there is no lister method and you have to set preservations by hand. Under systems with such a method (which include git and hg), all files that are neither beneath the repository dot directory nor under reposurgeon temporary directories are preserved automatically.

preserve [PATH…​]

Add (presumably untracked) files or directories to the repo’s list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. Pathname arguments may be bare tokens or double-quoted strings, which may contain whitespace; the double quotes are stripped before interpretation. The current preserve list is displayed afterwards.

This command is included for completeness, but most version-control systems (and all those that reposurgeon can rebuild) have a path-list command, which makes it unnecessary. The path-list command is used with a sweep for all files existing in the repository directory to identify everything that should be preserved.

unpreserve [PATH…​]

Remove (presumably untracked) files or directories from the repo’s list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. Pathname arguments may be bare tokens or double-quoted strings, which may contain whitespace; the double quotes are stripped before interpretation. The current preserve list is displayed afterwards.

See the documentation of the "preserve" command for why this command is almost never necessary.

8.6. Incorporating data

When converting a legacy repository, it sometimes happens that there are archived releases of the project surviving from before the date of the repository’s initial commit. It may be desirable to insert those releases at the front of the repository history. To do this, use this command:

  • No help on import

8.7. The repository list

Reposurgeon can have several repositories loaded at once. The following commands operate on the repository list.

choose [REPO-NAME]

Choose a named repo on which to operate. The name of a repo is normally the basename of the directory or file it was loaded from, but repos loaded from standard input are 'unnamed'. The program will add a disambiguating suffix if there have been multiple reads from the same source.

With no argument, lists the names of the currently stored repositories. The second column is '*' for the currently selected repository, '-' for others.

With an argument, the command tab-completes on the above list.

drop [REPO-NAME]

Drop a repo named by the argument from reposurgeon’s list, freeing the memory used for its metadata and deleting on-disk blobs. With no argument, drops the currently chosen repo. Tab-completes on the list of loaded repositories.

clone

Clone the in-memory representation of the selected repository. All metadata is copied. Any blobs on disk are shared until modified. The name of the clone gets the added suffix "clone". The clone is selected. Q bits in the clone are cleared.

Useful if you need to set up for expunge commands to partition a repository by cliques of filepaths.

The "rename repo" mode of the "rename" command can be used to rename a repository.

9. Information and reports

Commands in this group report information about the selected repository.

The output of these commands can individually be redirected to a named output file. Where indicated in the syntax, you can prefix the output filename with ‘>’ and give it as a following argument. If you use ‘>>’ the file is opened for append rather than write.

9.1. Reports on the DAG

[SELECTION] list [--decode=CODEC] [commits|index|inspect|manifest|names|paths|sizes|stamps|stats|tags] [PATTERN] [>OUTFILE]

Requires a loaded repository. Takes a selection set, defaulting to all events.

With "commits" or no subcommand, display selected commits in a human-friendly format; the first column is raw event numbers, the second a timestamp in UTC. If the repository has legacy IDs, they will be displayed in the third column. The leading portion of the comment follows.

With "index", display four columns of info on selected events: their number, their type, the associated mark (or '-' if no mark) and a summary field varying by type. For a branch or tag it’s the reference; for a commit it’s the commit branch; for a blob it’s a space-separated list of the repository path of the files with the blob as content.

With "inspect", dump a fast-import stream representing selected events to standard output. Just like a write, except (1) the progress meter is disabled, and (2) there is an identifying header before each event dump.

With "manifest", print commit path lists. Takes an optional pattern expression. For each selected commit, print the mapping of all paths in that commit tree to the corresponding blob marks, mirroring what files would be created in a checkout of the commit. If a regular expression PATTERN is given, only print "path → mark" lines for paths matching it. See "help regexp" for more information about regular expressions.

With "names", list all known symbolic names of branches, and of tags in the selection set. Tells you what things are legal within angle brackets and parentheses.

With "paths", list all paths touched by fileops on selected commits.

With "sizes", report on data volume per branch. The numbers tally the size of selected uncompressed blobs, commit and tag comments, and other metadata strings (a blob is counted each time a commit points at it). Not an exact measure of storage size: intended mainly as a way to get information on how to efficiently partition a repository that has become large enough to be unwieldy.

With "stamps", display full action stamps corresponding to selected commits. The stamp is followed by the first line of the commit message.

With "stats", report counts of selected objects.

With "tags", display selected tags of both kinds: annotated tags and resets in the tags namespace. The output has three fields: an event number, a type, and a name. Branch tip commits associated with tags are also displayed with the type field 'commit'.

With the --decode option, the CODEC argument must name one of the codecs known to the Go standard codecs library; see the documentation of the transcode command for details. Transcode the output to UTF-8 using the specified codec. Transcoding errors abort the command.

Any list command can be safely interrupted with ^C, returning you to the prompt.

[SELECTION] graph [>OUTFILE]

Emit a visualization of the commit graph in the DOT markup language used by the graphviz tool suite. This can be fed as input to the main graphviz rendering program dot(1), which will yield a viewable image.

Because graph supports output redirection, you can do this:

graph | dot -Tpng | display

You can substitute in your own preferred image viewer, of course.

view [repodir]

With an argument directory that is a live repository, browse the repository using whatever native GUI tool may be appropriate for the version-control system managing that repository.

Without an argument directory, build a live Git repository from the state of the currently selected repository to a temporary directory, then browse that with gitk; afterwards, delete the temporary directory. Because it requires a rebuild, this command can be laggy on large histories.

In both cases, timestamps are displayed in UTC - not local time - to match reposurgeon’s timestamp syntax.

[SELECTION] lint [--OPTION…​] [>OUTFILE]

Look for DAG and metadata configurations that may indicate a problem. Presently can check for: (1) mid-branch deletes, (2) disconnected commits, (3) parentless commits, (4) the existence of multiple roots, (5) committer and author IDs that don’t look well-formed as DVCS IDs, (6) multiple child links with identical branch labels descending from the same commit, (7) time and action-stamp collisions.

The options and output format of this command are unstable; they may change without notice as more sanity checks are added.

This command sets Q bits; true where a potential problem was reported, false otherwise.

Options to issue only partial reports are supported:

 --deletealls    --d     report mid-branch deletealls
 --connected     --c     report disconnected commits
 --roots         --r     report on multiple roots
 --attributions  --a     report on anomalies in usernames and attributions
 --uniqueness    --u     report on collisions among action stamps
 --cvsignores    --i     report if .cvsignore files are present

SELECTION count [>OUTFILE]

Report a count of items in the selection set. Default set is everything in the currently-selected repo.

9.2. Examining tree states

SELECTION checkout

Check out files for a specified commit into a directory. The selection set must resolve to a singleton commit.

SELECTION diff [>OUTFILE]

Display the difference between commits. Takes a selection-set argument which must resolve to exactly two commits.

10. Surgical Operations

These are the operations the rest of reposurgeon is designed to support.

10.1. Creations, deletions, and renames

[SELECTION] create {{repo|blob|tag|reset} NAME | blob NAME [<INFILE]}

With "repo", create an empty repository with a specified name in memory. The new repository becomes chosen. It has no events and no source type or preferred type. This command can be used to begin scripted creation of a repository from scratch with import, create blob, msgin --create, reset create, and tag create commands.

With "blob", create a blob with the specified mark name, which must not already exist. The new blob is inserted at the front of the repository event sequence, after options but before previously-existing blobs. The blob data is taken from standard input, which may be a redirect from a file or a here-doc. This command can be used with the add command to patch new data into a repository.

With "tag", creates an annotated tag. First argument is NAME, which must not be an existing tag. Takes a singleton selection set which must point to a commit; the default is the last commit, e.g. @max(=C). A tag event pointing to the commit is created and inserted just after the last tag in the repo (or just after the last commit if there are no tags). The tagger, committish, and comment fields are copied from the commit’s committer, mark, and comment fields. The timestamp is incremented by a second for uniqueness.

With "reset", requires a singleton selection which is the associated commit for the reset, takes as a first argument the name of the reset (which must not exist), and ends with the keyword create. In this case the name must be fully qualified, with a refs/heads/ or refs/tags/ prefix. Note: While this command is provided for the sake of completeness, think twice before actually using it. Normally a reset should only be deleted or renamed when its associated branch is, and the branch command does this.

When creating blobs, tags, or resets, all Q bits are cleared; then any objects created get their Q bit set.

[SELECTION] squash [POLICY…​]

Combine or delete commits in a selection set of events. The default selection set for this command is empty. Has no effect on events other than commits unless the --delete policy is selected; see the ‘delete’ command for discussion.

Normally, when a commit is squashed, its file operation list (and any associated blob references) gets either prepended to the beginning of the operation list of each of the commit’s children or appended to the operation list of each of the commit’s parents. Then children of a deleted commit get it removed from their parent set and its parents added to their parent set.

The analogous operation is performed on commit comments, so no comment text is ever outright discarded. Exception: comments consisting of “*** empty log message ***”, as generated by CVS, are ignored.

The default is to squash forward, modifying children; but see the list of policy modifiers below for how to change this.
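The forward-squash bookkeeping described above can be modelled in a few lines of Python. This is only a sketch of the documented behavior, not reposurgeon's implementation: the dictionary layout and the name squash_forward are assumptions made for illustration, and comment merging and fileop normalization are left out.

```python
def squash_forward(commits, victim):
    """Remove `victim`, pushing its fileops forward onto its children.

    `commits` maps a commit id to {"parents": [...], "ops": [...]}.
    """
    gone = commits.pop(victim)
    for node in commits.values():
        if victim in node["parents"]:
            # The victim's fileops go in front of the child's own.
            node["ops"] = gone["ops"] + node["ops"]
            # The victim is replaced in the parent list by its own parents.
            i = node["parents"].index(victim)
            node["parents"][i:i + 1] = [
                p for p in gone["parents"] if p not in node["parents"]
            ]
    return commits


history = {
    "A": {"parents": [], "ops": ["M README"]},
    "B": {"parents": ["A"], "ops": ["M main.c"]},
    "C": {"parents": ["B"], "ops": ["D README"]},
}
squash_forward(history, "B")
print(history["C"])
```

After the call, C has A as its parent and carries B's fileop ahead of its own, which is why the changes made by a squashed commit survive in its children.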

Warning
It is easy to get the bounds of a squash command wrong, with confusing and destructive results. Beware thinking you can squash on a selection set to merge all commits except the last one into the last one; what you will actually do is to merge all of them to the first commit after the selected set.

Normally, any tag pointing to a combined commit will also be pushed forward. But see the list of policy modifiers below for how to change this.

Following all operation moves, every one of the altered file operation lists is reduced to a shortest normalized form. The normalized form detects various combinations of modification, deletion, and renaming and simplifies the operation sequence as much as it can without losing any information.

The following modifiers change these policies:

--delete

Simply discards all file ops and tags associated with deleted commit(s).

--no-coalesce

Do not normalize the modified commit operations.

--pushback

Append fileops to parents, rather than prepending to children.

--pushforward

Prepend fileops to children. This is the default; it can be specified in a lift script for explicitness about intentions.

--tagforward

Any tag on the deleted commit is pushed forward to the first child rather than being deleted. This is the default; it can be specified for explicitness.

--tagback

Any tag on the deleted commit is pushed backward to the first parent rather than being deleted.

--quiet

Suppresses warning messages about deletion of commits with non-delete fileops.

--complain

The opposite of --quiet. Can be specified for explicitness.

--empty-only

Complain if a squash operation modifies a nonempty comment.

--blobs

Allow deletion of selected blobs.

Under any of these policies except --delete, deleting a commit that has children does not back out the changes made by that commit, as they will still be present in the blobs attached to versions past the end of the deletion set. All a delete does when the commit has children is lose the metadata information about when and by whom those changes were actually made; after the delete any such changes will be attributed to the first undeleted children of the deleted commits. It is expected that this command will be useful mainly for removing commits mechanically generated by repository converters such as cvs2svn.

{SELECTION} delete [--quiet] {commit | {path|tag|branch|reset} [--not] PATTERN}

With "commit" or no subcommand, delete a selection set of events. Requires an explicit selection set. Tags, resets, and passthroughs are deleted with no side effects. Blobs cannot be directly deleted with this command; they are removed only when removal of fileops associated with commits requires this. A commit delete is equivalent to a squash with the --delete flag.

All other subcommands require a selected repository and a PATTERN argument which is a pattern expression; with the option --not, invert the match.

With "branch", if the pattern does not begin with "refs/", that is prepended. Matching branches are deleted. Associated tags and resets are also deleted.

With "path", expunge files from the selected portion of the repo history; the default is the entire history. The argument to this command is a pattern expression matching paths. If the pattern is enclosed by double quotes it may contain spaces; the double quotes are stripped off before it is interpreted as a delimited regexp or literal string. The option --not inverts this; all file paths other than those selected by the remaining arguments are expunged. You may use this to sift out all file operations matching a pattern set rather than expunging them.

With "reset", all matching resets are deleted. If RESET-PATTERN is a text literal, each reset’s name is matched if RESET-PATTERN is either the entire reference (refs/heads/FOO or refs/tags/FOO for some value of FOO) or the basename (e.g. FOO), or a suffix of the form heads/FOO or tags/FOO. An unqualified basename is assumed to refer to a branch in refs/heads/. When a reset is deleted, matching branch fields are changed to match the branch of the unique descendant of the tip commit of the associated branch, if there is one.
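One way to read the literal-matching rule for resets is as a normalization step followed by an exact comparison. The Python sketch below takes that reading; the function names are hypothetical, and the real matcher may differ in details such as basenames that themselves contain slashes.

```python
def qualify_reset_name(literal):
    """Expand a literal reset name to a full reference."""
    if literal.startswith("refs/"):
        return literal                  # already the entire reference
    if literal.startswith(("heads/", "tags/")):
        return "refs/" + literal        # suffix form: heads/FOO or tags/FOO
    return "refs/heads/" + literal      # bare basename: assumed to be a branch


def reset_matches(literal, ref):
    """True if the literal names the reset whose full reference is `ref`."""
    return qualify_reset_name(literal) == ref


print(reset_matches("FOO", "refs/heads/FOO"))
print(reset_matches("tags/FOO", "refs/tags/FOO"))
```

Under this reading, a bare FOO never selects refs/tags/FOO; to reach a reset in the tags namespace you would write tags/FOO or the full reference.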

With "tag", requires a TAG-PATTERN argument that is a pattern expression matching a set of annotated tags. Matching tags are deleted. Giving a regular expression rather than a plain string is useful for mass deletion of junk tags such as those derived from CVS branch-root tags. The option "--not" takes the complement of the set of tags implied by the TAG-PATTERN. Deletions can be restricted by a selection set in the normal way.

All filemodify (M) operations and delete (D) operations involving a matched file in the selected set of events are disconnected from the repo and put in a removal set. Renames are followed as the tool walks forward in the selection set; each triggers a warning message. If a selected file is a copy (C) target, the copy will be deleted and a warning message issued. If a selected file is a copy source, the copy target will be added to the list of paths to be deleted and a warning issued.

After file expunges have been performed, any commits with no remaining file operations will be deleted, and any tags pointing to them. By default each deleted commit is replaced with a tag of the form emptycommit-<ident> on the preceding commit unless the --notagify option is specified. Commits with deleted fileops pointing both in and outside the path set are not deleted.

This command clears all Q bits. The "path" mode then sets true on any commit which lost fileops but was not entirely deleted.

[SELECTION] rename {repo | path PATTERN [--force] | {path|branch|tag|reset} [--not] PATTERN} NEW-NAME

With "repo", renames the currently chosen repo; requires a NEW-NAME argument. Won’t do it if there is already one by the new name.

Other subcommands require a PATTERN which is a pattern expression. NEW-NAME may contain back-reference syntax (${1} etc.). See "help regexp" for more information about regular expressions. If PATTERN or NEW-NAME are wrapped by double quotes they may contain whitespace; the quotes are stripped before further interpretation as a delimited regexp or literal string. The --not option inverts the selection for renaming.

With "path", rename a path in every fileop of every selected commit. The default selection set is all commits. The pattern expression is matched against paths. Ordinarily, if the target path already exists in the fileops, or is visible in the ancestry of the commit, this command throws an error. With the --force option, these checks are skipped.

With "branch", "tag", or "reset", rename objects that match by name.

Renaming branches also operates on any associated annotated tags and resets. Bear in mind that a Git lightweight tag here is simply a branch in the tags/ namespace.

In a branch rename, the third argument may be any token that is a syntactically valid branch name (but not the name of an existing branch). If it does not begin with "refs/", then "refs/" is prepended; you should supply "heads/" or "tags/" yourself. You cannot rename a branch to the name of an existing branch unless they are joined root to tip, making the operation effectively a merge.

Branch rename has some special behavior when the repository source type is Subversion. It recognizes tags and resets made from branch-copy commits and transforms their names as though they were branch fields in commits.

When a reset is renamed, commit branch fields matching the tag are renamed with it to match.

Rename sets Q bits; true on every object modified, false otherwise.

10.2. Commit mutation

SELECTION merge

Create a merge link. Takes a selection set argument, ignoring all but the lowest (source) and highest (target) members. Creates a merge link from the highest member (child) to the lowest (parent).

This command will throw an error if you try to make a merge link to a parentless (e.g. root) commit, as that would produce an invalid fast-import stream.

If the command succeeds, all Q bits are cleared, then the Q bits of the two commits are set.

SELECTION unmerge

Linearizes a commit. Takes a selection set argument, which must resolve to a single commit, and removes all its parents except for the first. It is equivalent to reparent --rebase {first parent},{commit}, where {commit} is the selection set given to unmerge and {first parent} is a set resolving to that commit’s first parent, but doesn’t need you to find the first parent yourself, saving time and avoiding errors when nearby surgery would make a manual first parent argument stale.

If the command succeeds, all Q bits are cleared, then the Q bit of the unmerged commit is set.

SELECTION reparent [--use-order] [--rebase]

Changes the parent list of a commit. Takes a selection set, zero or more option arguments, and an optional policy argument.

The selection set must resolve to one or more commits. The selected commit with the highest event number (not necessarily the last one selected) is the commit to modify. The remainder of the selected commits, if any, become its parents: the selected commit with the lowest event number (which is not necessarily the first one selected) becomes the first parent, the selected commit with second lowest event number becomes the second parent, and so on. All original parent links are removed. Examples:

# this makes 17 the parent of 33
17,33 reparent

# this also makes 17 the parent of 33
33,17 reparent

# this makes 33 a root (parentless) commit
33 reparent

# this makes 33 an octopus merge commit.  its first parent
# is commit 15, second parent is 17, and third parent is 22
22,33,15,17 reparent

With --use-order, use the selection order to determine which selected commit is the commit to modify and which are the parents (and if there are multiple parents, their order). The last selected commit (not necessarily the one with the highest event number) is the commit to modify, the first selected commit (not necessarily the one with the lowest event number) becomes the first parent, the second selected commit becomes the second parent, and so on. Examples:

# this makes 33 the parent of 17
33,17 reparent --use-order

# this makes 17 an octopus merge commit.  its first parent
# is commit 22, second parent is 33, and third parent is 15
22,33,15,17 reparent --use-order
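The two ordering rules can be summarized as a small Python model. It is a sketch of the selection logic only (the name plan_reparent is made up for illustration), and it treats the selection as a plain list of event numbers.

```python
def plan_reparent(selection, use_order=False):
    """Return (commit_to_modify, parents) for a list of event numbers.

    Without use_order the selection is sorted by event number first;
    with it, the order in which the commits were selected is kept.
    """
    ordered = list(selection) if use_order else sorted(selection)
    return ordered[-1], ordered[:-1]


print(plan_reparent([22, 33, 15, 17]))
print(plan_reparent([22, 33, 15, 17], use_order=True))
```

In both modes the last element of the ordered list is the commit that gets modified, and everything before it becomes the parent list in that order. A singleton selection yields an empty parent list, which is the root-commit case.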

Because ancestor commit events must appear before their descendants, giving a commit with a low event number a parent with a high event number triggers a re-sort of the events. A re-sort assigns different event numbers to some or all of the events. Re-sorting only works if the reparenting does not introduce any cycles. To swap the order of two commits that have an ancestor-descendant relationship without introducing a cycle during the process, you must reparent the descendant commit first.

With "--rebase", change the way the manifest of the reparented commit is generated. By default, the manifest of the reparented commit is computed before modifying it; a "deleteall" and some fileops are prepended so that the manifest stays unchanged even when the first parent has been changed. The --rebase flag inhibits the default behavior: no "deleteall" is issued, and the tree contents of all descendants can be modified as a result.

[SELECTION] split [ --path ] PATH-OR-INDEX

Split a specified commit in two, the opposite of squash.

The selection set is required to be a commit location; the required argument identifies a fileop. If it is numeric, it is interpreted as an integer 1-origin index of a file operation within the commit. If not, it must be a pathname to match. The option --path forces the pathname interpretation.

The commit is copied and inserted into a new position in the event sequence, immediately following itself; the duplicate becomes the child of the original, and replaces it as parent of the original’s children. Commit metadata is duplicated; the mark of the new commit is then changed. If the new commit has a legacy ID, the suffix '.split' is appended to it.

Finally, some file operations - starting at the one matched or indexed by an index argument - are moved forward from the original commit into the new one. Legal indices are 2-n, where n is the number of file operations in the original commit.
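The index arithmetic can be illustrated with a short Python sketch. It assumes that every operation from the indexed one onward moves to the new commit; split_fileops is a hypothetical name, not part of reposurgeon.

```python
def split_fileops(ops, index):
    """Divide a fileop list at a 1-origin index.

    Returns (ops kept by the original commit, ops moved to the new commit).
    """
    if not 2 <= index <= len(ops):
        raise ValueError(f"legal indices are 2-{len(ops)}")
    return ops[:index - 1], ops[index - 1:]


kept, moved = split_fileops(["M a.c", "M b.c", "D old.c"], 2)
print(kept, moved)
```

Index 1 is rejected because it would leave the original commit with no file operations at all.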

Sets Q bits on the split commits; clears all others.

SELECTION add { "D" PATH | "M" PERM {MARK|SHA1} PATH | "R" SOURCE TARGET | "C" SOURCE TARGET }

PATH, SOURCE and TARGET arguments may be double-quoted strings containing whitespace.

In specified commits, add a specified fileop. The selection set does not have to be a singleton, but typically will be.

For a D operation to be valid there must be an M operation for the path in the commit’s ancestry.

For an M operation to be valid, PERM must either be a token ending with 755 or 644 indicating a normal file permission value, or one of the special values 120000 or 160000.

If PERM is a normal file permission value or 120000, it must be followed by a MARK field referring to a blob that precedes the commit location. If the MARK is nonexistent or names something other than a blob, attempting to rebuild a live repository will throw a fatal error.

If PERM is 160000, the third field is assumed to be a hash value and not checked, as it is expected to refer to a Git submodule link.
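The PERM rules for an M operation amount to a three-way classification, sketched here in Python. The function name and the returned labels are illustrative assumptions; only the permission values come from the rules above.

```python
def third_field_kind(perm):
    """Say how the field after PERM is treated in an M fileop."""
    if perm == "160000":
        return "hash"    # Git submodule link; the value is not checked
    if perm == "120000" or perm.endswith(("755", "644")):
        return "mark"    # must refer to a blob that precedes the commit
    raise ValueError(f"unrecognized PERM: {perm}")


for perm in ("100644", "100755", "120000", "160000"):
    print(perm, third_field_kind(perm))
```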

For an R or C operation to be valid, there must be an M operation for the SOURCE path in the commit’s ancestry.

Clears Q bits, then sets the Q bit of every commit to which a fileop is added.

Some examples:

# At commit :15, stop .gitignore from being checked out in later revisions
:15 add D .gitignore

# Create a new blob :2317 with specified content. At commit :17, add modify
# or creation of a file named "captain" with its content in the new blob.
# Make it check out with 755 (-rwxr-xr-x) permissions rather than the
# normal 644 (-rw-r--r--).
create blob :2317 <<EOF
Hello, I must be going.
EOF
:17 add M 100755 :2317 captain

[SELECTION] remove {INDEX | ["D"|"M"|"R"|"C"|"N"] [PATH]} [to TARGET]

From a specified commit, removes all fileops matching the given selector(s). Selectors may consist of: a 1-origin index, or an optional fileop type and an optional path.

If the to clause is present, the removed op is appended to the commit specified by the following singleton selection set.

Sets Q bits: true for each commit modified and blob with altered references, false otherwise.

Examples:

# From the commit at :423, remove any fileop referencing foobar.txt
:423 remove foobar.txt

# From the commit at :423, remove the second fileop.
:423 remove 2

# From the commit at :423, remove all deletes
:423 remove D

[SELECTION] tagify [ --tagify-merges | --canonicalize | --tipdeletes ]

Search for empty commits and turn them into tags. May be useful in cleaning up Subversion conversions that had previously been lifted with cvs2svn.

Takes an optional selection set argument defaulting to all commits. For each commit in the selection set, turn it into a tag with the same message and author information if it has no fileops. By default merge commits are not considered, even if they have no fileops (thus no tree differences with their first parent). To change that, see the '--tagify-merges' option.

The name of the generated tag will be 'emptycommit-<ident>', where <ident> is generated from the legacy ID of the deleted commit, or from its mark, or from its index in the repository, with a disambiguation suffix if needed.

tagify currently recognizes three options: first is '--canonicalize' which makes tagify try harder to detect trivial commits by first removing all fileops of the selected commits which have no actual effect when processed by fast-import. For example, file modification ops that don’t actually change the content of the file, or deletion ops that delete a file that doesn’t exist in the parent commit get removed. This rarely happens naturally, but can happen after some surgical operations, such as reparenting.

The second option is '--tipdeletes' which makes tagify also consider branch tips with only deleteall fileops to be candidates for tagification. The corresponding tags get names of the form 'tipdelete-<branchname>' rather than the default 'emptycommit-<ident>'.

The third option is '--tagify-merges' that makes reposurgeon also tagify merge commits that have no fileops. When this is done the merge link is moved to the tagified commit’s parent.

This command clears all Q bits, then sets the Q bits of all tags it creates.

[SELECTION] reorder [--quiet]

Re-order a contiguous range of commits.

Older revision control systems tracked change history on a per-file basis, rather than as a series of atomic "changesets", which often made it difficult to determine the relationships between changes. Some tools which convert a history from one revision control system to another attempt to infer changesets by comparing file commit comment and time-stamp against those of other nearby commits, but such inference is a heuristic and can easily fail.

In the best case, when inference fails, a range of commits in the resulting conversion which should have been coalesced into a single changeset instead end up as a contiguous range of separate commits. This situation typically can be repaired easily enough with the 'coalesce' or 'squash' commands. However, in the worst case, numerous commits from several different "topics", each of which should have been one or more distinct changesets, may end up interleaved in an apparently chaotic fashion. To deal with such cases, the commits need to be re-ordered, so that those pertaining to each particular topic are clumped together, and then possibly squashed into one or more changesets pertaining to each topic. This command, 'reorder', can help with the first task; the 'squash' command with the second.

Selected commits are re-arranged in the order specified; for instance: ":7,:5,:9,:3 reorder". The specified commit range must be contiguous; each commit must be accounted for after re-ordering. Thus, for example, ':5' cannot be omitted from ":7,:5,:9,:3 reorder". (To drop a commit, use the 'delete' or 'squash' command.) The selected commits must represent a linear history; however, the lowest numbered commit being re-ordered may have multiple parents, and the highest numbered may have multiple children.

Re-ordered commits and their immediate descendants are inspected for elementary fileops inconsistencies. Warns if re-ordering results in a commit trying to delete, rename, or copy a file before it was ever created. Likewise, warns if all of a commit’s fileops become no-ops after re-ordering. Other fileops inconsistencies may arise from re-ordering, both within the range of affected commits and beyond; for instance, moving a commit which renames a file ahead of a commit which references the original name. Such anomalies can be discovered via manual inspection and repaired with the 'add' and 'remove' (and possibly 'path') commands. Warnings can be suppressed with '--quiet'.

In addition to adjusting their parent/child relationships, re-ordering commits also re-orders the underlying events since ancestors must appear before descendants, and blobs must appear before commits which reference them. This means that events within the specified range will have different event numbers after the operation.

10.3. Advanced branch operations

branchlift SOURCEBRANCH PATHPREFIX [NEWNAME]

Every commit on SOURCEBRANCH with fileops matching the PATHPREFIX is examined; all commits with every fileop matching the PATHPREFIX are moved to a new branch; if a commit has only some matching fileops it is split and the fragment containing the matching fileops is moved.

Every matching commit is modified to have the branch label specified by NEWNAME. If NEWNAME is not specified, the basename of PATHPREFIX is used. If the resulting branch already exists, this command errors out without modifying the repository.

The PATHPREFIX is removed from the paths of all fileops in modified commits.
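The per-commit decision can be sketched in Python as below. Several things here are assumptions made for illustration: that PATHPREFIX is matched as a directory prefix, that the default branch name is the bare basename, and the name lift_paths itself.

```python
import posixpath


def lift_paths(paths, prefix, newname=None):
    """Partition one commit's fileop paths for a branchlift.

    Returns (branch name, lifted paths with the prefix removed, kept paths).
    """
    stem = prefix.rstrip("/") + "/"
    branch = newname or posixpath.basename(prefix.rstrip("/"))
    lifted = [p[len(stem):] for p in paths if p.startswith(stem)]
    kept = [p for p in paths if not p.startswith(stem)]
    return branch, lifted, kept


print(lift_paths(["vendor/zlib/inflate.c", "src/main.c"], "vendor/zlib"))
```

If the kept list comes back empty, the whole commit moves to the new branch; if both lists are non-empty, the commit is split and only the lifted fragment moves.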

All three names may be bare tokens or double-quoted strings.

Sets Q bits: commits on the source branch modified by having fileops lifted to the new branch true, all others false.

debranch SOURCE-BRANCH [TARGET-BRANCH]

Takes one or two arguments which must be the names of source and target branches; if the second (target) argument is omitted it defaults to 'master'. The history of the source branch is merged into the history of the target branch, becoming the history of a subdirectory with the name of the source branch. Any trailing segment of a branch name is accepted as a synonym for it; thus 'master' is the same as 'refs/heads/master'. Any resets of the source branch are removed.

Clears all Q bits, then sets the Q bit of every commit that has its branch field modified.

10.4. Repository splitting and merging

SELECTION divide

Attempt to partition a repo by cutting the parent-child link between two specified commits (they must be adjacent). Does not take a general selection-set argument. It is only necessary to specify the parent commit, unless it has multiple children in which case the child commit must follow (separate it with a comma).

If the repo was named 'foo', you will normally end up with two repos named 'foo-early' and 'foo-late'. But if the commit graph would remain connected through another path after the cut, the behavior changes. In this case, if the parent and child were on the same branch 'qux', the branch segments are renamed 'qux-early' and 'qux-late', but the repo is not divided.

unite [--prune] [REPO-NAME…​]

Unite named repositories into one. Repos need to be loaded (read) first. They will be processed and removed from the load list. The union repo will be selected.

All repos are grafted as branches to the oldest repo. The branch point will be the last commit in that repo with a timestamp that is less than or equal to the earliest commit on a grafted branch.
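The branch-point rule can be modelled with a few lines of Python. This is a sketch under the assumption that commits are represented only by their timestamps, listed in event order; branch_point is a hypothetical name.

```python
def branch_point(trunk_times, graft_times):
    """Index of the last trunk commit dated at or before the graft's earliest commit.

    Returns None when every trunk commit is later than the graft.
    """
    earliest = min(graft_times)
    candidates = [i for i, t in enumerate(trunk_times) if t <= earliest]
    return candidates[-1] if candidates else None


print(branch_point([100, 200, 300, 400], [250, 500]))
```

Here the grafted branch starts at time 250, so it attaches to the trunk commit stamped 200 (index 1), the latest one that does not postdate it.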

In all repositories but the first, tag and branch duplicate names will be disambiguated using the source repository name. After all grafts, marks will be renumbered.

The name of the new repo is composed from names of united repos joined by '+'. It will have no source directory. The type of repo will be inherited if all repos share the same type, otherwise no type will be set.

With the option --prune, at each join generate D ops for every file that doesn’t have a modify operation in the root commit of the branch being grafted on.

[SELECTION] graft [--prune] REPO-NAME

For when unite doesn’t give you enough control. This command may have either of two forms, distinguished by the size of the selection set. The first argument is always required to be the name of a loaded repo.

If the selection set is of size 1, it must identify a single commit in the currently chosen repo; in this case the named repo’s root will become a child of the specified commit. If the selection set is empty, the named repo must contain one or more callouts matching commits in the currently chosen repo.

Labels and branches in the named repo are prefixed with its name; then it is grafted to the selected one. Any other callouts in the named repo are also resolved in the context of the currently chosen one. Finally, the named repo is removed from the load list.

With the option --prune, prepend a deleteall operation into the root of the grafted repository.

10.5. Metadata editing

[SELECTION] msgout [--id] [--filter=PATTERN] [--decode=CODEC] [--blobs]

Emit a file of messages in Internet Message Format representing the contents of repository metadata. Takes a selection set; members of the set other than commits, annotated tags, and passthroughs are ignored (that is, presently, blobs and resets).

May have an option --filter, followed by a pattern expression (unanchored matching). If this is given, only headers with names matching it are emitted. In this context the name of the header includes its trailing colon. See "help regexp" for information on the regexp syntax.

Blobs may be included in the output with the option --blobs.

With the --decode option, the CODEC argument must name one of the codecs known to the Go standard codecs library; see the documentation of the transcode command for details. Transcode the output to UTF-8 using the specified codec. Transcoding errors abort the command.

The following example produces a mailbox of commit comments in a decluttered form that is convenient for editing:

=C msgout --filter="/Committer:|Committer-Date:|Check-Text:/"

This is the filter set by the --id option.

This command can be safely interrupted with ^C, returning you to the prompt.

[SELECTION] msgin [--create] [--empty-only] [--relax] [<INFILE]

Accept a file of messages in Internet Message Format representing the contents of the metadata in selected commits and annotated tags. If there is an argument, it will be taken as the name of a message-box file to read from; if there is no argument, or the argument is '-', reads from standard input. Supports < redirection. Ordinarily takes no selection set.

Users should be aware that modifying an Event-Number or Event-Mark field will change which event the update from that message is applied to. This is unlikely to have good results.

The header Check-Text, if present, is examined to see if the comment text of the associated event begins with it. If not, the item modification is aborted. This helps ensure that you are landing updates on the events you intend.

If the --create modifier is present, new tags and commits will be appended to the repository. In this case it is an error for a tag name to match any existing tag name. Commit events are created with no fileops. If Committer-Date or Tagger-Date fields are not present they are filled in with the time at which this command is executed. If Committer or Tagger fields are not present, reposurgeon will attempt to deduce the user’s git-style identity and fill it in. If a singleton commit set was specified for commit creations, the new commits are made children of that commit.

If the --create modifier is present and a commit-creation block has a Content-Path header, the header is interpreted as a file path to be appended to the commit and an appropriate blob is prepended containing the file contents. Fileop permissions are set depending on the file’s executable bit. If there is a Content-Name header it overrides the path put in the fileop.

Otherwise, if the Event-Number and Event-Mark fields are absent, the msgin logic will attempt to match the commit or tag first by Legacy-ID, then by a unique committer ID and timestamp pair.

If the option --empty-only is given, this command will throw a recoverable error if it tries to alter a message body that is neither empty nor consists of the CVS empty-comment marker.

The --relax option suppresses warnings about message blocks not matching any object, but leaves fatal errors due to ill-formed mailbox elements and multiple matches unsuppressed.

This operation sets Q bits; true where an object was modified by it, false otherwise.

[SELECTION] setfield FIELD VALUE

The FIELD and VALUE arguments can be double-quoted strings containing whitespace. C-style backslash escapes are interpreted in VALUE.

In the selected events (defaulting to none) set every instance of a named field to a string value. The value field may be quoted to include whitespace, and use backslash escapes interpreted by Go’s C-like string-escape codec, such as \s.

Attempts to set nonexistent attributes are ignored. Valid values for the attribute are internal field names; in particular, for commits, 'comment' and 'branch' are legal. Consult the source code for other interesting values.

The special fieldnames 'author', 'commitdate' and 'authdate' apply only to commits in the range. The latter two set attribution dates. The first sets the author’s name and email address (assuming the value can be parsed for both), copying the committer timestamp. The author’s timezone may be deduced from the email address.

Clears all Q bits, then sets them only on events that are actually modified.

[SELECTION] attribute [ATTR-SELECTION] {show|set|delete|prepend|append} [ARG…​]

Inspect, modify, add, and remove commit and tag attributions.

Attributions upon which to operate are selected in much the same way as events are selected, as described in Selection syntax. ATTR-SELECTION is an expression composed of 1-origin attribution-sequence numbers, ‘$’ for last attribution, ‘..’ ranges, comma-separated items, ‘(…)’ grouping, set operations ‘|’ union, ‘&’ intersection, and ‘~’ negation, and function calls @min(), @max(), @amp(), @pre(), @suc(), @srt(). Attributions can also be selected by visibility set ‘=C’ for committers, ‘=A’ for authors, and ‘=T’ for taggers. Finally, ‘/regexp/’ will attempt to match the regular expression regexp against an attribution name and email address; ‘/n’ limits the match to only the name, and ‘/e’ to only the email address.

With the exception of ‘show’, all actions require an explicit event selection upon which to operate. Available actions are:

[ATTR-SELECTION] show [>OUTFILE]

Inspect the selected attributions of the specified events (commits and tags). The ‘show’ keyword is optional. If no attribution selection expression is given, defaults to all attributions. If no event selection is specified, defaults to all events. Supports > redirection.

{ATTR-SELECTION} set [NAME] [EMAIL] [DATE]

Assign NAME, EMAIL, DATE to the selected attributions. As a convenience, if only some fields need to be changed, the others can be omitted. Arguments NAME, EMAIL, and DATE can be given in any order.

[ATTR-SELECTION] delete

Delete the selected attributions. As a convenience, deletes all authors if ATTR-SELECTION is not given. It is an error to delete the mandatory committer and tagger attributions of commit and tag events, respectively.

[ATTR-SELECTION] prepend [NAME] [EMAIL] [DATE]

Insert a new attribution before the first attribution named by ATTR-SELECTION. The new attribution has the same type (committer, author, or tagger) as the one before which it is being inserted. Arguments NAME, EMAIL, and DATE can be given in any order.

If NAME is omitted, an attempt is made to infer it from EMAIL by trying to match EMAIL against an existing attribution of the event, with preference given to the attribution before which the new attribution is being inserted. Similarly, EMAIL is inferred from an existing matching NAME. Likewise, for DATE.

As a convenience, if ATTR-SELECTION is empty or not specified a new author is prepended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use "setfield" instead.

[ATTR-SELECTION] append [NAME] [EMAIL] [DATE]

Insert a new attribution after the last attribution named by ATTR-SELECTION. The new attribution has the same type (committer, author, or tagger) as the one after which it is being inserted. Arguments NAME, EMAIL, and DATE can be given in any order.

If NAME is omitted, an attempt is made to infer it from EMAIL by trying to match EMAIL against an existing attribution of the event, with preference given to the attribution after which the new attribution is being inserted. Similarly, EMAIL is inferred from an existing matching NAME. Likewise, for DATE.

As a convenience, if ATTR-SELECTION is empty or not specified a new author is appended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use "setfield" instead.

SELECTION prepend [--lstrip] [--legacy] TEXT

Prepend text to the comments of commits and tags in the specified selection set. The text is the first token of the command and may be a double-quoted string containing whitespace. C-style escape sequences in TEXT are interpreted.

If the option --lstrip is given, the comment is left-stripped before the new text is prepended. If the option --legacy is given, the string %LEGACY% in the prepend payload is replaced with the commit’s legacy-ID before it is prepended.

Sets Q bits: true for each commit and tag modified, false otherwise.

Example:

=C prepend --legacy "Legacy-Id: %LEGACY%\n"

SELECTION append [--rstrip] [--legacy] TEXT

Append text to the comments of commits and tags in the specified selection set. The text is the first token of the command and may be a double-quoted string containing whitespace. C-style escape sequences in TEXT are interpreted.

If the option --rstrip is given, the comment is right-stripped before the new text is appended. If the option --legacy is given, the string %LEGACY% in the append payload is replaced with the commit’s legacy-ID before it is appended.

Sets Q bits: true for each commit and tag modified, false otherwise.

Example:

=C append --legacy "\nLegacy-Id: %LEGACY%"

[SELECTION] gitify

Attempt to massage comments into a git-friendly form with a blank separator line after a summary line. This code assumes it can insert a blank line if the first line of the comment ends with '.', ',', ':', ';', '?', or '!'. If the separator line is already present, the comment won’t be touched.

Takes a selection set, defaulting to all commits and tags.

Sets Q bits: true for each commit and tag with a comment modified by this command, false on all other events.
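
The gitify heuristic described above can be illustrated with a small sketch. This is Python written for illustration only, not reposurgeon's own Go code; the function name gitify_comment is hypothetical, and the handling of trailing whitespace on the first line is an assumption.

```python
def gitify_comment(comment: str) -> str:
    """Insert a blank separator line after the summary line when it looks safe."""
    lines = comment.split("\n")
    # Single-line comments, and comments that already have the separator,
    # are left untouched.
    if len(lines) < 2 or lines[1].strip() == "":
        return comment
    # Only split when the first line ends with sentence-like punctuation.
    if lines[0].rstrip().endswith((".", ",", ":", ";", "?", "!")):
        return lines[0] + "\n\n" + "\n".join(lines[1:])
    return comment
```

A first line that does not end in one of the listed punctuation marks is left alone, since the sketch has no evidence that it is a complete summary.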

[SELECTION] filter {dedos|shell|regex|replace} [TEXT-OR-RE]

Run blobs, commit comments and committer/author names, or tag comments and tag committer names in the selection set through the filter specified on the command line.

With any verb other than dedos, attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

When filtering blobs, if the command line contains the magic cookie '%PATHS%' it is replaced with a space-separated list of all paths that reference the blob.

With the verb shell, the remainder of the line specifies a filter as a shell command - reposurgeon does not interpret double quotes there, passing them to the shell. Each blob or comment is presented to the filter on standard input; the content is replaced with whatever the filter emits to standard output.

With the verb regex, the remainder of the line is expected to be a Go regular expression substitution written as /from/to/ with C-like backslash escapes interpreted in 'to'. Any punctuation character will work as a delimiter in place of the /; this makes it easier to use / in patterns. Ordinarily only the first such substitution is performed; putting 'g' after the slash replaces globally, and a numeric literal gives the maximum number of substitutions to perform. Other flags available restrict substitution scope - 'c' for comment text only, 'C' for committer name only, 'a' for author names only.

With the verb replace, the behavior is like regex but the expressions are not interpreted as regular expressions. (This is slightly faster).

With the verb dedos, DOS/Windows-style \r\n line terminators are replaced with \n.

All variants of this command set Q bits; events actually modified by the command get true, all other events get false.

Some examples:

# In all blobs, expand tabs to 8-space tab stops
=B filter shell expand --tabs=8

# Text replacement in comments
=C filter replace /Telperion/Laurelin/c

# Specifications with embedded spaces must be quoted
=C filter replace "/Elendil/Ar-Pharazon the Golden/"
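
To make the /from/to/ syntax of the regex verb concrete, here is a rough Python sketch of how such a specification might be split and applied. The names parse_subst and substitute are hypothetical. The sketch uses Python's re module rather than Go regular expressions, ignores the scope flags 'c', 'C' and 'a', and does not handle an escaped delimiter inside the pattern.

```python
import re

def parse_subst(spec: str):
    """Split DELIM-from-DELIM-to-DELIM-flags into (pattern, replacement, count)."""
    if len(spec) < 3:
        raise ValueError("ill-formed substitution")
    delim = spec[0]                 # any punctuation character may delimit
    parts = spec.split(delim)
    if len(parts) != 4:
        raise ValueError("ill-formed substitution")
    _, pattern, replacement, flags = parts
    count = 1                       # default: only the first match
    if "g" in flags:
        count = 0                   # re.sub treats 0 as "no limit"
    digits = "".join(ch for ch in flags if ch.isdigit())
    if digits:
        count = int(digits)         # explicit maximum number of substitutions
    return pattern, replacement, count

def substitute(spec: str, text: str) -> str:
    """Apply a parsed substitution specification to text."""
    pattern, replacement, count = parse_subst(spec)
    return re.sub(pattern, replacement, text, count=count)
```

Choosing a delimiter other than / (a comma, say) lets a pattern such as a/b be written without escaping.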

10.6. Path reports and modifications

SELECTION setperm PERM [PATH…​]

The PERM and PATH arguments can be double-quoted strings containing whitespace. This is only likely to be useful on PATH.

For the selected events (defaulting to none) take the first argument as an octal literal describing permissions. All subsequent arguments are paths. For each M fileop in the selection set and exactly matching one of the paths, patch the permission field to the first argument value.

Sets Q bits: true if a commit was actually modified by this operation, false otherwise.

10.7. Timequakes and time offsets

Modifying a repository so every commit in it has a unique timestamp is often a useful thing to do, in order for every commit to have a unique action stamp that can be referred to in surgical commands.

The ‘lint’ command will tell you if you have timestamp collisions.

[SELECTION] timequake [--tick]

Attempt to hack committer and author time stamps to make all action stamps in the selection set (defaulting to all commits in the repository) unique. Works by identifying collisions between parent and child, then incrementing child timestamps so they no longer coincide. Won’t touch commits with multiple parents.

Because commits are checked in ascending order, this logic will normally do the right thing on chains of three or more commits with identical timestamps.

Any collisions left after this operation are probably cross-branch and have to be individually dealt with using 'timeoffset' commands.

The normal use case for this command is early in converting CVS or Subversion repositories, to ensure that the surgical language can count on having a unique action-stamp ID for each commit.

This command sets Q bits: true on each event with a timestamp bumped, false on all other events.

With --tick, instead set all commit and tag timestamps in accordance with a monotonic clock that ticks once per repository object in sequence.
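
The parent/child bumping idea can be sketched as follows. This is an illustrative Python model, not the actual implementation: the function name, the dictionary layout, and in particular the rule of comparing original timestamps (so that a chain of identical stamps ends up strictly increasing) are assumptions.

```python
def timequake(commits):
    """Bump colliding child timestamps by one second past their parent.

    commits: list of dicts with 'id', 'parents' (list of ids) and 'time'
    (integer seconds), ordered so that parents precede children.
    Returns the ids of the commits whose timestamps were bumped.
    """
    by_id = {c["id"]: c for c in commits}
    original = {c["id"]: c["time"] for c in commits}
    bumped = []
    for c in commits:
        if len(c["parents"]) != 1:
            continue                      # roots and merges are left alone
        parent = by_id[c["parents"][0]]
        # A collision means the stamps coincided before any bumping.
        if original[c["id"]] == original[parent["id"]]:
            c["time"] = parent["time"] + 1
            bumped.append(c["id"])
    return bumped
```

Because the walk is in ascending order, each child is compared against a parent that has already received its final timestamp.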

[SELECTION] timeoffset OFFSET [TIMEZONE]

Apply a time offset to all time/date stamps in the selected set. An offset argument is required; it may be in the form [-]ss, [-]mm:ss or [+-]hh:mm:ss. The leading sign is optional. With no argument, the default is 1 second.

Optionally you may also specify another argument in the form [+-]hhmm, a timezone literal to apply to each attribution in the range. To apply a timezone without an offset, use an offset literal of 0, +0 or -0.

Clears Q bits, then sets the Q bit for every tag or commit with a modified timestamp.
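
As an illustration of the offset formats listed above, the following Python sketch converts an offset literal into signed seconds. The function name parse_offset is hypothetical and the error handling is an assumption; it is not taken from reposurgeon itself.

```python
def parse_offset(text: str) -> int:
    """Convert [-]ss, [-]mm:ss or [+-]hh:mm:ss into a signed number of seconds."""
    sign = 1
    if text and text[0] in "+-":
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    fields = text.split(":")
    if not 1 <= len(fields) <= 3 or not all(f.isdigit() for f in fields):
        raise ValueError("ill-formed time offset")
    seconds = 0
    for f in fields:
        seconds = seconds * 60 + int(f)   # each field is base-60 over the next
    return sign * seconds
```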

Those of you twitchy about "rewriting history" should bear in mind that the commit stamps in many older repositories were never very reliable to begin with.

CVS in particular is notorious for shipping client-side timestamps with timezone and DST issues (as opposed to UTC) that don’t necessarily compare well with stamps from different clients of the same CVS server. Thus, inducing a timequake in a CVS repo seldom produces effects anywhere near as large as the measurement noise of the repository’s own timestamps.

Subversion was somewhat better about this, as commits were stamped at the server, but older Subversion repositories often have sections that predate the era of ubiquitous NTP time.

10.8. Miscellanea

[SELECTION] move {tag|reset} [PATTERN] [--not] [SINGLETON]

Move annotated tags or resets.

The PATTERN argument is a pattern expression matching a set of tags or resets. The option "--not" takes the complement of the set implied by the pattern. The second argument must be a singleton selection set designating a commit.

With the qualifier "tag", attach all matching tags to the target commit.

With the qualifier "reset", attach all matching resets to the target commit. If PATTERN is a text literal, each reset’s name is matched if PATTERN is either the entire reference (refs/heads/FOO or refs/tags/FOO for some value of FOO) or the basename (e.g. FOO), or a suffix of the form heads/FOO or tags/FOO. An unqualified basename is assumed to refer to a branch in refs/heads/. When a reset is moved, no branch fields are changed.

All Q bits are cleared; then any tags or resets that were moved get their Q bit set.

[SELECTION] dedup

Deduplicate blobs in the selection set. If multiple blobs in the selection set have the same SHA1, throw away all but the first, and change fileops referencing them to instead reference the (kept) first blob.
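
A minimal Python sketch of the deduplication idea follows. The data layout (marks paired with byte strings, fileops as dictionaries with a 'ref' key) and the function name are assumptions made for illustration, and the hash here is taken over the raw content only.

```python
import hashlib

def dedup(blobs, fileops):
    """Keep the first blob for each SHA1 and repoint fileops at it.

    blobs: list of (mark, content_bytes) in repository order.
    fileops: list of dicts whose 'ref' key names a blob mark; edited in place.
    Returns the list of blobs that were kept.
    """
    first_by_hash = {}
    replacement = {}
    kept = []
    for mark, content in blobs:
        digest = hashlib.sha1(content).hexdigest()
        if digest in first_by_hash:
            replacement[mark] = first_by_hash[digest]   # duplicate: drop it
        else:
            first_by_hash[digest] = mark
            kept.append((mark, content))
    for op in fileops:
        op["ref"] = replacement.get(op["ref"], op["ref"])
    return kept
```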

renumber

Renumber the marks in a repository, from :1 up to <n> where <n> is the count of the last mark. Just in case an importer ever cares about mark ordering or gaps in the sequence.

A side effect of this command is to clean up stray "done" passthroughs that may have entered the repository via graft operations. After a renumber, the repository will have at most one "done", and it will be at the end of the events.

[SELECTION] transcode ENCODING

Transcode blobs, commit comments, committer/author names, tag comments and tag committer names in the selection set to UTF-8 from the character encoding specified on the command line.

Attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

The ENCODING argument must name one of the codecs listed at https://www.iana.org/assignments/character-sets/character-sets.xhtml and known to the Go standard codecs library.

If a transcode attempt fails on a particular repository object, the object ID and field are logged and the data is left unchanged.

The theory behind the design of this command is that the repository might contain a mixture of encodings used to enter commit metadata by different people at different times. After using "=I" to identify metadata containing non-Unicode high bytes in text, a human must use context to identify which particular encodings were used in particular event spans and compose appropriate transcode commands to fix them up.

This command sets Q bits; objects actually modified by the command get true, all other events get false.

# In all commit comments containing non-ASCII bytes, transcode from Latin-1.
=I transcode latin1
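
What transcoding does to a single field can be sketched in Python. This is only an illustration of the leave-unchanged-on-failure behaviour described above; the function name is hypothetical, and Python's codec names are not guaranteed to coincide with those accepted by the Go codecs library.

```python
def transcode_field(raw: bytes, encoding: str) -> bytes:
    """Return raw re-encoded as UTF-8, or raw unchanged if decoding fails."""
    try:
        return raw.decode(encoding).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # Failure (bad bytes or unknown codec): leave the data as it was.
        return raw
```

For instance, the Latin-1 byte 0xE9 (e with acute accent) becomes the two-byte UTF-8 sequence 0xC3 0xA9.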

11. Artifact handling

Some commands automate fixing various kinds of artifacts associated with repository conversions from older systems.

11.1. Attributions

[SELECTION] authors {read <INFILE | write >OUTFILE}

Apply or dump author-map information for the specified selection set, defaulting to all events.

Lifts from CVS and Subversion may have only usernames local to the repository host in committer and author IDs. DVCSes want email addresses (net-wide identifiers) and complete names. To supply the map from one to the other, an authors file is expected to consist of lines each beginning with a local user ID, followed by a '=' (possibly surrounded by whitespace) followed by a full name and email address. Thus:

fred = Fred J. Foonly <foonly@foo.com> America/New_York

An authors file may also contain lines of this form

+ Fred J. Foonly <foonly@foobar.com> America/Los_Angeles

These are interpreted as aliases for the last preceding = entry that may appear in ChangeLog files. When such an alias is matched on a ChangeLog attribution line, the author attribution for the commit is mapped to the basename, but the timezone is used as is. This accommodates people with past addresses (possibly at different locations) unifying such aliases in metadata so searches and statistical aggregation will work better.

An authors file may have comment lines beginning with #; these are ignored.

When an authors file is applied, email addresses in committer and author metadata for which the local ID matches between < and @ are replaced according to the mapping (this handles git-svn lifts). Alternatively, if the local ID is the entire address, this is also considered a match (this handles what git-cvsimport and cvs2git do). If a timezone was specified in the map entry, that person’s author and committer dates are mapped to it.

With the 'read' modifier, apply author mapping data (from standard input or a <-redirected input file). Q bits are set: true on each commit event with attributions actually modified by the mapping, false on all other events.

With the 'write' modifier, write a mapping file that could be interpreted by 'authors read', with entries for each unique committer, author, and tagger (to standard output or a >-redirected file). This may be helpful as a start on building an authors file, though each part to the right of an equals sign will need editing.

You can also use 'write' after 'read' to dump a list of the name mappings reposurgeon currently knows about.
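
The authors-file layout described above (mapping lines, '+' alias lines, and '#' comments) can be illustrated with a small parser. This Python sketch is an assumption-laden reading of the format, not reposurgeon's parser; the names ENTRY and parse_authors are hypothetical.

```python
import re

# name <email> optionally followed by a timezone such as America/New_York
ENTRY = re.compile(r"^(?P<name>[^<]*?)\s*<(?P<email>[^>]*)>\s*(?P<tz>\S+)?\s*$")

def parse_authors(text):
    """Return (mapping, aliases).

    mapping: local id -> (name, email, timezone or None)
    aliases: list of (local id, name, email, timezone or None) from '+' lines,
             each attached to the last preceding '=' entry.
    """
    mapping, aliases, last = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank lines and comments are ignored
        if line.startswith("+"):
            m = ENTRY.match(line[1:].strip())
            if m and last is not None:
                aliases.append((last, m["name"], m["email"], m["tz"]))
            continue
        local, sep, rest = line.partition("=")
        m = ENTRY.match(rest.strip())
        if sep and m:
            last = local.strip()
            mapping[last] = (m["name"], m["email"], m["tz"])
    return mapping, aliases
```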

11.2. Ignore patterns

ignores [--translate] [--defaults]

Intelligent handling of ignore-pattern files.

This command fails if no repository has been selected, or no preferred write type has been set for the repository. It does not take a selection set.

If --translate is present, the command will fail if the loaded repository has no source type; otherwise, translation of each ignore file is attempted. Pattern lines it can’t translate get commented out; interactively, these are reported along with useful error messages.

After this, all ignore-pattern files are renamed to whatever is appropriate for the preferred type - e.g. .gitignore for git, .hgignore for hg, etc.

If --defaults is present, the command attempts to prepend default patterns for the preferred VCS to all ignore files. If no ignore file is created by the first commit, it will be modified to create one containing the defaults. This command will error out when the VCS type selected by prefer has no default ignore patterns (git and hg, in particular). It will also error out when it knows the import tool has already set default patterns.

Results of this command should be reviewed by a human. The translation rules may be leaky in unusual cases.

All Q bits are cleared, then the Q bit of each modified commit or blob is set.

11.3. Reference lifting

This group of commands is meant for fixing up references in commits that are in the format of older version-control systems. The general workflow is this: first, go over the comment history and change all old-fashioned commit references into machine-parseable reference cookies. Then, automatically turn the machine-parseable cookies into action stamps. The point of dividing the process this way is that the first part is hard for a machine to get right, while the second part is prone to errors when a human does it.

Often Subversion references will be in the form 'r' followed by a string of digits referring to a Subversion commit number. But not always; humans come up with lots of ambiguous ways to write these. CVS commit references are even harder to spot mechanically, as they’re just groups of digits separated by dots with no identifying prefix.

A Subversion cookie is a comment substring of the form ‘[[SVN:ddddd]]’ (example: ‘[[SVN:2355]]’) with the revision read directly via the Subversion exporter, deduced from git-svn metadata, or matching a $Revision$ header embedded in blob data for the filename.

A CVS cookie is a comment substring of the form ‘[[CVS:filename:revision]]’ or ‘[[CVS:revision]]’ (example: ‘[[CVS:src/README:1.23]]’). The filename for the revision-only form is deduced if the commit has exactly one fileop with a path part.

A mark cookie is of the form ‘[[:dddd]]’ with d being decimal digits, and is simply a reference to the specified mark. You may want to hand-patch this in when one of the previous forms is inconvenient.
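
The three cookie forms just described are regular enough to be recognized mechanically. The Python sketch below shows one way to find them in a comment; the regular expressions and the function name find_cookies are illustrative assumptions (for example, CVS revisions are assumed to be dot-separated groups of digits).

```python
import re

SVN = re.compile(r"\[\[SVN:(\d+)\]\]")
CVS = re.compile(r"\[\[CVS:(?:([^:\]]+):)?(\d+(?:\.\d+)+)\]\]")
MARK = re.compile(r"\[\[(:\d+)\]\]")

def find_cookies(comment):
    """Return (kind, payload) tuples, grouped by kind: SVN, then CVS, then mark."""
    found = [("SVN", m.group(1)) for m in SVN.finditer(comment)]
    # The CVS payload is (filename or None, revision).
    found += [("CVS", (m.group(1), m.group(2))) for m in CVS.finditer(comment)]
    found += [("MARK", m.group(1)) for m in MARK.finditer(comment)]
    return found
```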

An action stamp is an RFC3339 timestamp, followed by a ‘!’, followed by an author email address (author is preferred rather than committer because that timestamp is not changed when a patch is replayed on to a branch, but the code to make a stamp for a commit will fall back to the committer if no author field is present). It attempts to refer to a commit without being VCS-specific. Thus, instead of “commit 304a53c2” or “r2355”, “2011-10-25T15:11:09Z!fred@foonly.com”.
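
For readers who want to generate or pick apart action stamps in their own tooling, here is a minimal Python sketch. The function names are hypothetical, and the sketch assumes stamps are always rendered in UTC with a trailing Z and one-second resolution, as in the example above.

```python
from datetime import datetime, timezone

def action_stamp(when: datetime, email: str) -> str:
    """Format a timezone-aware datetime and an email as TIMESTAMP!EMAIL."""
    utc = when.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email

def parse_stamp(stamp: str):
    """Split an action stamp into (aware UTC datetime, email)."""
    ts, sep, email = stamp.partition("!")
    if not sep:
        raise ValueError("not an action stamp")
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return when, email
```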

The following git aliases allow git to work directly with action stamps. Append them to your ~/.gitconfig; if you already have an [alias] section, leave off the first line.

[alias]
	# git stamp <commit-ish> - print a reposurgeon-style action stamp
	stamp = show -s --format='%cI!%ce'

	# git scommit <stamp> <rev-list-args> - list most recent commit that matches <stamp>.
	# Must also specify a branch to search or --all, after these arguments.
	scommit = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d -1\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git scommits <stamp> <rev-list-args> - as above, but list all matching commits.
	scommits = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d --after $d\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git smaster <stamp> - list most recent commit on master that matches <stamp>.
	smaster = "!f(){ git scommit \"$1\" master --first-parent; }; f"
	smasters = "!f(){ git scommits \"$1\" master --first-parent; }; f"

	# git shs <stamp> - show the commits on master that match <stamp>.
	shs = "!f(){ stamp=$(git smasters $1); shift; git show ${stamp:?not found} $*; }; f"

	# git slog <stamp> <log-args> - start git log at <stamp> on master
	slog = "!f(){ stamp=$(git smaster $1); shift; git log ${stamp:?not found} $*; }; f"

	# git sco <stamp> - check out most recent commit on master that matches <stamp>.
	sco = "!f(){ stamp=$(git smaster $1); shift; git checkout ${stamp:?not found} $*; }; f"

There is a rare case in which an action stamp will not refer uniquely to one commit. It is theoretically possible that the same author might check in revisions on different branches within the one-second resolution of the timestamps in a fast-import stream. There is nothing to be done about this; tools using action stamps need to be aware of the possibility and throw a warning when it occurs.

In order to support reference lifting, reposurgeon internally builds a legacy-reference map that associates revision identifiers in older version-control systems with commits. The contents of this map come from three places: (1) cvs2svn:rev properties if the repository was read from a Subversion dump stream, (2) $Id$ and $Revision$ headers in repository files, and (3) the .git/cvs-revisions created by ‘git cvsimport’.

The detailed sequence for lifting possible references is this: first, find possible CVS and Subversion references with the =N visibility set; then replace them with equivalent delimited cookies; then run stampify to turn the cookies into action stamps (using the information in the legacy-reference map) without having to do the lookup by hand.

[SELECTION] stampify

Transform commit-reference cookies into action stamps. You can specify a selection set of commits to be operated on; the default is all commits.

This command expects to find cookies consisting of the leading string '[[', followed by a VCS identifier (e.g. SVN, CVS, GIT) followed by VCS-dependent information, followed by ']]'. An action stamp pointing at the corresponding commit is substituted when possible.

Enables writing of the legacy-reference map when the repo is written or rebuilt.

After running this command, it is good practice to do "lint -u" to check for stamp collisions, and if necessary a "timequake" to fix them up.

Sets Q bits: true if a commit’s comment was modified by this command, false on all other events.

It is not guaranteed that every such reference will be resolved, or even that any at all will be. Normally all references in history from a Subversion repository will resolve, but CVS references are less likely to be resolvable.

legacy {read [<INFILE] | write [>OUTFILE]}

Apply or list legacy-reference information. Does not take a selection set. The 'read' variant reads from standard input or a <-redirected filename; the 'write' variant writes to standard output or a >-redirected filename.

A legacy-reference file maps reference cookies to (committer, commit-date, sequence-number) triplets; these in turn (should) uniquely identify a commit. The format is two whitespace-separated fields: the cookie followed by an action stamp identifying the commit.

It should not normally be necessary to use this command. The legacy map is automatically preserved through repository reads and rebuilds, being stored in the file legacy-map under the repository subdirectory.

11.4. Changelogs

CVS, Subversion, Mercurial, and Fossil do not have separated notions of committer and author for changesets; when lifted to a VCS that does, like git or bzr/brz, their one author field is used for both.

However, if the project used the FSF ChangeLog convention, many changesets will include a ChangeLog modification listing an author for the commit. In the common case that the changeset was derived from a patch and committed by a project maintainer, but the ChangeLog entry names the actual author, this information can be recovered.

[SELECTION] changelogs [BASENAME-PATTERN]

Mine ChangeLog files for authorship data.

Takes a selection set. If no set is specified, process all changelogs. An optional following argument is a pattern expression to match the basename of files that should be treated as changelogs; the default is "/ChangeLog$/". The match is unanchored. See "help regexp" for more information about regular expressions.

This command assumes that changelogs are in the format used by FSF projects: entry header lines begin with YYYY-MM-DD and are followed by a fullname/address.

When a ChangeLog file modification is found in a clique, the entry header at or before the section changed since its last revision is parsed and the address is inserted as the commit author. This is useful in converting CVS and Subversion repositories that don’t have any notion of author separate from committer but which use the FSF ChangeLog convention.

If the entry header contains an email address but no name, a name will be filled in if possible by looking for the address in author map entries.

In accordance with FSF policy for ChangeLogs, any date in an attribution header is discarded and the committer date is used. However, if the name is an author-map alias with an associated timezone, that zone is used.

Sets Q bits: true if the event is a commit with authorship modified by this command, false otherwise.

The Co-Author convention described in the Linux kernel’s co-author message conventions is observed: If an attribution header is followed by a whitespace-led line containing only a valid email address, that name becomes the payload of a "Co-Author" header that is appended to the change comment for the containing commit.

The command reports statistics on how many commits were altered.
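
The entry-header format assumed by this command can be illustrated with a short Python sketch. The regular expression and the name parse_entry_header are hypothetical; the sketch only shows the shape of an FSF-style header line and the fact that its date is discarded.

```python
import re

# YYYY-MM-DD, then a full name (possibly empty), then <address>
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.*?)\s*<([^>]+)>\s*$")

def parse_entry_header(line):
    """Return (name, email) for an FSF-style entry header, or None.

    The date is matched but deliberately not returned, since the
    committer date is what ends up being used.
    """
    m = HEADER.match(line)
    if m is None:
        return None
    return m.group(2), m.group(3)
```

An empty name in the result corresponds to the case where a name has to be looked up from the address in the author map.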

11.5. Clique coalescence

When lifting a history from a version-control system that lacks changesets, it is useful to have a way to recognize cliques of per-file changes that ought to be grouped into changesets.

You won’t need this for CVS because cvs-fast-export does clique coalescence itself.

[SELECTION] coalesce [--changelog] [--debug] [TIMEFUZZ]

Scan the selection set (defaulting to all) for runs of commits with identical comments close to each other in time (this is a common form of scar tissue in repository up-conversions from older file-oriented version-control systems, notably CVS). Merge these cliques by pushing their fileops and tags up to the last commit, in order.

The optional argument, if present, is a maximum time separation in seconds; the default is 90 seconds.

The default selection set for this command is "=C", all commits. Occasionally you may want to restrict it, for example to avoid coalescing unrelated cliques of "empty log message" commits from CVS lifts.

With the --changelog option, any commit with a comment containing the string 'empty log message' (such as is generated by CVS) and containing exactly one file operation modifying a path ending in 'ChangeLog' is treated specially. Such ChangeLog commits are considered to match any commit before them by content, and will coalesce with it if the committer matches and the commit separation is small enough. This option handles a convention used by Free Software Foundation projects.

With the --debug option, show messages about mismatches.

Sets Q bits: true on commits that result from coalescence, false otherwise.
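
Clique detection of the kind described above can be sketched in a few lines of Python. The function name and data layout are hypothetical. Comparing each commit only with its immediate predecessor, and requiring the committer to match, are assumptions of this sketch rather than documented behaviour.

```python
def find_cliques(commits, timefuzz=90):
    """Group adjacent commits that look like one changeset.

    commits: list of dicts with 'comment', 'committer' and 'time'
    (integer seconds), in repository order.
    Returns lists of indices; only groups of two or more are returned.
    """
    cliques, current = [], []
    for i, c in enumerate(commits):
        if current:
            prev = commits[current[-1]]
            same = (c["comment"] == prev["comment"]
                    and c["committer"] == prev["committer"]
                    and abs(c["time"] - prev["time"]) <= timefuzz)
            if not same:
                cliques.append(current)   # close the run in progress
                current = []
        current.append(i)
    if current:
        cliques.append(current)
    return [q for q in cliques if len(q) > 1]
```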

12. Control Options

The following options change reposurgeon’s behavior:

asciidoc

Dump help items using asciidoc definition markup.

canonicalize

If set, import stream reads and msgin will canonicalize comments by replacing CR-LF with LF, stripping leading and trailing whitespace, and then appending a LF. This behavior inverts if the crlf option is on - LF is replaced with CR-LF and CR-LF is appended.

crlf

If set, expect CR-LF line endings on text input and emit them on output. Comment canonicalization will map LF to CR-LF.

compress

Use compression for on-disk copies of blobs. Accepts an increase in repository read and write time in order to reduce the amount of disk space required while editing; this may be useful for large repositories. No effect if the edit input was a dump stream; in that case, reposurgeon doesn’t make on-disk blob copies at all (it points into sections of the input stream instead).

echo

Echo commands before executing them. Setting this in test scripts may make the output easier to read.

experimental

This flag is reserved for developer use. If you set it, it could do anything up to and including making demons fly out of your nose.

fakeuser

Fake the ID of the invoking user. Use in regression-test loads.

interactive

Enable interactive responses even when not on a tty.

materialize

Force creation of content blobs on disk when reading a stream file, even when it is randomly accessible and the metadata could point at extents in the file. Use in regression-test loads to exercise handling of materialized blobs.

progress

Enable fancy progress messages even when not on a tty.

quiet

Suppress time-varying parts of reports.

relax

Continue script execution on error, do not bail out.

serial

Disable parallelism in code. Use for generating test loads.

Most options are described in conjunction with the specific operations that they modify.
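
As one concrete example, the effect of the canonicalize option (and its interaction with crlf) can be modelled in Python. This is a sketch under the assumption that line endings are normalized to LF first and converted to CR-LF afterwards when crlf is on; the function name is hypothetical.

```python
def canonicalize(comment: str, crlf: bool = False) -> str:
    """Normalize line endings and surrounding whitespace of a comment."""
    # Replace CR-LF with LF, then strip leading and trailing whitespace.
    text = comment.replace("\r\n", "\n").strip()
    if crlf:
        # Inverted behaviour: LF becomes CR-LF and a CR-LF is appended.
        return text.replace("\n", "\r\n") + "\r\n"
    return text + "\n"
```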

Here are the commands to manipulate them. None of these take a selection set:

set {flag[s] [asciidoc|canonicalize|crlf|compress|echo|experimental|fakeuser|interactive|materialize|progress|quiet|relax|serial]+ | logfile [PATH] | readlimit [limit] | desclimit [limit]}

"set flag" sets one or more (tab-completed) options to control reposurgeon’s behavior. With no arguments, displays the state of all flags. Do "help options" to see the available options.

"set logfile": Error, warning, and diagnostic messages are normally emitted to standard error. This command, with a nonempty PATH argument, directs them to the specified file instead. The PATH may be a bare token or a double-quoted string. Without an argument, reports what logfile is set.

"set readlimit" sets a maximum number of commits to read from a stream. If the limit is reached before EOF it will be logged. Mainly useful for benchmarking. Without arguments, report the read limit; 0 means there is none.

"set desclimit" sets the description-line length limit used by the =L selector. It defaults to 50, which is the limit recommended in the Git documentation. This limit is settable because for older project histories it may be so restrictive as to require lots of noisy changes. The limit for later lines is always 72.

clear {flag[s] [asciidoc|canonicalize|crlf|compress|echo|experimental|fakeuser|interactive|materialize|progress|quiet|relax|serial]+ | logfile | readlimit [limit]}

"clear flag[s]" clears (tab-completed) boolean options to control reposurgeon’s behavior. With no arguments, displays the state of all flags. Do "help options" to see the available options.

"clear logfile" redirects logging output to the default, standard error.

"clear readlimit" removes any readlimit that has been set.

13. Scripting and debugging support

13.1. Variables, macros, and scripts

There are two different ways to package command sequences for re-execution: scripts and macros. Both are invoked by the same "do" command. Scripts are the facility you can use to build and store conversion recipes as you develop them.

Occasionally you will need to issue a large number of complex surgical commands of very similar form, and it’s convenient to be able to package that form as a text template so you don’t need to do a lot of error-prone typing. For those occasions, reposurgeon supports simple forms of named variables and macro expansion.

define [NAME [TEXT]]

Define a macro. The first whitespace-separated token is the name; the remainder of the line is the body, unless it is '{', which begins a multi-line macro terminated by a line beginning with '}'.

A later 'do' call can invoke this macro.

'define' by itself without a name or body produces a macro list.

do NAME [ARG…​]

Takes a NAME and optional following arguments. NAME and arguments may be bare tokens or double-quoted strings, with the quotes discarded before interpretation.

First, try to expand and perform a macro. The first argument is the name of the macro to be called; remaining arguments replace %{1}, %{2}…​ in the macro definition. Arguments may contain whitespace if they are string-quoted; string quotes are stripped. Macros can call macros to arbitrary depth.

If the macro expansion does not itself begin with a selection set, whatever set was specified before the 'do' keyword is available to the command generated by the expansion.

If no macro named NAME exists, assume NAME is a filename and execute it as a script, reading each line from the file and executing it as a command.

During execution of the script, the script name replaces the string "$0", and the optional following arguments (if any) replace the strings "$1", "$2" …​ "$n" in the script text. This is done before tokenization, so the "$1" in a string like "foo$1bar" will be expanded. Additionally, "$$" is expanded to the current process ID (which may be useful for scripts that use tempfiles).

(The Unix shell syntax ${n} will not expand into a script argument. Don’t confuse it with ${n} used for regular expression match group references.)
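The script-argument substitution described above happens on the raw text of each line, before tokenization. The following sketch models that rule; the helper name expand_script_line is hypothetical, and leaving references beyond the supplied arguments unexpanded is an assumption, since the text does not say what happens in that case.

```python
import os
import re

def expand_script_line(line, script_name, args):
    """Substitute $0, $1..$n and $$ in one raw script line."""
    def replace(match):
        token = match.group(1)
        if token == "$":
            # "$$" expands to the current process ID.
            return str(os.getpid())
        index = int(token)
        if index == 0:
            return script_name
        if index <= len(args):
            return args[index - 1]
        # Assumption: a reference past the supplied arguments is left as-is.
        return match.group(0)
    # One left-to-right pass over the raw text, so substituted text is
    # never rescanned and "${n}" (brace form) is never touched.
    return re.sub(r"\$(\$|\d+)", replace, line)
```

Because the pass runs before tokenization, "foo$1bar" with the argument "X" becomes "fooXbar", while the brace form "${1}" passes through unchanged.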

Within scripts (and only within scripts) reposurgeon accepts a slightly extended syntax: First, a backslash ending a line signals that the command continues on the next line. Any number of consecutive lines thus escaped are concatenated, without the ending backslashes, prior to evaluation. Second, a command that takes an input filename argument can instead take literal data using the syntax of a shell here-document. That is: if the "<filename" is replaced by "<<EOF", all following lines in the script up to a terminating line consisting only of "EOF" will be read, placed in a temporary file, and that file fed to the command and afterwards deleted. "EOF" may be replaced by any string. Backslashes have no special meaning while reading a here-document.

Any script line beginning with a "#" is ignored.

In scripts, any command that expects data to be presented on standard input also accepts a here-document, using the shell syntax for here-documents with a leading "<<". There are two here-documents in the quick-start example.
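The extended script syntax (backslash continuation, here-documents, and "#" comment lines) can be pictured as a preprocessing pass over the script's lines. This is a minimal model under stated assumptions, not the real implementation: the function name preprocess is hypothetical, the here-document is returned as a string rather than written to a temporary file, and the terminator is assumed to be a single whitespace-free token.

```python
import re

def preprocess(lines):
    """Return (command, heredoc_body_or_None) pairs from raw script lines."""
    commands = []
    it = iter(lines)
    for line in it:
        # Join backslash-continued lines, dropping the trailing backslashes.
        while line.endswith("\\"):
            line = line[:-1] + next(it, "")
        # Lines beginning with "#" are ignored.
        if line.startswith("#"):
            continue
        body = None
        match = re.search(r"<<(\S+)", line)
        if match:
            # Collect the here-document verbatim up to the terminator line;
            # backslashes have no special meaning here.
            collected = []
            for doc_line in it:
                if doc_line == match.group(1):
                    break
                collected.append(doc_line)
            body = "\n".join(collected) + "\n"
        commands.append((line, body))
    return commands
```

Feeding it the lines "authors \", "read <<EOF", some data lines and "EOF" yields one command, "authors read <<EOF", paired with the collected data.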

Scripts may call other scripts to arbitrary depth.

When running a script interactively, you can abort it by typing Ctrl-C and return to the top-level prompt. The abort flag is checked after each script line is executed.

SELECTION assign [--singleton] [NAME]

Compute a leading selection set and assign it to a symbolic name, which must follow the assign keyword. It is an error to assign to a name that is already assigned, or to any existing branch name. Assignments may be cleared by some sequence mutations (though not by ordinary deletion); you will see a warning when this occurs.

With no selection set and no argument, list all assignments. This version accepts output redirection.

If the option --singleton is given, the assignment will throw an error if the selection set is not a singleton.

Use this to optimize out location and selection computations that would otherwise be performed repeatedly, e.g. in macro calls.

Example:

# Assign to the name "cvsjunk" the selection set of all commits with a
# boilerplate CVS empty log message in the comment.
/empty log message/ assign cvsjunk

unassign NAME

Unassign a symbolic name. Throws an error if the name is not assigned. Tab-completes on the list of defined names.

undefine MACRO-NAME

Undefine the macro named in this command’s first argument.

Here are some more advanced examples of scripting:

define lastchange {
@max(=B & [/ChangeLog/] & /%{0}/B)? list
}

List the last commit that refers to a ChangeLog file containing a specified string. (The trick here is that ? extends the singleton set consisting of the last eligible ChangeLog blob to its set of referring commits, and list only notices the commits.)

index >index.txt
shell <index.txt awk '/refs\/tags/ {print $4}' | sort | uniq | while read t; do echo "tag $(basename "$t") rename $(basename "$t" | sed -e 's/sample/example/')"; done >renames.script
do renames.script

Mass-rename tags, replacing "sample" on the basename with "example". Illustrates a general technique of generating reposurgeon commands via shell that you then execute as a script with the "do" command. Enabling this technique is the reason as many commands as possible support redirects.

13.2. Housekeeping

gc [GOGC] [>OUTFILE]

Trigger a garbage collection. Scavenges and removes all blob events that no longer have references, e.g. as a result of delete operations on repositories. This is followed by a Go-runtime garbage collection.

The optional argument, if present, is passed as a SetGCPercent call to the Go runtime. The initial value is 100; setting it lower causes more frequent garbage collection and may reduce maximum working set, while setting it higher causes less frequent garbage collection and will raise maximum working set.

The current GC percentage (after setting it, if an argument was given) is reported.

13.3. Diagnostics

The destination of diagnostic logging is set by "set logfile".

log [[+-]LOG-CLASS]…​

Without an argument, list all log message classes, prepending a + if that class is enabled and a - if not.

Otherwise, it expects a space-separated list of "<+ or -><log message class>" entries, and enables (with + or no prefix) or disables (with -) the corresponding log message class. The special keyword "all" can be used to affect all the classes at the same time.

For instance, "log -all shout +warn" will disable all classes except "shout" and "warn", which is the default setting. "log +all -svnparse" would enable logging everything but messages from the svn parser.

A list of available message classes follows. Most of those above "warn" level are only of interest to developers; consult the source code to learn more.

shout
warn
baton
tagfix
topology
properties
extract
filemap
ancestry
delete
svnparse
emailin
shuffle
commands
unite
lexer
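The enable/disable rules for the "log" command can be modeled as set operations over the class list above. This sketch is illustrative only; the function name apply_log_args is hypothetical, and rejecting unknown class names with an exception is an assumption.

```python
LOG_CLASSES = [
    "shout", "warn", "baton", "tagfix", "topology", "properties",
    "extract", "filemap", "ancestry", "delete", "svnparse", "emailin",
    "shuffle", "commands", "unite", "lexer",
]

def apply_log_args(enabled, args):
    """Apply a string such as "-all shout +warn" to a set of enabled classes."""
    result = set(enabled)
    for token in args.split():
        # A leading "-" disables; "+" or no prefix enables.
        enable = not token.startswith("-")
        name = token.lstrip("+-")
        if name == "all":
            targets = set(LOG_CLASSES)
        elif name in LOG_CLASSES:
            targets = {name}
        else:
            # Assumption: unknown class names are an error.
            raise ValueError("unknown log class: " + name)
        result = result | targets if enable else result - targets
    return result
```

Entries are processed left to right, which is why "-all shout +warn" ends with exactly "shout" and "warn" enabled regardless of the starting state.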

13.4. Debugging

A few commands have been implemented primarily for debugging and regression-testing purposes, but may be useful in unusual circumstances.

SELECTION resolve

Does nothing but resolve a selection-set expression and report the resulting event-number set to standard output. The remainder of the line after the command, if any, is used as a label for the output.

Implemented mainly for regression testing, but may be useful for exploring the selection-set language.

The parenthesized literal produced by this command is valid selection-set syntax; it can be pasted into a script for re-use.

version [EXPECT]

With no argument, display the reposurgeon version and supported VCSes. With argument, declare the major version (single digit) or full version (major.minor) under which the enclosing script was developed. The program will error out if the major version has changed (which means the surgical language is not backwards compatible).

It is good practice to start your lift script with a version requirement, especially if you are going to archive it for later reference.

[SELECTION] hash [--bare] [--tree] [>OUTFILE]

Report Git event hashes. This command simulates Git hash generation.

Takes a selection set, defaulting to all. For each eligible event in the set, returns its index and the same hash that Git would generate for its representation of the event. Eligible events are blobs and commits.

With the option --bare, omit the event number; list only the hash.

With the option --tree, generate a tree hash for the specified commit rather than the commit hash. This option is not expected to be useful for anything but verifying the hash code itself.

[SELECTION] strip {--reduce [--fileops]|--blobs|--obscure}

This is intended for producing reduced test cases from large repositories.

With the modifier "--blobs", replace the blobs in the selected repository with self-identifying stubs. This will drastically reduce the size of the repository while preserving its structure. This is the default mode if no option is given.

With the modifier "--reduce", perform a topological reduction that throws out uninteresting commits. If a commit has all file modifications (no deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. With the modifier "--fileops", all file operations (even deletions or copies or renames) are considered boring, which may be useful if you want to examine a repository’s branching/tagging history. To be fully boring, the commit must also not be referred to by any tag or reset. Interesting commits are not boring, or have a non-boring parent or non-boring child.

With the modifier --obscure, map all file paths to nonce strings, preserving directory structure and distinctness. This can be used in extreme cases where even the file paths might unacceptably leak information about the repository content.

If more than one strip mode is specified, blob stubbing is performed first, then reduction, then path obscuration.

A selection set is effective only with the "--blobs" and "--obscure" options, defaulting to all blobs or commits respectively. The "--reduce" mode always acts on the entire repository.

This command sets Q bits on each modified object.
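The "--reduce" rule above is easier to follow as a predicate. The sketch below models it on a toy commit graph; the Commit class, its field names, and the use of fast-import operation letters ("M", "D", "C", "R") to label file operations are all assumptions made for illustration, not reposurgeon's internal representation.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Commit:
    name: str
    ops: list                      # file operation kinds, e.g. ["M", "D"]
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    referenced: bool = False       # pointed at by a tag or reset

def link(*commits):
    """Chain the commits together as a linear history."""
    for parent, child in zip(commits, commits[1:]):
        parent.children.append(child)
        child.parents.append(parent)

def is_boring(commit, fileops=False):
    # Only modifications count as plain, unless --fileops is in effect.
    plain = fileops or all(op == "M" for op in commit.ops)
    return (plain and len(commit.parents) == 1
            and len(commit.children) == 1 and not commit.referenced)

def is_interesting(commit, fileops=False):
    # Interesting: not boring, or adjacent to a non-boring commit.
    if not is_boring(commit, fileops):
        return True
    neighbors = commit.parents + commit.children
    return any(not is_boring(n, fileops) for n in neighbors)
```

On a linear chain of five modification-only commits a through e, the root and tip are not boring (they lack a parent or a child), so their neighbors b and d stay interesting too; only c would be thrown out.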

13.5. Profiling

The "set readlimit" option can be used to limit processing to head segments of an input stream file; this can be useful for profiling and fault isolation.

Profiling commands are:

show {elapsed|memory|sizeof|when TIMESTAMP|vcs [NAME…​]} [>OUTFILE]

The "show" command generates reports that do not require a repository to be loaded.

With "elapsed", display elapsed time since start.

With "memory", report memory usage. Runs a garbage-collect before reporting so the figure will better reflect storage currently held in loaded repositories; this will not affect the reported high-water mark.

With "sizeof", report byte-extent sizes for various reposurgeon internal types. Note that these sizes are stride lengths, as in C’s sizeof(); this means that for structs they will include whatever trailing padding is required for instances in an array of the structs. This command is for developer use when optimizing structure packing to reduce memory use. It is probably not of interest to ordinary reposurgeon users.

With "vcs", show what reposurgeon knows about version-control systems. Without an argument, list all known ones. With arguments, list details for a specified one.

With "when", try to interpret the input line as a timestamp and interconvert between Git and RFC3339 format; this can be useful when eyeballing export streams. Git timestamps (integer Unix time plus TZ) are supported; so are bare numbers, which are interpreted as seconds since the Unix epoch (as if they were Git timestamps with a +0000 time offset). Also accepts several variants of RFC1123Z dates, including Git log format.
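To make the Git/RFC3339 correspondence concrete, here is a small sketch of the interconversion. It is not the code behind "show when": the function names are hypothetical, the RFC1123Z variants are not handled, and the exact output formatting of the real command may differ.

```python
from datetime import datetime, timedelta, timezone

def git_to_rfc3339(stamp):
    """Convert "SECONDS [+-]HHMM" (or a bare number) to RFC3339."""
    parts = stamp.split()
    seconds = int(parts[0])
    # A bare number is treated as if it carried a +0000 offset.
    offset = parts[1] if len(parts) > 1 else "+0000"
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return datetime.fromtimestamp(seconds, timezone(sign * delta)).isoformat()

def rfc3339_to_git(text):
    """Convert an RFC3339 timestamp with an explicit offset to Git form."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
    moment = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return f"{int(moment.timestamp())} {moment.strftime('%z')}"
```

The integer part of a Git timestamp is always UTC seconds; the offset only affects how the moment is displayed, which is why the two forms round-trip.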

checkpoint [MARK-NAME] [>OUTFILE]

Report phase-timing results from analysis of the current repository.

If the command has a following argument, this creates a new, named time mark that will be visible in a later report; this may be useful during long-running conversion recipes.

profile [live PORT | start SUBJECT FILENAME | save SUBJECT [FILENAME] | bench]

Manages data collection for profiling.

For a list of available profile subjects, call the profile command without arguments. The list is in part extracted from the Go runtime and is subject to change.

For documentation on the Go profiler used by the live and start modes, see

Profiling is enabled by default, but viewing the profile data requires either starting the HTTP server with "profile live", or saving it to a file with "profile save". When no arguments are given it prints out the available types of profiles.

With "live", starts an HTTP server on the specified port which serves the profiling data. If no port is specified, it defaults to port 1234. Use in combination with pprof, with a command like

go tool pprof -http=":8080" http://localhost:1234/debug/pprof/<subject>

With "start", starts the named profiler, and tells it to save to the named file, which will be overwritten. Currently only the cpu and trace profilers require you to explicitly start them; all the others start automatically. For the others, the filename is stored and used to automatically save the profile before reposurgeon exits.

With "save", saves the data from the named profiler to the named file, which will be overwritten. If no filename is specified, this will fall back to the filename previously stored by 'profile start'.

With "bench", report elapsed time and memory usage in the format expected by repobench. Note: this command is not intended for interactive use or to be used by scripts other than repobench. The output format may change as repobench does. Runs a garbage-collect before reporting so the figure will better reflect storage currently held in loaded repositories; this will not affect the reported high-water mark.

exit [>OUTFILE]

Exit cleanly, emitting a goodbye message including elapsed time.

14. Working with Subversion (svn)

The transaction model of Subversion is nothing like that of the DVCSes (distributed version-control systems) that followed it. Two of the more obvious differences are around tags and branches. These differences occasionally lead to conversion problems.

A Subversion tag isn’t an annotation attached to a commit. The Subversion data model is that a history is a sequence of surgical operations on a tree; there are no annotation tags as such, a tag is just another branch of the tree. Accordingly a Subversion tag is a copy of the state of an entire branch at a particular revision. This can be losslessly translated to an annotation only if no additional commits are added to the tag branch after the copy. But nothing prevents this! Reposurgeon tries to do the right thing, creating a DVCS-style annotated tag using the metadata of the copy operation when it can and otherwise preserving the changes as commits, using a lightweight tag to point at the tip.

There is a subtler problem around branches themselves. In a DVCS, deleting a branch removes it from the repository history entirely, a fact of some significance since repositories are copied around often enough that keeping every discarded experiment forever would eventually drown the live content in superannuated cruft. Subversion repositories, on the other hand, are designed on the assumption that they sit on one server and never move. A Subversion branch is just a directory in the branch namespace; if you delete it, you won’t see it in following revisions, but if you update to an older one that content will still be there. By default, reposurgeon will delete the corresponding branches as if the deletion was done in a DVCS, keeping only the commits that are also part of other branches' histories, but you can tell it to preserve the branches instead and give them unambiguous names in the refs/deleted namespace.

Bad things can happen when a tag directory is created, copied from, deleted, then recreated from a different source directory. This is a place where the Subversion model of tags clashes destructively with the changeset-DAG model used by git and other DVCSes, especially if the same tag is recreated later! The obvious thing to do when converting this sequence would be to just nuke the tag history from the deletion back to its branch point, but that will cause problems if a copy operation was ever sourced in the deleted branch (and this does happen!).

What reposurgeon does instead is preserve the most recent branch with any given name, so the view back from any branch tip in the repository has the correct content. The cost is that reposurgeon discards the content of any previous branch having that same name; but see the --preserve option of the read command.

The reposurgeon analyzer tries to warn you about pathological cases, and reposurgeon gives you tools for coping with them. Unfortunately, the warnings are (unavoidably) cryptic unless you understand Subversion internals in detail.

There’s another problem around Subversion merges. In a DVCS, a merge normally coalesces two entire branches. Subversion has something close to this in newer versions; it’s called a "sync merge" working on directories (and is expressed as an svn:mergeinfo property of the target directory that names the source). A sync merge of a branch directory into another branch directory behaves like a DVCS merge; reposurgeon picks these up and translates them for you.

The older, more basic Subversion merge is per file and is expressed by per-file svn:mergeinfo properties. These correspond to what in DVCS-land are called "cherry-picks", which just replay a commit from a source branch onto a target branch but do not create cross-branch links.

Sometimes Subversion developers use collections of per-file mergeinfo properties to express partial branch merges. This does not map to the DVCS model at all well, and trying to promote these to full-branch merges by hand is actually dangerous. An excellent essay, Partial git merges — just say no, explores the problem in depth.

The bottom line is that reposurgeon warns about per-file svn:mergeinfo properties and then discards them for good reasons. If you feel an urge to hand-edit in a branch merge based on these, do so with care and check your work.

Three minor issues to watch for:

  1. Superfluous root tags.

  2. Legacy Subversion revision numbers.

  3. File content mismatches due to $-keyword expansion.

More details follow.

14.1. Reading Subversion repositories

Note that the Subversion dump reader only supports versions 1 and 2 of the dump file format, not version 3 with diff-based file changes. This shouldn’t be a problem with normal use of reposurgeon, which calls svnadmin dump in its default mode generating version 2.

If a stream file has gaps in the revision sequence, branch copy references to nonexistent revisions will be patched to refer to the most recent existing revision before the missing one.
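The gap-patching rule above amounts to a "latest revision at or before" lookup. A minimal sketch, assuming the set of revisions present in the stream is available as a sorted list; the function name patch_copy_source is hypothetical.

```python
from bisect import bisect_right

def patch_copy_source(existing_revisions, wanted):
    """Return the revision a copy from `wanted` should actually refer to.

    existing_revisions must be sorted in ascending order.
    """
    # Position just past the last existing revision that is <= wanted.
    pos = bisect_right(existing_revisions, wanted)
    if pos == 0:
        raise ValueError("no revision at or before r%d" % wanted)
    return existing_revisions[pos - 1]
```

For a stream holding revisions 1, 2, 5 and 6, a copy that names the missing revision 4 is redirected to revision 2, while a copy naming revision 5 is left alone.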

Certain optional modifiers on the read command change its behavior when reading Subversion repositories:

--user-ignores

By default reposurgeon tosses out in-tree .gitignore files found in the history because they probably come from git-svn users who checked-in their own .gitignore files. Using this option makes reposurgeon keep the content of these files and merge them with the .gitignore files generated from svn:ignore and svn:global-ignores properties, if any.

--no-automatic-ignores

Do not generate .gitignore files from svn:ignore and svn:global-ignores properties. If --user-ignores is also used then only .gitignore files that were present in the SVN tree will exist in the final repository. If --user-ignores is not used, no .gitignore file at all will survive the conversion.

--preserve

When a branch or tag was deleted in SVN, preserve the history up to deletion in a git ref under refs/deleted/, instead of deleting the branch and only keeping the commits that are also part of the history of other branches. The reference is disambiguated using the base revision number of the dead branch. Also, preserve branch-copy commits auto-generated by cvs2svn that would otherwise be discarded. (Note that the reason --preserve is not the default behavior is because of experience with large old repositories that may have hundreds or even thousands of dead branches. While it is important that content copies from dead branches be resolved correctly, the branches themselves are almost never interesting.)

These modifiers can go anywhere in any order on the command line after the read verb. They must be whitespace-separated.

As stacking up read options can result in a very long read invocation line, it’s useful to know that backslash is accepted as a continuation character.

It is also possible to embed a magic comment in a Subversion stream file to set these options. Prefix a space-separated list of them with the magic comment ‘ # reposurgeon-read-options:’; the leading space is required. This is mainly useful when synthesizing test loads; in particular, a stream file that does not set up a standard trunk/branches/tags directory layout can use this to perform a mapping of all commits onto the master branch that the git importer will accept.

Here are the rules used for mapping subdirectories in a Subversion repository to branches:

  • ‘trunk’ always becomes the master branch. Directories under ‘branches’ become gitspace branches; directories under ‘tags’ become gitspace tags.

  • Each potential tag is checked to see if it has commits on it after the initial creation or copy. If there are such commits, or if the branch creation or copy introduces changes other than the copy, it becomes a branch. If not, it may become a tag in order to preserve the commit metadata. In all cases, the name of any created tag or branch is the basename of the directory.

  • Files in the top-level directory are assigned to a synthetic branch named ‘unbranched’. If the Subversion repository has a branch named ‘unbranched’ the name ‘unbranched-bis’ is used instead; actually, ‘-bis’ is appended enough times to get to an unused branch name.
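The directory-to-branch rules above can be sketched as a path classifier. This is a deliberately simplified model: the function names are hypothetical, branch and tag directories are assumed to sit directly under ‘branches’ and ‘tags’, and the check that can turn a modified tag into a branch is not modeled.

```python
def unbranched_name(existing_branches):
    """Pick the synthetic branch name for files in the top-level directory."""
    name = "unbranched"
    # Append "-bis" as many times as needed to reach an unused name.
    while name in existing_branches:
        name += "-bis"
    return name

def map_path(path, unbranched="unbranched"):
    """Classify a Subversion path as a (kind, name) pair."""
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return ("branch", "master")
    if parts[0] in ("branches", "tags") and len(parts) > 1:
        kind = "branch" if parts[0] == "branches" else "tag"
        # The created branch or tag is named after the directory's basename.
        return (kind, parts[1])
    # Anything else lives on the synthetic branch.
    return ("branch", unbranched)
```

A caller would compute unbranched_name() once from the branch names found in the repository and pass the result to map_path() for every path.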

Branch-creation operations with no following commits are usually tagified. However, this is done to preserve comment/committer data entered by users; when reposurgeon can detect that a branch-creation comment was automatically generated (as often happens in cvs2svn conversions) the commit will simply be discarded so as not to create clutter that has to be manually removed by the operator. (That discard action is prevented by the --preserve option.)

Otherwise, each commit that only creates or deletes directories (in particular, copy commits for tags and branches, and commits that only change properties) will be transformed into a tag named after the tag or branch, containing the date/author/comment metadata from the commit.

Subversion branch deletions are turned into deletealls, clearing the fileset of the import-stream branch. When a branch finishes with a deleteall at its tip, the deleteall is transformed into a tag. This rule cleans up after aborted branch renames.

Occasionally (and usually by mistake) a branchy Subversion repository will contain revisions that touch multiple branches. These are handled by partitioning them into multiple import-stream commits, one on each affected branch. The Legacy-ID of such a split commit will have a 1-origin split number separated by a dash - for example, if Subversion revision 2317 touches three branches, the three generated commits will have IDs 2317-1, 2317-2, and 2317-3.
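The Legacy-ID numbering for split commits can be illustrated as follows. This is a sketch only: the function name is hypothetical, the order in which branches receive their split numbers is assumed to be the order given, and a revision touching a single branch is assumed to keep its plain revision number.

```python
def split_legacy_ids(revision, branches):
    """Return (branch, legacy_id) pairs for one Subversion revision."""
    if len(branches) == 1:
        # Assumption: an unsplit commit keeps the bare revision number.
        return [(branches[0], str(revision))]
    # Split numbers are 1-origin and separated from the revision by a dash.
    return [(branch, f"{revision}-{n}")
            for n, branch in enumerate(branches, start=1)]
```

Called with revision 2317 and three branch names, it produces the IDs 2317-1, 2317-2 and 2317-3 from the example above.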

The svn:executable and svn:special properties are translated into permission settings in the input stream; svn:executable becomes 100755 and svn:special becomes 120000 (indicating a symlink; the blob contents will be the path to which the symlink should resolve).
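As a sketch of this property-to-mode translation: the function name is hypothetical, the 100644 fallback is Git's usual mode for an ordinary file rather than something stated above, and giving svn:special precedence when both properties are set is an assumption.

```python
def import_stream_mode(props):
    """Map a file's Subversion properties to an import-stream mode string."""
    if "svn:special" in props:
        # A symlink; the blob content is the link target path.
        return "120000"
    if "svn:executable" in props:
        return "100755"
    # Assumption: everything else is an ordinary non-executable file.
    return "100644"
```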

Any cvs2svn:rev properties generated by cvs2svn are incorporated into the internal map used for reference-lifting, then discarded.

Normally, per-directory svn:ignore properties (and svn:global-ignores properties, e.g. in a site configuration file) become .gitignore files. Actual .gitignore files in a Subversion directory are presumed to have been created by git-svn users separately from native Subversion ignore properties and discarded with a warning. It is up to the user to merge the content of such files into the target repository by hand. But this behavior is changed by the --user-ignores option which disables filtering of in-tree .gitignore files and instead merges them with .gitignore files generated from Subversion properties. On the other hand, the --no-automatic-ignores option discards Subversion svn:ignore and svn:global-ignores properties without translation.

Any .cvsignore files left over from a Subversion repository’s ancient history as a CVS repository are deleted. Instead the (presumably more up-to-date) Subversion ignore properties are translated.

svn:mergeinfo properties are interpreted. Any svn:mergeinfo property on a revision A with a merge source containing all revisions on a branch from the forking point (or the branch start if the histories are independent) up to revision B produces a merge link such that the branch tip at revision B becomes a parent of A. The "svnmerge-integrated" properties produced by Subversion’s svnmerge.py script are handled the same way.

All other Subversion properties are discarded. (This may change in a future release.) The property for which this is most likely to cause semantic problems is svn:eol-style. However, since property-change-only commits get turned into annotated tags, the translated tags will retain information about setting changes.

The sub-second resolution on Subversion commit dates is discarded; Git wants integer timestamps only. Normally Subversion timestamps are rounded down, but when two adjacent timestamps have the same seconds part and the later one is in the top half-second of the interval, the later one is rounded up instead. This does much to reduce collisions while guaranteeing that no timestamp is ever shifted to a non-adjacent second mark.
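The rounding rule is easier to check against a worked model. The sketch below takes commit times as integer microseconds in commit order, which avoids floating-point trouble; the function name is hypothetical, and treating a fraction of exactly half a second as "top half" is an assumption.

```python
MICRO = 1_000_000

def round_commit_times(stamps_us):
    """Round commit times (integer microseconds, in order) to whole seconds."""
    rounded = []
    previous = None
    for stamp in stamps_us:
        seconds, fraction = divmod(stamp, MICRO)
        same_second = previous is not None and previous // MICRO == seconds
        if same_second and fraction >= MICRO // 2:
            # Later of two stamps sharing a second, in the top half-second:
            # round up to the adjacent second mark instead of down.
            seconds += 1
        rounded.append(seconds)
        previous = stamp
    return rounded
```

Two commits at 10.2s and 10.7s come out as 10 and 11 rather than colliding at 10, while commits at 10.1s and 10.3s still collide; a stamp never moves further than the next second mark.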

Because fast-import format cannot represent an empty directory, empty directories in Subversion repositories will be lost in translation.

Subversion local usernames are mapped in the style of git cvsimport; user ‘foo’ becomes ‘foo <foo>’, which is sufficient to pacify git and other systems that require email addresses. You can remap them to real addresses using the authors read command.
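The default attribution, and the effect of a later remapping, can be pictured like this. The function name and the dictionary standing in for an author map are hypothetical; the real map is loaded from a file, and its format is not described in this section.

```python
def attribution(username, author_map=None):
    """Return the attribution string for a Subversion local username."""
    author_map = author_map or {}
    # Fall back to the placeholder form when no real identity is known.
    return author_map.get(username, f"{username} <{username}>")
```

With no map, ‘foo’ yields ‘foo <foo>’; with a map entry for ‘foo’, the mapped name and address are used instead.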

Reading a Subversion stream enables writing of the legacy map as 'legacy-id' passthroughs when the repo is written to a stream file.

Reposurgeon tries hard to silently do the right thing, but there are Subversion edge cases in which it emits warnings because a human may need to intervene and perform fixups by hand. Here are the less obvious messages it may emit:

properties set

reposurgeon has detected a setting of a user-defined property, or of the Subversion property svn:externals. These properties cannot be expressed in an import stream; the user is notified in case this is a showstopper for the conversion or some corrective action is required, but normally this warning can be ignored. This warning is suppressed by the --ignore-properties option.

Detected link from <revision> to <revision> might be dubious

When trying to detect parent links from multiple file copies like what cvs2svn can produce, source revisions of the different copies were not all the same. The link should probably be monitored because it has a non-negligible probability of being slightly wrong. This does not impact the tree contents, only the quality of the history.

14.2. Mid-branch deletions

When a branch A is deleted and a branch B is copied to the name A, the Subversion intent is to replace the contents of branch A with the contents of branch B, keeping the A name. This is a poor man’s merge from before "svn merge" existed. Many Subversion users who formed their habits before svn merge existed still operate this way.

In git terms, this almost corresponds to a merge of A into B followed by a rename of B to A. Branch B continues to exist, however, so we can’t do that in translation. The reposurgeon logic does not try to be clever about this, because "clever" would have rebarbative edge cases; the sequence is translated into a deleteall followed by a commit operation that recreates the B files under corresponding A names. No merge link is created. The commit filling A with a branch copy from B will have B as its first parent, though, so all that would be needed is to create a merge link from the old A before the delete to the commit recreating A.

This case is mentioned here because it is likely to confuse the merge-tracking algorithms used, e.g., by git diff, or if you ever try to merge a branch that forked off the old A to a branch spun off the new (and expect git to know that you do not want to incorporate old A’s changes).

14.3. Multiproject Subversion repositories

By convention, a Subversion repository is supposed to have a regular structure, what we’ll call "standard layout". This has three top-level directories, trunk/ and tags/ and branches/. In standard layout, directories beneath tags are never modified after the copy that creates them.

A repository in standard layout can be converted nearly losslessly into Git and other modern DVCSes, with minor exceptions around deleted and later recreated branches and some Subversion-specific file properties.

But standard layout is only a convention; the Subversion tools don’t actually enforce it. One very simple non-standard layout is a "flat" repository. We’ll discuss these in detail in a later section.

In a more common non-standard layout, Subversion repositories are organized to hold multiple projects, with the root directory containing one subdirectory per project and each project subdirectory having a standard layout underneath it.

We’ll call this kind of layout "conformable" when it is perfectly regular, not containing any of the following exceptions:

  • Top-level project directories that don’t have a standard layout underneath

  • Copy operations on top-level subdirectories.

  • trunk/tags/branches structure more than one level beneath the top level.

These sorts of irregularities do not fit well into reposurgeon’s internal data model. The fallback behaviors in reposurgeon that attempt to handle them are fragile and kludgy. Attempting to read such a stream may confuse or even crash reposurgeon’s reader for Subversion dumps.

Unfortunately, it is rare for a multiproject repository not to have defects of this kind. There’s an easy way to check; feed the stream to repocutter swapcheck. If you get an empty report, the stream is perfectly conformable; otherwise, each report line flags a defect you need to fix or remove. Look at the output of "repocutter help swapcheck" for more details.
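The path shape that "conformable" describes can be sketched in a few lines of Python. This is only an illustration of the layout rule stated above, not the logic repocutter swapcheck actually uses; the function name and the sample paths are hypothetical, and copy operations on top-level directories cannot be detected from paths alone.

```python
STANDARD = {"trunk", "tags", "branches"}

def is_conformable(path):
    """True if path looks like PROJECT or PROJECT/{trunk,tags,branches}[/...]."""
    parts = path.strip("/").split("/")
    if len(parts) == 1:
        # A bare top-level project directory.
        return parts[0] != "" and parts[0] not in STANDARD
    # The second component must be one of the standard-layout directories.
    return parts[0] not in STANDARD and parts[1] in STANDARD

if __name__ == "__main__":
    for p in ("alpha/trunk/src/main.c", "alpha/docs/readme.txt", "alpha/sub/trunk/main.c"):
        print(p, is_conformable(p))
```

A project directory with loose files under it, or with trunk/ one level too deep, fails the check; those are the cases you would need to rename or expunge.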

Conversions of non-conformable multiproject Subversion repositories are messy and tricky. If you work for an organization with a budget, you should seriously consider engaging the author to do yours - that’s likely to save you time and money relative to what you’d pay someone in-house to climb the learning curve.

The general strategy in these cases has three steps:

  1. Transform a stream dump from the repository into a fully-conformable one.

  2. Apply repocutter swapsvn to transform that stream to one with standard layout.

  3. Load the standard-layout stream into reposurgeon and trim out unneeded content inside reposurgeon.

The first step is the tricky one. There are two general tactics you can pursue towards this: (1) expunging badly-structured content, and/or (2) using repocutter pathrename to massage non-conformable directories into conformable ones.

Before you think about doing expunges, be aware that they can be complicated by cross-project copy operations. If you delete a directory that’s ever a source for a copy into conformable structure, havoc will ensue; a very common case of this is copies from obsolete branches to trunk. That’s why it’s generally best to defer trimming unwanted history until you have it inside reposurgeon.

Generally you’re going to do the heavy lifting with repocutter pathrename operations. The author has seen conversions that literally required hundreds of these. You may also need to do repocutter filecopy and skipcopy operations - but here be dragons; by the time you need to do those you should reconsider not hiring the author.

It cannot be emphasized enough that conversions of non-conformable multi-project repositories are difficult; expect your timeframe to completion to be measured not in hours or days but in months. Improving the tools to make these less painful is the reposurgeon project’s main area of ongoing research.

14.3.1. Some caveats about repocutter swapsvn

All project-local tags and branches will be promoted to become tags and branches across the entire repository.

Paths that swapsvn does not recognize as conformable will be passed through unaltered. Don’t try to treat this as a feature; reposurgeon’s fallbacks for dealing with these are crude and fragile.
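To make the promotion and pass-through behavior concrete, here is a minimal Python sketch of one plausible path mapping. It assumes that PROJECT/trunk/... becomes trunk/PROJECT/... and that PROJECT/branches/NAME/... becomes branches/NAME/PROJECT/... (likewise for tags); that mapping is an assumption for illustration, and "repocutter help swapsvn" remains the authority on what the tool really does.

```python
def swap_path(path):
    """Map a multiproject path to a standard-layout path (illustrative only)."""
    parts = path.split("/")
    if len(parts) >= 2 and parts[1] == "trunk":
        # alpha/trunk/x -> trunk/alpha/x
        return "/".join(["trunk", parts[0]] + parts[2:])
    if len(parts) >= 3 and parts[1] in ("tags", "branches"):
        # alpha/branches/rel/x -> branches/rel/alpha/x
        return "/".join([parts[1], parts[2], parts[0]] + parts[3:])
    # Unrecognized shape: passed through unaltered.
    return path

if __name__ == "__main__":
    print(swap_path("alpha/branches/rel-1.0/src/main.c"))
```

Under this mapping, branches named rel-1.0 in two different projects land in the same top-level branches/rel-1.0 directory, which is the repository-wide coalescence described above.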

Subversion mergeinfos, being file- and directory-oriented rather than true branch merges, often can’t be carried over even in a single-project conversion. The global branch coalescence that repocutter swapsvn does multiplies the problem; you may simply have to discard those properties to get a clean conversion.

See the embedded help listed by "repocutter help swapsvn" for some details and caveats.

14.4. Flat repositories

If the repository has no branch structure at all, you might want to filter the stream with "repocutter push trunk" and then individually turn subdirectories into branches using the branchlift command. The pattern for that looks like this:

read <multiproject.svn
branchlift master branches/release-1.0
branchlift master branches/release-2.0
rename path "trunk/(.*)" "${1}"

This will turn two subdirectories into branches named "release-1.0" and "release-2.0", then pop "trunk" off all the paths where it occurs (leaving those commits on the master branch).

This method has the disadvantage that you have to enumerate all branches you want to lift. Still, it may be useful on repositories that consist of one big unbranched file tree not conforming to a standard layout.

If for some reason you want to treat a branchy repository as though it’s flat, using repocutter to rename the "trunk" directory to something else first (like, say, "main") will foil the logic in the reposurgeon stream reader that wants to do branch and tag analysis.

15. Working with CVS (cvs)

When you are converting a CVS repository using reposurgeon, most of the heavy lifting will have been done by the importer - cvs-fast-export. In particular, it coalesces CVS per-file changes into changesets when it detects that they have identical comments and attributions and are close in time, and it converts .cvsignore files to .gitignores.

A CVS repository normally consists of a set of module subdirectories and a CVSROOT directory containing metadata. cvs-fast-export ignores CVSROOT; thus you can run reposurgeon at any level of a directory tree containing CVS master files, and it will try to lift what it can see at and below the current directory it is run from.

If you do this at the top level of the repository directory, your converted repository will have a subdirectory corresponding to each module. This is normally not the way you want to do things, as CVS tags are not likely to be consistent across all modules and thus won’t lift correctly. You probably want to do individual module conversions.

Problems in CVS conversions generally arise from the fact that CVS’s data model doesn’t have real multi-file changesets, which are the fundamental unit of a commit in DVCSes. It can be difficult to fully recover changesets from what are actually large numbers of single-file changes flying in loose formation - in fact, old CVS operator errors can sometimes make it impossible. Bad tools silently propagate such damage forward into your translation. Good tools, like cvs-fast-export and reposurgeon, warn you of problems and help you recover.

Here are the kinds of conversion glitches to watch for:

  1. Failure to coalesce runs of commits with identical attribution and comment text.

  2. Superfluous root tags.

  3. Legacy CVS revision numbers.

  4. File content mismatches due to $-keyword expansion.

  5. "Zombie" files due to failure to track deletion operations.

Details follow.

Glitch #1: This is driven by whether the window defining "close in time" is wide enough. If it’s not, you may detect commit groups with the same committer and comment text that should have been merged into one changeset but were not. You can either clean these up with the ‘coalesce’ command in reposurgeon or run cvs-fast-export by hand with a larger -w option and read in the generated stream.
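The effect of the window width can be seen in a small Python sketch of time-window coalescing. This is a simplified illustration, not the algorithm cvs-fast-export or the reposurgeon coalesce command implements; the tuple layout and the 300-second default are assumptions made for the example.

```python
def coalesce(changes, window=300):
    """Group (timestamp, author, comment, path) tuples into changesets.

    A change joins an existing group when author and comment match and it
    falls within `window` seconds of that group's most recent change.
    """
    groups = []
    open_groups = {}  # (author, comment) -> most recent group for that pair
    for ts, author, comment, path in sorted(changes):
        key = (author, comment)
        group = open_groups.get(key)
        if group is None or ts - group["last"] > window:
            group = {"author": author, "comment": comment, "paths": [], "last": ts}
            groups.append(group)
            open_groups[key] = group
        group["paths"].append(path)
        group["last"] = ts
    return groups

if __name__ == "__main__":
    changes = [(0, "fred", "fix", "a.c"), (100, "fred", "fix", "b.c"), (1000, "fred", "fix", "c.c")]
    print(len(coalesce(changes, window=300)), len(coalesce(changes, window=1000)))
```

With the narrow window the third change is split off into its own changeset; widening the window merges all three, which is the effect a larger -w value is meant to have.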

Glitch #2: A cleanup step that is unique to CVS conversions is deleting root tags - tags which have "-root" as a name suffix and mark the beginning of a branch. CVS uses these for bookkeeping, but later systems don’t need them. They’re just clutter and can be removed.

Glitch #3: It’s also worth paying careful attention to reference-lifting so that you can scrub useless CVS revision numbers out of comments. This is a more pressing issue than it is with Subversion, where changesets map to changesets, and conversions have the option of marking each target changeset with its revision number.

Glitch #4: You can spot content mismatches due to keyword expansion easily. They will produce single-line diffs of lines containing dollar signs surrounding keyword text. Because binary files can be corrupted by keyword expansion, cvs-fast-export behaves like cvs -kb mode and does no keyword expansion of its own.
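If you want to triage such diffs mechanically, one approach is to collapse expanded keywords before comparing lines. The Python sketch below assumes the usual RCS/CVS keyword names; it is a hypothetical helper, not something shipped with reposurgeon or cvs-fast-export.

```python
import re

# Matches plain or expanded keywords such as $Id$ or $Id: foo.c,v 1.4 ... $
KEYWORD = re.compile(
    r"\$(Id|Revision|Date|Author|Header|Source|RCSfile|Locker|State|Name|Log)"
    r"(:[^$\n]*)?\$"
)

def collapse_keywords(line):
    """Reduce every keyword, expanded or not, to its bare $Keyword$ form."""
    return KEYWORD.sub(lambda m: "$" + m.group(1) + "$", line)

def differs_only_by_keywords(a, b):
    """True if two lines differ, but only in keyword expansion text."""
    return a != b and collapse_keywords(a) == collapse_keywords(b)

if __name__ == "__main__":
    print(differs_only_by_keywords("/* $Id$ */", "/* $Id: foo.c,v 1.4 2001/02/03 fred Exp $ */"))
```

Diff line pairs for which differs_only_by_keywords is true can be set aside as expansion noise; anything else deserves a closer look.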

Glitch #5: Manifest mismatches on tags are most likely to occur on files which were deleted in CVS but persist under later tags in the Git conversion. You can bet this is what’s going on if, when you search for the pathname in the CVS repository, you find it in an Attic directory.

These spurious reports happen because CVS does not always retain enough information to track deletions reliably and is somewhat flaky in its handling of "dead"-state revisions. To make your CVS and git repos match perfectly, you may need to add delete fileops to the conversion - or, more likely, move existing ones back along their branches to commits that predate the gitspace tag - using reposurgeon.

Manifest mismatches in the other direction (present in CVS, absent in gitspace) should never occur. If one does, submit a bug report.
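The two directions of mismatch can be told apart with a plain set comparison. This Python sketch assumes you have already produced the two file lists for a given tag by whatever means you prefer; the function and key names are hypothetical.

```python
def compare_manifests(cvs_files, git_files):
    """Classify manifest mismatches for one tag."""
    cvs, git = set(cvs_files), set(git_files)
    return {
        # Present only in the conversion: probably deletions that were not tracked.
        "zombies": sorted(git - cvs),
        # Present only in CVS: should never happen, so worth a bug report.
        "missing": sorted(cvs - git),
    }

if __name__ == "__main__":
    print(compare_manifests(["a.c", "b.c"], ["a.c", "b.c", "old.c"]))
```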

Any other kind of content or manifest mismatch - but especially any on the master branch - is bad news and indicates either a severe repository malformation or a bug in cvs-fast-export (or possibly both). Any such situation should be reported as a bug.

Conversion bugs are disproportionately likely to occur on older branches or tags made with CVS versions from before CVS got commit IDs in 2006 (version 1.12). Often the most efficient remedy is simply to delete junk branches and tags; reposurgeon(1) makes this easy to do.

16. Working with Mercurial (hg)

There is a built-in extractor class to perform extractions from hg repositories.

There are some important caveats about hg. Please read this section carefully before attempting conversion or surgery.

In hg the "branch" of a commit is a name carried in the commit and inherited by any child of it that is created - it behaves like a color on the commit rather than being implied by an ancestry relationship to a named branch tip (as in git) or by a location in a directory layout (as in svn).

When different people add commits to a branch, and then push their results to the same upstream repository, each person’s commits become different commit sequences, each attached to wherever the branch head was at the time they last pulled. So a branch can have multiple heads. The heads themselves do not have names, other than the hex IDs of their tip commits.

Typically someone will then merge the changes from all the heads into following commits on their branch. There is an illuminating diagram of this.

Before importing a Mercurial repository, you should merge-resolve all branches so that each has a single head. The extractor can’t handle multiple-headed branches gracefully. This is because reposurgeon internally uses a data model similar to git’s; enriching that model wouldn’t help much, because no other VCS has an analog of multi-headed branches.
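To see what "multiple heads" means in terms of the commit graph, consider this Python sketch. It treats a head as a commit with no child on the same named branch, ignores closed branches, and uses a hypothetical dictionary representation of the history; it does not call Mercurial.

```python
from collections import defaultdict

def branch_heads(commits):
    """commits maps id -> (branch_name, [parent ids]). Returns branch -> set of head ids."""
    has_child_on_branch = set()
    for cid, (branch, parents) in commits.items():
        for p in parents:
            if commits[p][0] == branch:
                has_child_on_branch.add(p)
    heads = defaultdict(set)
    for cid, (branch, _) in commits.items():
        if cid not in has_child_on_branch:
            heads[branch].add(cid)
    return dict(heads)

def multi_headed(commits):
    """Names of branches that need merging before conversion."""
    return sorted(b for b, h in branch_heads(commits).items() if len(h) > 1)

if __name__ == "__main__":
    history = {"a": ("default", []), "b": ("default", ["a"]), "c": ("default", ["a"])}
    print(multi_headed(history))
```

Adding a merge commit on "default" with parents b and c leaves that branch with one head, which is the state you want before running the extractor.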

In hg branches can have a "closed" status marking that line of development as terminated. This flag is not represented in reposurgeon and can’t be exported to other VCSes.

There is another feature in hg that resembles named branches: bookmarks - symbolic names for branch-tip commits that you assign, which won’t be identical to the name of the branch you’re on unless you set them that way, and which move to the new tip on the branch every time you add one. There’s a detailed description.

By default, bookmarks are ignored in favor of the name of the branch they are on. You can specify explicit handling for bookmarks by setting ‘reposurgeon.bookmarks’ in your .hg/hgrc. Set the value to the prefix that reposurgeon should use for bookmarks.

For example, if your bookmarks represent branches, put this at the bottom of your .hg/hgrc:

[reposurgeon]
bookmarks=heads/

If you do that, it’s your responsibility to ensure that branch names do not conflict with bookmark names. You can add a prefix like ‘bookmarks=heads/feature-’ to disambiguate as necessary.
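A quick way to check for such conflicts ahead of time is sketched here in Python. It assumes that named branches end up under "heads/" and that a bookmark's target name is simply the configured prefix followed by the bookmark name; both are assumptions for illustration, and the function name is hypothetical.

```python
import configparser

def bookmark_refs(hgrc_text, bookmarks, branches):
    """Return (bookmark -> target name, list of bookmarks that clash with branches)."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(hgrc_text)
    prefix = cfg.get("reposurgeon", "bookmarks", fallback=None)
    if prefix is None:
        return {}, []  # default behavior: bookmarks are ignored
    refs = {b: prefix + b for b in bookmarks}
    branch_refs = {"heads/" + name for name in branches}  # assumed branch naming
    clashes = sorted(b for b, ref in refs.items() if ref in branch_refs)
    return refs, clashes

if __name__ == "__main__":
    hgrc = "[reposurgeon]\nbookmarks=heads/\n"
    print(bookmark_refs(hgrc, ["wip", "default"], ["default", "stable"]))
```

With the plain "heads/" prefix a bookmark named "default" collides with the branch of the same name; switching the prefix to "heads/feature-" removes the collision.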

Mercurial tags are name-to-commit mappings stored in a (versioned) dotfile in your repository directory. They are exported to other VCSes as lightweight tags.

The hg extractor does not attempt to recursively handle subrepos. Rather, it will extract the history of the top-level repo, in which .hgsub and .hgsubstate will be treated as regular files. If you wish to translate these into the semantics of your target VCS, you will need to do so with surgical primitives after reading the history into reposurgeon.

17. Tips and Tricks

17.1. Use the Q set to check your work

When you’re experimenting with mass changes based on pattern matching (as in filter, transcode, changelogs, or even msgin) bear in mind that commits and other objects actually modified get their Q bits set. You can use this as a selection with "list" commands to check your work.

17.2. Pushing and popping pathname segments

Because "/" is the default delimiter for regular expressions, it may not be obvious how to write commands that have to manipulate pathnames by segment. Here is how to handle some common cases:

# Push a pathname segment onto filenames
rename path /^(.*)/ "foo/${1}"

# Pop a pathname segment off filenames
rename path :foo/(.*): "${1}"

The second recipe uses the fact that almost any delimiter character will actually do. But note that you can’t use single quote; that is interpreted as an instruction to match a literal string rather than a regexp.

You also can’t use double quotes, as those are stripped off argument tokens before they are passed to the functions that interpret them. Thus, in the following command…​

rename path "foo/(.*)" "${1}"

…​the first argument would be interpreted as a literal path name and not a regexp. However, this would work as probably intended:

rename path ":foo/(.*):" "${1}"

That is, the double quotes will be stripped off at argument parsing time and then the matching colons will cause the first argument to be recognized as a delimited regexp. This will be useful if you need to specify a delimited regexp that contains whitespace; you can string-quote it so the argument parser will pass it as a single token.
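The quoting rules just described can be modeled roughly as follows. This Python sketch is a simplified reading of the text, not reposurgeon's parser; in particular, treating any non-alphanumeric, non-space character as a valid delimiter is an assumption.

```python
def classify(token):
    """Return ("regexp", body) or ("literal", text) for one command argument."""
    # Double quotes only group a token; they are removed before interpretation.
    if len(token) >= 2 and token[0] == token[-1] == '"':
        token = token[1:-1]
    if len(token) >= 2 and token[0] == token[-1]:
        if token[0] == "'":
            # Single quotes request a literal string match.
            return ("literal", token[1:-1])
        if not token[0].isalnum() and not token[0].isspace():
            # Matching punctuation at both ends marks a delimited regexp.
            return ("regexp", token[1:-1])
    return ("literal", token)

if __name__ == "__main__":
    for arg in ('/^(.*)/', ':foo/(.*):', '"foo/(.*)"', '":foo/(.*):"'):
        print(arg, classify(arg))
```

The third case comes out as a literal and the fourth as a regexp, matching the behavior described for the two double-quoted commands.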

17.3. Extracting the history of a single file

You can use a negative selection to extract the history of a single file. Using the reposurgeon repository itself as an example:

read .
clone
delete path --not ci/prepare.sh
rename path :ci/(.*): "${1}"

After these commands, the reposurgeon-clone repository will contain only the history of ci/prepare.sh, with the "ci" directory popped off the filename in the way indicated in the previous tip.

17.4. Use macros to avoid repeating boilerplate

We can encapsulate that technique for stripping off a directory as follows:

define pop rename path :%{0}/(.*): "${1}"

You could then say:

do pop junkdir

Here’s a less trivial example: In CVS repositories of projects that use the GNU ChangeLog convention, a very common pre-conversion artifact is a commit with the comment “*** empty log message ***” that modifies only a ChangeLog entry explaining the commit immediately previous to it. The following

define changelog <%{0}> & /empty log message/ squash --pushback
do changelog 2012-08-14T21:51:35Z
do changelog 2012-08-08T22:52:14Z
do changelog 2012-08-07T04:48:26Z
do changelog 2012-08-08T07:19:09Z
do changelog 2012-07-28T18:40:10Z

is equivalent to the more verbose

<2012-08-14T21:51:35Z> & /empty log message/ squash --pushback
<2012-08-08T22:52:14Z> & /empty log message/ squash --pushback
<2012-08-07T04:48:26Z> & /empty log message/ squash --pushback
<2012-08-08T07:19:09Z> & /empty log message/ squash --pushback
<2012-07-28T18:40:10Z> & /empty log message/ squash --pushback

but you are less likely to make difficult-to-notice errors typing the first version.

(Also note how the text regexp acts as a failsafe against the possibility of typing a wrong date that doesn’t refer to a commit with an empty comment. This was a real-world example from the CVS-to-git conversion of groff.)
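Macro expansion of this kind amounts to textual substitution of numbered arguments. The following Python sketch mimics it for the changelog macro; it is an illustration of the idea only and makes no claim about how reposurgeon implements "define" and "do".

```python
import re

def expand(body, args):
    """Replace each %{N} in a macro body with the Nth argument (zero-based)."""
    return re.sub(r"%\{(\d+)\}", lambda m: args[int(m.group(1))], body)

if __name__ == "__main__":
    body = "<%{0}> & /empty log message/ squash --pushback"
    for stamp in ("2012-08-14T21:51:35Z", "2012-08-08T22:52:14Z"):
        print(expand(body, [stamp]))
```

Note that a Go-style match reference such as ${1} is not touched by this substitution, since only the %{N} form names a macro argument.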

17.5. Handling username collisions

Occasionally when processing a CVS or Subversion repository you will encounter a case where, during two different spans of time, the same user ID was used by two different people for whom you have full name-and-email attributions you want to apply.

As an example, suppose your history has two committers named "fred", one active before March 15 2010 and one after. The name and email fields in their commit attributions all look like "fred <fred>" with variable data following.

There’s no way you can discriminate by time in an authors file. But there’s a different technique that will work. Like this:

1..<2010-03-15T00:00:00Z> & /^fred$/C attribute =C|=A set "Fred Foonly" "fred1@foonly.org"
<2010-03-15T00:00:00Z>..$ & /^fred$/C attribute =C|=A set "Fred J. Muggs" "fred2@foonly.org"
/fred[12]/C list inspect

The third command allows you to check that the previous two produced a sane-looking result.

Part of the trick here is knowing that if you only set name and email fields with an attribute command the date field is left unchanged.

If you needed to handle more than one of these collisions, you might consider a slightly different approach that illustrates how to write and use a multi-line macro:

define splitAttribution {
1..<%{1}> & /^%{0}$/C attribute =C|=A set "%{2}" "%{2}"
<%{1}>..$ & /^%{0}$/C attribute =C|=A set "%{3}" "%{3}"
}
do splitAttribution fred 2010-03-15T00:00:00Z fred1 fred2

This depends on the fact that the script arguments %{0} and %{1} get substituted before the regular-expression compiler interprets {}. You’d follow it by applying an attribution file containing entries for fred1 and fred2.

17.6. Use reposurgeon.el to cut down on repetitive handwork

An Emacs mode for editing msgout dumps ships with the reposurgeon distribution. It includes various useful commands for Gitifying comments and reference-lifting.

17.7. Avoid git-svn lossage with repocutter filtering

Subversion histories that have been touched by git-svn often have subtle damage in the metadata. One common class of problems comes from git-svn creating .gitignore and .git histories in the Subversion repository. Commits on these paths are best treated as junk and removed.

While they could be removed with a reposurgeon delete path command after the repository has been translated to Git, sometimes one sees a worse problem resulting from git-svn somehow creating invalid copy nodes involving .git/.gitignore. This can produce alarming log messages, or even cause a Subversion stream read to fail spuriously despite the actual Subversion part of the history being undamaged.

If you suspect this is happening, the following repocutter expunge command is your friend:

repocutter expunge "/.git$"  "/.gitignore$"

If you use this to pre-filter your repository dump it will remove the garbled copy nodes along with every other modification to matching paths, without affecting anything in the actual native Subversion part of the history.
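Before running the filter on a large dump, it can be worth checking which paths the two patterns would select. The Python sketch below assumes the patterns are applied as unanchored regular-expression searches against each node path; that is an assumption about repocutter, and the helper name is hypothetical.

```python
import re

PATTERNS = [re.compile(p) for p in ("/.git$", "/.gitignore$")]

def would_expunge(path):
    """True if either pattern matches somewhere in the path."""
    return any(p.search(path) for p in PATTERNS)

if __name__ == "__main__":
    for p in ("trunk/.gitignore", "trunk/sub/.git", "trunk/src/main.c", "trunk/xgit"):
        print(p, would_expunge(p))
```

Because the dot is not escaped, a path such as trunk/xgit also matches under this reading; if that matters for your history, a tighter pattern with an escaped dot may be preferable.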

17.8. Prevent timestamp collisions with timequake

If you’re going to use msgout/msgin to batch-edit commit metadata, consider running the timequake command first to ensure that all commit timestamps are unique. This way it will be guaranteed that you can feed the output of msgout --id to msgin without problems due to matching multiple commits.

After running timequake it’s a good idea to do "=Q list" to check that it hasn’t had surprising effects.
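The idea of forcing timestamps apart can be sketched in Python as follows. This is a deliberately naive illustration that bumps colliding values forward one second at a time; it is not the algorithm the timequake command uses, and it ignores committer identity.

```python
def make_unique(stamps):
    """Return timestamps (in seconds) bumped forward so that no two are equal."""
    seen = set()
    out = []
    for t in stamps:
        while t in seen:
            t += 1  # nudge a colliding stamp forward by one second
        seen.add(t)
        out.append(t)
    return out

if __name__ == "__main__":
    print(make_unique([10, 10, 10, 11]))
```

Even this toy version shows why a follow-up check is worthwhile: resolving one collision can push a timestamp onto a later commit's value and shift that one too.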

18. Troubleshooting and bug reports

18.1. When reposurgeon crashes

If reposurgeon crashes on a Subversion stream dump that has not been altered or filtered in any way since being emitted by svnadmin dump, that is definitely a bug; please report it as such. To speed up the fix, follow the instructions on preparing a reduced test case.

You are more likely to see a crash when processing a stream that has been modified by filtering out unwanted commits, e.g. via select or expunge or sift commands. The most common problem is copyfrom operations referring to a source revision that has been removed. In theory reposurgeon should fall back to referencing the latest revision before the gap, but there are still occasional crashes that the reposurgeon maintainers have not successfully characterized.

18.2. Dealing with memory exhaustion

To do its job, reposurgeon needs to hold all of your history’s metadata in memory. That doesn’t mean the content part, but does mean all of the changeset attributions, comments, and tags. Given a large enough repository, this will overrun the RAM of a small machine. If this happens to you, your reposurgeon instance will die abruptly with an OOM (Out Of Memory) error while attempting to read in your repository.

It is extremely unlikely that this is due to a bug in reposurgeon. Before filing an issue about it, there is a procedure you should try. It consists of bisecting on the parameter the Go language runtime uses to control the frequency of garbage collection. You can set this using the environment variable GOGC or the reposurgeon "gc" command.

GOGC defaults to 100, which instructs the runtime to garbage-collect when the heap size is 100% bigger than (i.e., twice) the size of the reachable objects after the last garbage collection. To increase the frequency of GC, usually resulting in a lower memory high-water mark, decrease that percentage threshold. To decrease gc frequency, increase the threshold so the runtime tolerates a larger heap.
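As a back-of-the-envelope aid, the relationship between GOGC and the heap size that triggers a collection can be written down directly. This Python sketch uses the simple proportional model described above and ignores refinements in recent Go runtimes; the function name is hypothetical.

```python
def heap_trigger(live_bytes, gogc=100):
    """Approximate heap size at which the next garbage collection starts."""
    return live_bytes * (100 + gogc) // 100

if __name__ == "__main__":
    gib = 2 ** 30
    for gogc in (100, 50, 25):
        print(gogc, heap_trigger(4 * gib, gogc) / gib)
```

With 4 GiB of live data, dropping GOGC from 100 to 25 lowers the approximate trigger point from 8 GiB to 5 GiB, at the cost of more frequent collections.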

To troubleshoot your OOM problem, bisect on this threshold to find the highest value that will avoid OOM. Start by cutting it to 50, then to 25, then to 12, then to 6. If you find a value that allows you to read to completion, you may want to try increasing it by a half interval (e.g. 50 to 75, 25 to 37, etc.) to get back some throughput.
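The sequence of values to try can be generated mechanically. This Python sketch reproduces the halving schedule and the half-interval step back up; the function names are hypothetical and the code only computes candidate GOGC values, it does not run reposurgeon.

```python
def halving_steps(start=100, floor=6):
    """GOGC values to try in order, halving until the floor is reached."""
    value, steps = start, []
    while value // 2 >= floor:
        value //= 2
        steps.append(value)
    return steps

def step_back_up(value, previous):
    """Half-interval step from a working value toward the last failing one."""
    return value + (previous - value) // 2

if __name__ == "__main__":
    print(halving_steps())
    print(step_back_up(25, 50))
```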

If your repository won’t read in at GOGC=6 you have a real problem. Unfortunately, it’s not one the reposurgeon devteam can help you with; the correct solution to it is to do your conversion on a machine with more RAM and/or more swap configured. 64GB should be sufficient. The largest repository the reposurgeon devteam has ever seen (the history of GCC, 280K Subversion commits) fit on a 128GB machine with GOGC=30.

If you can’t read your history onto a 128GB machine with GOGC=30, then maybe the reposurgeon devteam ought to hear about it. That said, if you can find ways to make reposurgeon more efficient, we are eager to accept those patches, or even just a bug report with the details. It’s probable that there are some efficiency gains yet to be made.

18.3. Dealing with stalled conversions

Occasionally it will happen that a conversion on a particularly large or malformed repository seems to stall out, grinding endlessly without completing a conversion phase.

Reposurgeon’s execution time is dominated by cycles spent in the memory allocator and garbage collector. Thus, you can pay RAM to decrease running time - push GOGC up from its default of 100. If your conversion completes in reasonable time before your memory usage increases enough that reposurgeon gets killed by OOM, you win. Otherwise see the previous section about adding RAM and swap space.

A stallout is more difficult to troubleshoot than an OOM, and more likely to indicate an actual bug or algorithmic problem in reposurgeon. There are a couple of things you can do to make a good resolution more likely:

  1. Identify and report the phase in which the stallout occurs. Be aware that there is a known problem in phase C2 of the Subversion dump reader, which really thrashes the allocator; that’s not reducible and we’re just going to tell you you have to throw more RAM and compute clocks at the problem.

  2. Use repobench to see if you can identify a revision that triggers the stall. The procedure for this is to use it to step your readlimit value up from zero until you see the runtime spike.

  3. As always, provide a stripped (and possibly obscured) dump of the repository for testing.

18.4. How to report bugs

Be aware that bugs can be due to problems in any of (a) reposurgeon itself, (b) the front end it’s using to read your repository, or (c) the back end that writes the modified repository. For example, when you have a bug in a CVS conversion it is more than likely to be a problem with the cvs-fast-export front end. In that case you need to read Reporting bugs in cvs-fast-export and follow those directions.

It is generally not possible to reproduce reposurgeon/repocutter bugs without a copy of the history on which they occurred. When you find a bug, please send the maintainers:

(a) An exact description of how observed behavior differed from expected behavior. If reposurgeon/repocutter died, include the backtrace.

(b) A git fast-import or Subversion dump file of the repository you were operating on, or in the case of CVS the whole repository. Alternatively a pointer to where it can be pulled from - though a static copy submitted with the bug is better because commits added afterwards can confuse diagnostics.

(c) A script containing the sequence of reposurgeon or repocutter commands that revealed the bug. If you were exploring interactively, remember that the "history" command exists and can dump your command history to a file.

(d) If you were using the standard-workflow Makefile generated by "repotool initmake", mention that in your bug report. If you modified the Makefile, include a copy with the bug report.

Please use the reposurgeon project’s issue tracker and attach these files. It’s helpful.

Are you seeing git die with a complaint about an unknown --show-original-ids option? Upgrade your git; reposurgeon needs 2.19.2 or later.

18.4.1. Test case reduction

If you know how to reproduce the error, the best possible test case is a hand-crafted dump stream of minimal size with content that explains how it breaks the tool. Those are turned into regression tests instantly.

When you don’t know the cause of the error, ship the project a dump file derived from the real repository and a script containing the commands that triggered it. To speed up the debugging process so you can get an answer more quickly, there are some tactics you can use to reduce the bulk of the test case you send.

How to make dumps in Git: cd to your git repository and capture the output of "repotool export".

How to make dumps in Subversion: cd to the toplevel directory of the repository master - not a checkout of it. You can tell you’re in the right place if you see this:

$ ls
conf  db  format  hooks  locks  README.txt

Then run "repotool export", capturing the output.

The commands you will use for test-case reduction are reposurgeon and, on Subversion dumps, repocutter.

There is a cvsstrip tool in the cvs-fast-export distribution that performs an analogous stripping operation on CVS repositories. Please see the troubleshooting instructions included with that distribution to learn how to use it.

18.4.2. Replace the content blobs in the dump with stubs

The subcommand in both tools is 'strip'; it will usually cut the size of the dump by more than a factor of 10. Check that the bug still reproduces on the stripped dump; if it doesn’t, that would be unprecedented and interesting in itself.

If you are trying to maintain confidentiality about your code, sending the maintainers a stripped repo has the additional advantage that the code won’t be disclosed! The command preserves structure and metadata but replaces each content blob with a unique magic cookie.

If you don’t want to disclose even the metadata, you can do a repocutter "obscure" pass after the strip. This will mask file paths and developer names.

18.4.3. Truncate the dump as much as possible

Try to truncate the dump to the shortest leading section that reproduces the bug.

A reposurgeon error message will normally include a mark, event number, or (when reading a Subversion dump) a Subversion revision number. Use a selection-set argument to reposurgeon’s 'write' command, or the 'select' subcommand of repocutter, to pare down the dump so that it ends just after the error is triggered. Again, check to ensure that the bug reproduces on the truncated dump.

If the error message doesn’t tell you where the problem occurred, try a bisection process. Use the readlimit option to ignore the last half of the events in the dump; check to see if the bug reproduces. If it does, repeat; if it does not, throw out the last quarter, then the last eighth, and so forth. Keep this up until you can no longer drop events without making the bug go away.
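The search for the shortest reproducing prefix is an ordinary binary search. In this Python sketch, `reproduces` is a hypothetical callable that you would implement yourself, for example by running your script with the readlimit set to the given value and checking for the failure; the sketch assumes the bug shows up on the full dump and on every prefix longer than the shortest failing one.

```python
def shortest_reproducing_prefix(total_events, reproduces):
    """Smallest readlimit in 1..total_events for which reproduces(limit) is true."""
    lo, hi = 1, total_events
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(mid):
            hi = mid       # bug still present: try a shorter prefix
        else:
            lo = mid + 1   # bug gone: the trigger lies further along
    return lo

if __name__ == "__main__":
    # Stand-in predicate: pretend event 437 is the one that triggers the bug.
    print(shortest_reproducing_prefix(1000, lambda limit: limit >= 437))
```

For a dump of a thousand events this takes about ten trial runs, which is why bisection is practical even when each run is slow.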

Bisection is more effective than you might expect, because the kinds of repository malformations that trigger misbehavior from reposurgeon tend to rise in frequency as you go back in time. The largest single category of them has been ancient cruft produced by cvs2svn conversions.

18.4.4. Topologically reduce the dump

Next, topologically reduce the dump, throwing out boring commits that are unlikely to be related to your problem.

If a commit has all file modifications (no additions or deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. In a Subversion dump it also has to not have any property changes; in a git dump it must not be referred to by any tag or reset. Interesting commits are not boring, or have a not-boring parent or not-boring child.
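The boring/interesting distinction can be expressed compactly in code. This Python sketch uses a hypothetical dictionary representation of commits, with a single "pinned" flag standing in for Subversion property changes or git tag and reset references; it illustrates the rule stated above and is not the implementation used by repocutter or reposurgeon.

```python
def boring(commit):
    """Only modifications, exactly one parent and one child, nothing pinned to it."""
    return (all(op == "M" for op in commit["ops"])
            and len(commit["parents"]) == 1
            and len(commit["children"]) == 1
            and not commit.get("pinned", False))

def interesting(commits):
    """commits maps id -> dict with ops, parents, children. Returns the ids to keep."""
    not_boring = {cid for cid, c in commits.items() if not boring(c)}
    keep = set(not_boring)
    for cid, c in commits.items():
        # A boring commit next to a not-boring one is still interesting.
        if any(n in not_boring for n in c["parents"] + c["children"]):
            keep.add(cid)
    return keep

if __name__ == "__main__":
    ids = "abcdef"
    chain = {
        cid: {"ops": ["A"] if i == 0 else ["M"],
              "parents": [ids[i - 1]] if i > 0 else [],
              "children": [ids[i + 1]] if i < len(ids) - 1 else []}
        for i, cid in enumerate(ids)
    }
    print(sorted(interesting(chain)))
```

On a six-commit linear chain, only the two commits in the middle are dropped; the root, the tip, and their immediate neighbors survive.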

Try using the 'reduce' subcommand of repocutter to strip boring commits out of a Subversion dump. For a git dump, look at reposurgeon’s "strip --reduce" command.

18.4.5. Prune irrelevant branches

Try to throw away branches that are not relevant to your problem. The 'expunge' operation on repocutter or the 'delete branch' command in reposurgeon may be helpful.

This is the attempted simplification most likely to make your bug vanish, so check that carefully after each branch deletion.

18.5. Know how to spot possible importer bugs

If your target VCS’s importer dies during a rebuild, try writing the repository content to a stream instead and importing the stream by hand. If the latter does not fail, the target VCS’s importer may be slightly buggy - but you have a workaround.

(This has been observed under git 2.5.0 with the result of a 'unite' operation on two repositories. The cause is unknown, as git dies suddenly enough to not leave a crash report.)

18.6. Benchmarking

A fair amount of effort has been expended to keep the run-time performance of reposurgeon as linear as possible. This is not an easy state to stay in; it is unfortunately quite simple to accidentally regress this without noticing.

To that end, there are some fairly simple scripts in the bench directory of the source distribution that can be used to check for this type of problem. repobench runs reposurgeon multiple times with a different readlimit each time, recording the run time and memory allocated at each iteration. Supply arguments specifying the svn dump file to read and the readlimit values to use like this:

./repobench your-dump-file.svn 1000 2000 20000

This reads your-dump-file.svn 10 times, with the readlimit set first to 1,000, then 3,000, and so on, stepping up by 2,000 each time until it reaches the 20,000 ceiling.
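To preview which readlimit values a given set of arguments implies, a short Python sketch is enough. It assumes the three numbers mean first value, step, and upper bound, which is consistent with the description above but is not taken from the repobench source.

```python
def readlimits(first, step, last):
    """Readlimit values implied by a first value, a step, and an upper bound."""
    return list(range(first, last + 1, step))

if __name__ == "__main__":
    limits = readlimits(1000, 2000, 20000)
    print(len(limits), limits)
```

Under that assumption the example arguments produce ten runs, the last of them at a readlimit of 19,000.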

This produces a .dat file which you can use with repobench -p, or repobench -o to produce graphs.

For an example, see oops.svg. This shows a graph made using a good revision that had linear performance, several made with revisions that introduced a regression that made performance quite non-linear, and the fix. You can easily tell the difference visually.

18.7. Incompatible language changes

Reposurgeon scripts are effectively never reused. Thus, incompatible changes to the command language don’t have a high cost in pain to users, and the maintainers feel free to make them whenever improvement seems possible. But just in case, such changes are recorded here.

In versions 4.38 and earlier, the DSL was substantially different. The command-line parser was fragile and ad-hoc, without consistent application of double-quoting. There was no "view" or "clone" command. The =E and =X event selectors didn’t yet exist. There was a read --use-uuid option which has been retired. Negation was not yet supported in pathsets. The regex, path rename, branch rename, and tag rename commands used Python-style rather than Go-style match references. Macro argument formals did not yet have a % prefix. The split command syntax was incompatibly different. The "set" and "clear" commands didn’t take a subcommand verb. There were "logfile" and "readlimit" commands doing what "set logfile" and "set readlimit" now do. The "elapsed", "memory", "when", and "sizeof" subcommands of "show" used to be separate commands. Blob/tag/reset creation used to be "{blob|tag|reset} create", not "create {blob|tag|reset}"; similarly for delete, rename, and move. The command to rename repositories was just "rename", not "rename repo". The "delete path" command was named "expunge". The subcommands of "list" were separate commands. There was a "script" command separate from "do". The "reference list" command was removed (it duplicated the =N selector); "references lift" became "stampify"; "timings" became "checkpoint"; "attribution" became "attribute". The syntax of the "remove" command was different and more complex. The "import" command was "incorporate". The "select" command was "sourcetype".

In versions 4.30 and earlier, the read command had a --branchify option. This has been retired; you are now expected to do the equivalent manipulations with repocutter.

The "tip" command present in 4.24 and earlier has been removed, because the concept it’s trying to compute is not clearly defined at any commit upstream of a fork point. There’s no evidence this was ever used in the wild.

In versions 4.24 and earlier the behavior of the "unite" command was incompatibly different in that it took master branch from the last repo in the list, renaming others. Now it leaves master from the first.

The blob command now takes an explicit mark, rather than creating a new blob :1 and renumbering all others as in 4.23 and earlier.

Filter syntax now uses subcommand verbs rather than the options of 4.23 and earlier.

In versions 4.23 and earlier, the stats command operated on named repositories rather than the currently-selected one.

In versions 4.23 and earlier, it was possible to redirect output from msgin to capture a mailbox of changed entries. This feature has been removed; msgin now sets =Q bits which can be used to generate the same report.

In versions 4.23 and earlier, the --branchify read option was a separate command that needed to be invoked before the read and set hidden global context. There was a related branchmap command that has been retired because branch rename covers all its cases in a simpler way.

In versions 4.23 and earlier, the behavior of the branch command was incompatibly different; it did not require a "heads/" or "tags/" prefix on its operand and, because of that, could only operate on heads/ branches and not lightweight tags.

In versions 4.23 and earlier, several commands that now have the form "object verb selection" had the form "object selection verb". This includes branch, tag, reset, and path.

In versions 4.23 and earlier, the syntax of the expunge command was different; it used "~" instead of "--not" and took multiple patterns.

The "paths sup" and "paths sub" commands of versions 4.23 and earlier have been retired and replaced by the enhanced path rename command.

As of 4.23, branch references as used in "branch rename" and elsewhere now require a namespace prefix - heads/ or tags/. This was done to eliminate a number of special cases that were difficult to document and remember.

In versions before 4.10, the "reduce" and "blob" options of the "strip" command were bare keywords. Also the options of the "ignores" command were bare keywords. There was a command to set the prompt string that has been retired.

In versions before 4.8, the expunge command run on a repository named "foo" tried to keep deleted fileops in a new synthetic repository named "foo-expunges". This feature has been replaced by the "~" negation operator on expunge selections.

In versions before 4.1, the index command did not see blobs by default.

In versions before 4.0, msgin and msgout were named "mailbox_in" and "mailbox_out"; --branchify was "branchify_map". Previous versions used the Python variant of regular expressions; some of the more idiosyncratic features of these are not replicated in the Go implementation.

In versions before 3.23, ‘prefer’ changed the repository type as well as the preferred output format. Since then, do this with ‘select’.

In versions before 3.0, the general command syntax put the command verb first, then the selection set (if any) then modifiers (VSO). It has changed to optional selection set first, then command verb, then modifiers (SVO). The change made parsing simpler, allowed abolishing some noise keywords, and recapitulates a successful design pattern in some other Unix tools - notably sed(1).

In versions before 3.0, path expressions only matched commits, not commits and the associated blobs as well. The names of the "a" and "c" flags were different.

In reposurgeon versions before 3.0, the delete command had the semantics of squash; also, the policy flags did not require a ‘--’ prefix. The ‘--delete’ flag was named "obliterate".

In reposurgeon versions before 3.0, read and write optionally took file arguments rather than requiring redirects (and the write command never wrote into directories). This was changed in order to allow these commands to have modifiers. These modifiers replaced several global options that no longer exist.

In reposurgeon versions before 3.0, the earliest factor in a unite command always kept its tag and branch names unaltered. The new rule for resolving name conflicts, giving priority to the latest factor, produces more natural behavior when uniting two repositories end to end; the master branch of the second (later) one keeps its name.

In reposurgeon versions before 3.0, the tagify command expected policies as trailing arguments to alter its behavior. The new syntax uses similarly named options with leading dashes, that can appear anywhere after the tagify command.

In versions before 2.9, the syntax of authors, legacy, list, and what are now msgin and msgout was different (and legacy was named fossils). They took plain filename arguments rather than using redirect < and >.

In versions so old that the changeover point is now lost in the mists of time, curly brackets (not parens) performed sub-expression grouping.

18.8. Emergency help

If you need emergency help, go to the #reposurgeon IRC on irc.oftc.net. Be aware, however, that the maintainer is too busy to babysit difficult repository conversions unless he has explicitly volunteered for one or someone is paying him to care about it. For explanation, see Your money or your spec.

19. Stream syntax extensions

The event-stream parser in reposurgeon supports some extended syntax. Exporters designed to work with reposurgeon may have a --reposurgeon option that enables emission of extended syntax; notably, this is true of cvs-fast-export(1). The remainder of this section describes these syntax extensions. The properties they set are (usually) preserved and re-output when the stream file is written.

The token ‘#reposurgeon’ at the start of a comment line in a fast-import stream signals reposurgeon that the remainder is an extension command to be interpreted by reposurgeon.

One such extension command is implemented: ‘sourcetype’, which behaves identically to the reposurgeon select command. An exporter for a version-control system named "frobozz" could, for example, say

#reposurgeon sourcetype frobozz
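
As an illustration of how a reader might recognize such a line, here is a minimal Go sketch. The helper name parseExtension is hypothetical and this is not reposurgeon's actual parser; in particular, requiring whitespace after the token is an assumption made here so that a comment such as "#reposurgeonfoo" is left alone.

```go
package main

import (
	"fmt"
	"strings"
)

// parseExtension reports whether line is a "#reposurgeon" extension comment
// and, if so, returns the extension command and its arguments.
// Hypothetical helper for illustration only.
func parseExtension(line string) (cmd string, args []string, ok bool) {
	const token = "#reposurgeon"
	if !strings.HasPrefix(line, token) {
		return "", nil, false
	}
	rest := line[len(token):]
	// Assumption: the token must be followed by whitespace.
	if rest == "" || (rest[0] != ' ' && rest[0] != '\t') {
		return "", nil, false
	}
	fields := strings.Fields(rest)
	if len(fields) == 0 {
		return "", nil, false
	}
	return fields[0], fields[1:], true
}

func main() {
	for _, line := range []string{
		"#reposurgeon sourcetype frobozz",
		"# an ordinary comment",
	} {
		if cmd, args, ok := parseExtension(line); ok {
			fmt.Printf("extension %q with args %q\n", cmd, args)
		} else {
			fmt.Printf("ordinary comment: %q\n", line)
		}
	}
}
```

Anything the helper rejects would simply be treated as an ordinary comment line.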

Within a commit, a magic comment of the form ‘#legacy-id’ declares a legacy ID from the stream file’s source version-control system.

Also accepted is the bzr/brz syntax for setting per-commit properties. While parsing commit syntax, a line beginning with the token ‘property’ must continue with a whitespace-separated property-name token. If it is then followed by a newline it is taken to set that boolean-valued property to true. Otherwise it must be followed by a numeric token specifying a data length, a space, following data (which may contain newlines) and a terminating newline. For example:

commit refs/heads/master
mark :1
committer Eric S. Raymond <esr@thyrsus.com> 1289147634 -0500
data 16
Example commit.

property legacy-id 2 r1
M 644 inline README

Unlike other extensions, bzr/brz properties are only preserved on stream output if the preferred type is bzr or brz, because any importer other than those will choke on them.
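
The property grammar described above can be sketched in Go roughly as follows. This is an illustrative reading of the rules in this section, not reposurgeon's own code: the type property and the functions readToken, readProperty, and parseProperties are hypothetical names, and the property names "flagged" and "note" in the sample input are invented for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// property is a hypothetical holder for one parsed item.
type property struct {
	Name  string
	Value string
	Flag  bool // true when the property carried no data (boolean true)
}

// readToken consumes bytes up to the next space or newline and returns the
// token together with the delimiter that ended it.
func readToken(r *bufio.Reader) (string, byte, error) {
	var sb strings.Builder
	for {
		b, err := r.ReadByte()
		if err != nil {
			return "", 0, err
		}
		if b == ' ' || b == '\n' {
			return sb.String(), b, nil
		}
		sb.WriteByte(b)
	}
}

// readProperty parses one item with the reader positioned at "property".
func readProperty(r *bufio.Reader) (property, error) {
	kw, delim, err := readToken(r)
	if err != nil {
		return property{}, err
	}
	if kw != "property" || delim != ' ' {
		return property{}, fmt.Errorf("expected a property line, got %q", kw)
	}
	name, delim, err := readToken(r)
	if err != nil {
		return property{}, err
	}
	if delim == '\n' {
		// Name followed directly by newline: boolean property set to true.
		return property{Name: name, Flag: true}, nil
	}
	lenTok, delim, err := readToken(r)
	if err != nil {
		return property{}, err
	}
	n, convErr := strconv.Atoi(lenTok)
	if convErr != nil || n < 0 || delim != ' ' {
		return property{}, fmt.Errorf("property %s: bad length %q", name, lenTok)
	}
	// Read exactly n bytes; the data may itself contain newlines.
	data := make([]byte, n)
	if _, err := io.ReadFull(r, data); err != nil {
		return property{}, err
	}
	if b, err := r.ReadByte(); err != nil || b != '\n' {
		return property{}, fmt.Errorf("property %s: missing terminating newline", name)
	}
	return property{Name: name, Value: string(data)}, nil
}

// parseProperties reads consecutive property items until end of input.
func parseProperties(stream string) ([]property, error) {
	r := bufio.NewReader(strings.NewReader(stream))
	var props []property
	for {
		if _, err := r.Peek(1); err == io.EOF {
			return props, nil
		}
		p, err := readProperty(r)
		if err != nil {
			return props, err
		}
		props = append(props, p)
	}
}

func main() {
	props, err := parseProperties("property legacy-id 2 r1\nproperty flagged\nproperty note 9 two\nlines\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range props {
		fmt.Printf("%s: flag=%v value=%q\n", p.Name, p.Flag, p.Value)
	}
}
```

Reading the data by byte count rather than by line is what lets a value contain embedded newlines.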

20. Limitations and guarantees

Guarantee: In DVCSes that use commit hashes, editing with reposurgeon never changes the hash of a commit object unless (a) you edit the commit, or (b) it is a descendant of an edited commit in a VCS that includes parent hashes in the input of a child object’s hash (git and hg both do this).
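
Case (b) can be illustrated with a toy hash chain. This is only a sketch: chainHashes is a hypothetical function, and hashing "parent ID plus message" with SHA-256 stands in for whatever a real VCS feeds into its commit hash. It is not the serialization git or hg actually use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chainHashes returns one ID per commit message in a linear history. Each ID
// is computed over the parent's ID plus the message, so a change to any
// commit propagates to all of its descendants. Toy model only.
func chainHashes(messages []string) []string {
	ids := make([]string, len(messages))
	parent := ""
	for i, msg := range messages {
		sum := sha256.Sum256([]byte(parent + "\n" + msg))
		ids[i] = hex.EncodeToString(sum[:])
		parent = ids[i]
	}
	return ids
}

func main() {
	before := chainHashes([]string{"first", "second", "third"})
	after := chainHashes([]string{"first", "second (edited)", "third"})
	for i := range before {
		fmt.Printf("commit %d hash changed: %v\n", i+1, before[i] != after[i])
	}
}
```

In this model the first commit keeps its ID, while the edited commit and its descendant both get new ones.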

Guarantee: reposurgeon only requires main memory proportional to the size of a repository’s metadata history, not its entire content history. (Exception: the data from inline content is held in memory.)

Guarantee: In the worst case, reposurgeon makes its own copy of every content blob in the repository’s history and thus uses intermediate disk space approximately equal to the size of a repository’s content history. However, when the repository to be edited is presented as a stream file, reposurgeon requires no or only very little extra disk space to represent it; the internal representation of content blobs is a (seek-offset, length) pair pointing into the stream file.
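
The (seek-offset, length) representation can be pictured with a short Go sketch. The names blobRef, content, and blobFromString are hypothetical and chosen for this example; the real internal types may differ.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// blobRef locates a blob's content inside a stream file without copying it.
// Hypothetical type for illustration.
type blobRef struct {
	Offset int64 // where the content starts in the stream file
	Length int64 // how many bytes of content there are
}

// content reads the blob's bytes on demand from the underlying stream.
func (b blobRef) content(stream io.ReaderAt) ([]byte, error) {
	buf := make([]byte, b.Length)
	_, err := io.ReadFull(io.NewSectionReader(stream, b.Offset, b.Length), buf)
	return buf, err
}

// blobFromString is a convenience wrapper that treats a string as the stream.
func blobFromString(stream string, ref blobRef) (string, error) {
	data, err := ref.content(strings.NewReader(stream))
	return string(data), err
}

func main() {
	stream := "blob\nmark :1\ndata 6\nhello\n"
	// Locate the payload by searching, rather than hard-coding an offset.
	ref := blobRef{Offset: int64(strings.Index(stream, "hello")), Length: 6}
	text, err := blobFromString(stream, ref)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("blob content: %q\n", text)
}
```

Only the two integers live in memory per blob; the bytes are fetched from the stream file when needed.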

Guarantee: reposurgeon never modifies the contents of a repository it reads, nor deletes any repository. The results of surgery are always expressed in a new repository.

Guarantee: Any line in a fast-import stream that is not a part of a command reposurgeon parses and understands will be passed through unaltered. At present the set of potential passthroughs is known to include the progress, options, and checkpoint commands as well as comments led by #.

Guarantee: All reposurgeon operations either preserve all repository state they are not explicitly told to modify or warn you when they cannot do so.

Guarantee: reposurgeon handles the bzr/brz commit-properties extension, correctly passing through property items including those with embedded newlines. (Such properties are also editable in the message-box format.)

Limitation: Because reposurgeon relies on other programs to generate and interpret the fast-import command stream, it is subject to bugs in those programs.

Limitation: bzr/brz suffers from deep confusion over whether its unit of work is a repository or a floating branch that might have been cloned from a repo or created from scratch, and might or might not be destined to be merged to a repo one day. Its exporter only works on branches, but its importer creates repos. Thus, a rebuild operation will produce a subdirectory structure that differs from what you expect. Look for your content under the subdirectory ‘trunk’.

Limitation: under git, signed tags are imported verbatim. However, any operation that modifies any commit upstream of the target of the tag will invalidate it.

Limitation: Stock git (at least as of version 1.7.3.2) will choke on property extension commands. Accordingly, reposurgeon omits them when rebuilding a repo with git type.

Limitation: Converting an hg repo that uses bookmarks (not branches) to git can lose information; the branch ref that git assigns to each commit may not be the same as the hg bookmark that was active when the commit was originally made under hg. Unfortunately, this is a real ontological mismatch, not a problem that can be fixed by cleverness in reposurgeon.

Limitation: Converting an hg repo that uses branches to git can lose information because git does not store an explicit branch as part of commit metadata, but colors commits with branch or tag names on the fly using a specific coloring algorithm, which might not match the explicit branch assignments to commits in the original hg repo. Reposurgeon preserves the hg branch information when reading an hg repo, so it is available from within reposurgeon itself, but there is no way to preserve it if the repo is written to git.

Limitation: Not all BitKeeper versions have the fast-import and fast-export commands that reposurgeon requires. They are present back to the 7.3 open-source version.

Limitation: reposurgeon may misbehave under a filesystem which smashes case in filenames, or which nominally preserves case but maps names differing only by case to the same filesystem node (Mac OS X behaves like this by default). Problems will arise if any two paths in a repo differ by case only. To avoid the problem on a Mac, do all your surgery on an HFS+ file system formatted with case sensitivity specifically enabled.
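
Before operating on such a filesystem, it may help to check whether a repository has any paths that differ by case only. The following Go sketch shows one way to do that; caseCollisions is a hypothetical helper, and strings.ToLower is only an approximation of the case folding a real filesystem applies.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// caseCollisions groups paths that differ only by letter case.
// Assumption: strings.ToLower approximates the filesystem's folding rules.
func caseCollisions(paths []string) [][]string {
	seen := make(map[string]bool)
	groups := make(map[string][]string)
	for _, p := range paths {
		if seen[p] {
			continue // ignore exact duplicates
		}
		seen[p] = true
		key := strings.ToLower(p)
		groups[key] = append(groups[key], p)
	}
	var out [][]string
	for _, g := range groups {
		if len(g) > 1 {
			sort.Strings(g)
			out = append(out, g)
		}
	}
	// Sort groups so the result does not depend on map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i][0] < out[j][0] })
	return out
}

func main() {
	paths := []string{"README", "src/Main.go", "readme", "src/main.go", "LICENSE"}
	for _, group := range caseCollisions(paths) {
		fmt.Println("collision:", strings.Join(group, ", "))
	}
}
```

An empty result suggests the repository is safe from this particular problem; any group reported would collapse to a single node on a case-smashing filesystem.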

Guarantee: As version-control systems add support for the fast-import format, their repositories will become editable by reposurgeon.

21. Notes on the DSL

The syntax of the reposurgeon DSL evolved ad-hoc as more capabilities were glued onto the tool. The result has been something of an irregular mess that is difficult for users to wrap their heads around even with good documentation.

Wrestling the DSL into a simpler and more uniform shape is an ongoing area of development, and not a simple one. There are conflicting objectives:

  1. Simplify individual commands with complex syntax.

  2. Most commands have the form of a simple verb (V). Some have the form verb-subject (VS) or verb-subject-object (VSO). Some have the form subject-verb (SV). Some are subject nouns (S). We want the simplest and most uniform morphology possible.

  3. Group commands so the language is easier to remember and document.

  4. Use subcommand keywords for alternative modes, as opposed to policy bits that can be set several at a time.

  5. Keep commonly-used commands short and simple.

A relatively recent advance at the end of the 4.x release series was implementing uniform argument-tokenization rules across all commands (with one motivated exception at "filter shell"). This replaced differing tokenization rules in different commands.

The state of play…​

Most commands are V:

{SELECTION} add { "D" {PATH} | "M" {PERM} {MARK} {PATH} | "R" {SOURCE} {TARGET} | "C" {SOURCE} {TARGET} }
{SELECTION} append [--rstrip] {TEXT}
{SELECTION} assign [--singleton] [NAME]
branchlift SOURCEBRANCH PATHPREFIX [NEWNAME]
{SELECTION} checkout DIRECTORY
checkpoint [MARK-NAME] [>OUTFILE]
choose [REPO-NAME]
[SELECTION] coalesce [--changelog] [--debug] [TIMEFUZZ]
{SELECTION} count [>OUTFILE]
debranch SOURCE-BRANCH [TARGET-BRANCH]
[SELECTION] dedup
define [NAME [TEXT]]
{SELECTION} diff [>OUTFILE]
{SELECTION} divide
do MACRO-NAME [ARG...]
drop [REPO-NAME]
exit [>OUTFILE]
gc [GOGC]
[SELECTION] gitify
[SELECTION] graft [--prune] REPO-NAME
[SELECTION] graph [>OUTFILE]
[SELECTION] hash [--tree] [>OUTFILE]
help [COMMAND]
history
{SELECTION} import [--date=YY-MM-DDTHH:MM:SS|--after|--firewall] [TARBALL...]
[SELECTION] lint [--OPTION...] [>OUTFILE]
log [[+-]LOG-CLASS]...
{SELECTION} merge
[SELECTION] msgin [--create] [<INFILE]
[SELECTION] msgout [--decode=codec] [--filter=PATTERN] [--blobs]
prefer [VCS-NAME]
{SELECTION} prepend [--rstrip] {TEXT}
preserve [PATH...]
print [TEXT...] [>OUTFILE]
quit
read [--quiet] [<INFILE | - | DIRECTORY]
rebuild [DIRECTORY]
[SELECTION] remove {INDEX | ["D"|"M"|"R"|"C"|"N"] [PATH]} [to TARGET]
renumber
[SELECTION] reorder [--quiet]
{SELECTION} resolve
select [VCS-NAME]
[SELECTION] setfield FIELD VALUE
{SELECTION} setperm PERM [PATH...]
shell [COMMAND-TEXT]
[SELECTION] split [ --path ] PATH-OR-INDEX
[SELECTION] squash [POLICY-FLAGS...]
[SELECTION] stampify
[SELECTION] strip {--reduce|--blobs|--obscure}
[SELECTION] tagify [ --tagify-merges | --canonicalize | --tipdeletes ]
[SELECTION] transcode ENCODING
unassign NAME
undefine MACRO-NAME
unite [--prune] [REPO-NAME...]
{SELECTION} unmerge
unpreserve [PATH...]
view [DIRECTORY]
[SELECTION] write [--legacy] [--noincremental] [--callout] [>OUTFILE|-|DIRECTORY]

VS:

[SELECTION] attribute [ATTR-SELECTION] SUBCOMMAND [ARG...]
clear flag [canonicalize|crlf|compress|echo|experimental|interactive|progress|serial|faketime|quiet]+
clear {logfile|readlimit|desclimit}
[SELECTION] create {repo NAME|blob NAME [<INFILE]|tag NAME|reset NAME}
{SELECTION} delete {commit | {path|tag|branch|reset} [--quiet|--not|--notagify] PATTERN}
[SELECTION] filter {dedos|shell|regexp|replace} [TEXT-OR-REGEXP]
[SELECTION] list [--decode=codec] [commits|tags|stamps|inspect|index|manifest|paths|names] [PATTERN] [>OUTFILE]
profile {live|start|save|bench} [PORT | SUBJECT [FILENAME]]
set flag [canonicalize|crlf|compress|echo|experimental|interactive|progress|serial|faketime|quiet]+
set {logfile|readlimit|desclimit} VALUE
show {elapsed|memory|sizeof|when TIMESTAMP} [>OUTFILE]

VSO:

[SELECTION] move {tag|reset} [--not] PATTERN [NEW-NAME]
[SELECTION] remove {INDEX | ["D"|"M"|"R"|"C"|"N"] [PATH]} [to TARGET]
[SELECTION] rename {repo | path PATTERN [--force] | {path|branch|tag|reset} [--not] PATTERN} NEW-NAME

The following are SV. Inverting these to verb-first might be a good thing. Unfortunately this would make the embedded help for the read and write verbs a big heterogeneous mess.

[SELECTION] authors {read <INFILE | write >OUTFILE}
legacy {read [<INFILE] | write [>OUTFILE]}

The following are S, and there’s no obvious way to move them to a verb-first form:

[SELECTION] changelogs [BASENAME-PATTERN]
ignores [--translate] [--defaults]
[SELECTION] timeoffset {OFFSET}
[SELECTION] timequake [--tick]
version [EXPECT]

22. Credits

These are in roughly descending magnitude.

Eric S. Raymond <esr@thyrsus.com>

Designer and original author.

Julien "FrnchFrgg" RIVAUD <frnchfrgg@free.fr>

Lots of high-quality code cleanups and speed tuning. Responsible for at least half of the massively revamped Subversion dump reader on the 4.0 releases. Ported the CoW filemaps from Python to Go.

Daniel Brooks <db48x@db48x.net>

Date unit testing, improvements for split and expunge commands. Assistance on Python to Go port. Go profiling support. Several significant reductions in total run time, total allocations, and max heap usage.

Greg Hudson <ghudson@MIT.EDU>

Contributed copy-on-write filemaps, which both tremendously sped up Subversion dumpfile parsing and squashed a nasty bug in the older code. While his CoW implementation was eventually replaced with one by Julien Rivaud, it busted the project out of a nearly two-year slump.

Eric Sunshine <sunshine@sunshineco.com>

Review of seldom-used features, test improvements, bug-fixing. Generalized selection expression parser for use-cases other than events. Converted selection parser, which evaluated an expression while parsing it, to a compile/evaluate paradigm in which a selection expression can be compiled once and evaluated many times. Added 'attribute' command. Added 'reorder' command. Assisted Python to Go port.

Edward Cree <ec429@cantab.net>

Wrote the Hg extractor class and its test.

Ian Bruene <ianbruene@gmail.com>

Wrote the kommandant package and the Go port of Python difflib in order to support this package.

Chris Lemmons <alficles@gmail.com>

Solved some problems with inline blobs, improved interoperability with Mercurial, wrote the --prune option for graft.

Richard Hansen <rhansen@rhansen.org>

Selections as ordered rather than compulsorily sorted sets. The generalized reparent command. Improvements in regression-test infrastructure.

Peter Donis <peterdonis@alum.mit.edu>

Python 3 port and Python2/3 interoperability. Historical: none of this survived the port to Go.

Appendix A: The ontological-mismatch problem and its consequences

There are many tools for converting repositories between version-control systems out there. This appendix explains why reposurgeon is the best of breed by comparing it to the competition.

The problems other repository-translation tools have come from ontological mismatches between their source and target systems - models of changesets, branching and tagging can differ in complicated ways. While these gaps can often be bridged by careful analysis, the techniques for doing so are algorithmically complex, difficult to test, and have ugly edge cases.

Furthermore, doing a really high-quality translation often requires human judgment about how to move artifacts - and what to discard. But most lifting tools are, unlike reposurgeon, designed as run-it-once batch processors that can only implement simple and mechanical rules.

Consequently, most repository-translation tools evade the harder problems. They produce a sort of pidgin rendering that crudely and partially copies the history from the source system to the target without fully translating it into native idioms, leaving behind metadata that would take more effort to move over or leaving it in the native format for the source system.

But pidgin repository translations are a kind of friction drag on future development, and are just plain unpleasant to use. So instead of evading the hard problems, reposurgeon provides a power assist for a human to tackle them head-on.

Here are some specific symptoms of evasion that are common enough to deserve tags for later reference.

LINEAR: One very common form of evasion is only handling linear histories.

NO_IGNORES: There are many different mechanisms for ignoring files - .cvsignore, Subversion svn:ignore properties, .gitignore and their analogues. Many older Subversion repositories still have .cvsignore files in them as relics of CVS prehistory that weren’t translated when the repository was lifted. Reposurgeon, on the other hand, knows these can be changed to .gitignore files and does it.

NO_TAGS: Many repository translators cannot generate annotated tags (or their non-git equivalents) even when that would be the right abstraction in the target system.

CONFIGURATION: Another common failing is for repository-translation tools to require a lot of configuration and ceremony before they can operate. Often, for example, tools that translate from Subversion repositories require you to declare the repository’s branch structure every time even though sensible defaults and a bit of autodetection could have avoided this.

MIXED_BRANCH: Yet another case usually handled poorly (in translators that handle branching) is mixed-branch commits. In Subversion it is possible (though a bad idea) to commit a changeset that modifies multiple branches at once. All sufficiently old Subversion repositories have these, often by accident. The proper thing to do is split these up; the usual thing is to assign them to one branch and leave them omitted from the others.

Version references in commit comments. It is not uncommon to see a lot of references that are no longer usable embedded in translated repositories like fossils in geological strata - file-version numbers like '1.2' in Subversion repos that had a former life in CVS, Subversion references like 'r1234' in git repositories, and so forth. There’s no tag for this because tools other than reposurgeon generally have no support at all for lifting these.

To avoid repetitive text in these descriptions, we use the following additional bug tags:

ABANDONED: Effectively abandoned by its maintainer. Some tools with this tag are still nominally maintained but have not been updated or released in years.

NO_DOCUMENTATION: Poorly (if at all) documented.

!FOO means the tool is known not to have problem FOO.

?FOO means the author has not tried the tool but has strong reason to suspect the problem is present based on other things known about it.

You should assume that none of these tools do reference-lifting.

A.1. cvs2svn

Just after the turn of the 21st century, when Subversion was the new thing in version control, most projects that were using version control were using CVS, and cvs2svn was about the only migration path.

Early cvs2svn had problems on every level, only some of which have been fixed by more recent releases. It tended to spew junk commits into the translated history, and produced strange combinations of Subversion internal operations that most later translation tools would cope with only poorly. Sometimes the resulting translations are actually malformed; more often they contain noisy commits or commit duplications that made little sense under Subversion and make even less under the new target system.

As of late 2023, cvs2svn "is now in maintenance mode and is not actively being developed".

!LINEAR, ?MIXED_BRANCH, NO_DOCUMENTATION

A.2. cvs-fast-export

Formerly named parsecvs. Originally written by Keith Packard to port the X.org repositories, which it did a good job on. Now maintained by me; reposurgeon uses it to read CVS repositories. It is extremely fast and can thus be productively used even on huge repositories.

cvs-fast-export is under active maintenance.

!ABANDONED, !LINEAR, !NO_IGNORES, !NO_DOCUMENTATION, !CONFIGURATION

A.3. cvsps

Don’t use this. Just plain don’t. The author maintained version 3.x until deprecating it in favor of cvs-fast-export due to fundamental, unfixable problems. It gets branch topology wrong in ways that are difficult to detect.

ABANDONED

A.4. git-svn

git-svn, the Subversion converter in the git distribution, is really designed to be a two-way live gateway enabling git users to push and pull commits from a Subversion server. It operates by creating a git repository that is effectively a local mirror of the Subversion history, then performing Subversion client commands to synchronize the two in a git-like way.

This choice of mission means that git-svn’s translation of history into git uses a compromise between Subversion idioms and git ones that is more designed to make transactions back to the Subversion server easy and safe to generate than it is to make full use of the git capabilities that Subversion doesn’t have. This is pidgin translation for a reason better than laziness or failure of nerve, but it’s still pidgin.

Worse, git-svn has bugs that severely compromise it for full translations. It tends to stumble over common repository malformations in Subversion, producing history damage that is significant but evades superficial scrutiny. The author has written about this problem in detail at Don’t do svn-to-git repository conversions with git-svn!

For a straight linear history with no tags or branches, the difference between git-svn’s Subversion-emulating behavior and the way a git repository would most naturally be structured is minimal. But for conformability with Subversion, git-svn cannot (practically speaking) use git’s annotated-tag facility in the local mirror; instead, Subversion tags have to be represented in the local mirror as git branches even if they have no changes after the initial branch copy.

Another thing the live-gatewaying use case prevents is reference-lifting. Subversion references like "r1234" in commit comments have to be left as-is to avoid creating pain for users of the same Subversion remote not going through git-svn.

git-svn was used by Google Code’s exporter and is used by GitHub’s importer web service. Depending on the latter is not recommended.

!ABANDONED, MIXED_BRANCH, NO_TAGS, NO_IGNORES.

A.5. git-svnimport

Formerly part of the git suite; what they had before git-svn, and inferior to it. Among other problems, it can only handle Subversion repos with a "standard" trunk/tags/branches layout. Now deprecated.

MIXED_BRANCH, NO_TAGS, NO_IGNORES, ABANDONED.

A.6. git-svn-import

A trivial wrapper around git-svn. All the reasons not to use git-svn apply to it as well. As of late 2023 it has not seen an update in ten years.

MIXED_BRANCH, NO_TAGS, NO_IGNORES, ABANDONED.

A.7. svn-fe

svn-fe was a valiant effort to produce a tool that would dump a Subversion repository history as a git fast-import stream. It made it into the git contrib directory at one point, but seems to have been dropped.

LINEAR, NO_TAGS, NO_IGNORES, ABANDONED.

A.8. Tailor

Tailor aimed to be an any-to-any repository translator. As of late 2023 the last commit was in 2011.

LINEAR, ?NO_IGNORES, ABANDONED.

A.9. agito

This is a Subversion-to-git tool that was written to handle some cases that git-svn barfs on (but reposurgeon doesn’t - the reposurgeon test suite contains a case sent by agito’s author to check this). It even handles mixed-branch commits correctly.

!LINEAR, !NO_TAGS, !MIXED_BRANCH, CONFIGURATION, ABANDONED.

If you could not use reposurgeon for some reason, this was one of the best alternatives. However, as of 2023 it appears to have been abandoned in 2016.

A.10. svn2git (jcoglan/nirvdrum version)

A batch-conversion wrapper around git-svn that creates real tag objects. This is the one written in Ruby. Has all of git-svn’s problems, alas.

!ABANDONED, !NO_TAGS, NO_IGNORES, ABANDONED.

A.11. svn2git (Schemenauer version)

Native Python. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.12. svn-all-fast-export (nyblom version)

Written in C++. Says it’s also named svn2git and might be genetically related to Nyblom svn2git. Was used to convert the KDE Subversion repository to Git. Looks like it might not be terrible, but requires very elaborate configuration to map Subversion branch structure to git branch structure. Actively maintained in 2023.

CONFIGURATION, !ABANDONED.

A.13. svn-fast-export

Written in C. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.14. svn-dump-fast-export

Written in C. Documentation is so lacking that there isn’t even a README. However, it’s possible to deduce what isn’t there by reading the code. In 2023 this effectively moved into the Git tree.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.15. svn-all-fast-export (thiago version)

May be genetically related to the Nyblom svn2git, but if so they diverged in 2008. As of 2023,

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.16. sccs2git

There is a script called sccs2git on CPAN which attempts to lift SCCS collections direct to Git. It is not recommended, as it is poorly documented and makes no attempt to group commits into changesets.

NO_DOCUMENTATION, ABANDONED.

A.17. SubGit

Nearly unique for this category of software in being closed-source. Beyond an evaluation period, users have to register, possibly for a cost (it’s supposed to be free-of-charge for certain uses: open source projects, education, and "startups" — history with BitKeeper shows that such exemptions should probably not be trusted).

The intended outcome of this program is to provide a server that lets Subversion and Git users interact at once. This may be of little value overall: new developers are frequently unfamiliar with Subversion (and old ones forget the usage patterns!), fundamental differences in the design of the two VCSes interfere with the quality of both views, and confusion over the preferred modes of contribution increases.

The quality of SubGit’s conversion is rather poor. It fails to properly translate at least half of the reposurgeon *.svn regression tests, even some of the simpler ones - although trickier cases such as agito.svn it does translate correctly. Large real-world Subversion repos will exhibit multiple issues that SubGit may, silently or otherwise, trip over.

This program will forever contain compromises for the same reasons git-svn does. The non-open source nature leaves little hope of having such issues repaired by skilled community members.

Atlassian’s BitBucket service relies on this for Subversion-to-Git migration. Depending on this service is not recommended.

!MIXED_BRANCH, !LINEAR, CONFIGURATION, NO_DOCUMENTATION

Appendix B: A tour of the codebase

Reposurgeon is intended to be hackable to add support for special-purpose or custom operations, though it’s even better if you can come up with a new surgical primitive general enough to ship with the stock version. For either case, here’s a guide to the architecture.

B.1. inner.go

The core classes in inner.go support deserializing and reserializing import streams. In between these two operations the repo state lives in a fairly simple object, Repository. The main part of Repository is just a list of objects implementing the Event interface - Commits, Blobs, Tags, Resets, and Passthroughs. These are straightforward representations of the command types in a Git import stream, with Passthrough as a way of losslessly conveying lines the parser does not recognize.

 +-------------+    +---------+    +-------------+
 | Deserialize |--->| Operate |--->| Reserialize |
 +-------------+    +---------+    +-------------+

The general theory of reposurgeon is: you deserialize, you do stuff to the event list that preserves correctness invariants, you reserialize. The "do stuff" is mostly not in the core classes, but there is one major exception. The primitive to delete a commit and squash its fileops forwards or backwards is seriously intertwined with the core classes and actually makes up almost 50% of Repository by line count.
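
The deserialize, operate, reserialize shape can be sketched in a few lines of Go. This is a deliberately tiny toy, not the code in inner.go: the Event interface here has a single hypothetical Serialize method, only Reset and Passthrough events are modeled, and deserialize, renameRef, and reserialize are names invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a toy stand-in: anything that can write itself back out as
// stream text.
type Event interface {
	Serialize() string
}

// Reset models a "reset REF" line; Passthrough carries any other line intact.
type Reset struct{ Ref string }

type Passthrough struct{ Text string }

func (r *Reset) Serialize() string       { return "reset " + r.Ref + "\n" }
func (p *Passthrough) Serialize() string { return p.Text + "\n" }

// Repository is just an ordered list of events in this sketch.
type Repository struct {
	Events []Event
}

// deserialize turns stream text into an event list, one event per line.
func deserialize(stream string) *Repository {
	repo := &Repository{}
	if stream == "" {
		return repo
	}
	for _, line := range strings.Split(strings.TrimSuffix(stream, "\n"), "\n") {
		if strings.HasPrefix(line, "reset ") {
			repo.Events = append(repo.Events, &Reset{Ref: strings.TrimPrefix(line, "reset ")})
		} else {
			repo.Events = append(repo.Events, &Passthrough{Text: line})
		}
	}
	return repo
}

// renameRef is the "operate" step: it edits events in place.
func renameRef(repo *Repository, oldRef, newRef string) {
	for _, ev := range repo.Events {
		if r, ok := ev.(*Reset); ok && r.Ref == oldRef {
			r.Ref = newRef
		}
	}
}

// reserialize writes the event list back out as stream text.
func reserialize(repo *Repository) string {
	var sb strings.Builder
	for _, ev := range repo.Events {
		sb.WriteString(ev.Serialize())
	}
	return sb.String()
}

func main() {
	repo := deserialize("progress starting\nreset refs/heads/master\n")
	renameRef(repo, "refs/heads/master", "refs/heads/main")
	fmt.Print(reserialize(repo))
}
```

The point of the shape is that an untouched event list round-trips unchanged, so an operation only has to worry about the events it edits.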

The rest of the surgical code lives outside the core classes. Most of it lives in the Reposurgeon class (the command interpreter) or the RepositoryList class (which encapsulates by-name access to a list of repositories and also hosts surgical operations involving multiple repositories). A few bits, like the repository reader and builder, have enough logic that’s independent of these classes to be factored out of them.

In designing new commands for the interpreter, try hard to keep them orthogonal to the selection-set code. As often as possible, commands should all have a similar form with a (single) selection set argument.
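One way to picture that uniform shape is a handler signature that takes the repository, a single selection set, and the remaining argument text. The sketch below is hypothetical throughout: SelectionSet, Command, appendText and the commands table are names invented for the example, and the repository is reduced to a list of comment strings.

```go
package main

import "fmt"

// SelectionSet is a hypothetical type: a list of indices into the event list.
type SelectionSet []int

// Repository is reduced here to a list of comment strings, one per event.
type Repository struct{ Comments []string }

// Command is the uniform shape: one repository, one selection set,
// and whatever argument text follows the verb.
type Command func(repo *Repository, sel SelectionSet, args string) error

// appendText adds args to the comment of each selected event. It knows
// nothing about how the selection was computed.
func appendText(repo *Repository, sel SelectionSet, args string) error {
	// Validate the whole selection first so a bad index changes nothing.
	for _, i := range sel {
		if i < 0 || i >= len(repo.Comments) {
			return fmt.Errorf("event index %d out of range", i)
		}
	}
	for _, i := range sel {
		repo.Comments[i] += args
	}
	return nil
}

// commands maps verbs to handlers; a new command is one more entry.
var commands = map[string]Command{"append": appendText}

func main() {
	repo := &Repository{Comments: []string{"first", "second", "third"}}
	if err := commands["append"](repo, SelectionSet{0, 2}, " [reviewed]"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(repo.Comments)
}
```

Because the handler receives an already-evaluated selection, the selection-expression syntax can grow without any command needing to change.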

VCS is not a core class. The code for manipulating actual repos is bolted onto the ends of the pipeline, like this:

 +--------+    +-------------+    +---------+    +-----------+    +--------+
 | Import |--->| Deserialize |--->| Operate |--->| Serialize |--->| Export |
 +--------+    +-------------+ A  +---------+    +-----------+    +--------+
      +-----------+            |
      | Extractor |------------+
      +-----------+

The Import and Export boxes call methods in VCS.

B.2. extractor.go

Extractor classes build the deserialized internal representation directly. Each extractor class is a set of VCS-specific methods to be used by the RepoStreamer driver class. Key detail: when a repository is recognized by an extractor it sets the repository type to point to the corresponding VCS instance.
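The division of labor between an extractor and its driver might look roughly like the following. This is a sketch under stated assumptions: the method set on Extractor (Recognizes, VCS, Revisions, Comment), the Extract method, and the toy fooExtractor are all invented for illustration and do not reproduce the declarations in extractor.go.

```go
package main

import "fmt"

// VCS is a hypothetical descriptor for one version-control system.
type VCS struct{ Name string }

// Repository here records only its type and the commit comments gathered.
type Repository struct {
	Type    *VCS
	Commits []string
}

// Extractor is an assumed interface: the VCS-specific half of the job.
type Extractor interface {
	Recognizes(dir string) bool // does dir hold a repository of this kind?
	VCS() *VCS                  // the VCS instance this extractor serves
	Revisions() []string        // revision IDs in commit order
	Comment(rev string) string  // commit comment for a revision
}

// RepoStreamer is the VCS-independent driver.
type RepoStreamer struct{ extractor Extractor }

// Extract builds the internal representation directly, with no
// intermediate import stream.
func (rs *RepoStreamer) Extract(dir string) (*Repository, error) {
	if !rs.extractor.Recognizes(dir) {
		return nil, fmt.Errorf("%s: not a recognized repository", dir)
	}
	// Key detail from the text: recognition sets the repository type.
	repo := &Repository{Type: rs.extractor.VCS()}
	for _, rev := range rs.extractor.Revisions() {
		repo.Commits = append(repo.Commits, rs.extractor.Comment(rev))
	}
	return repo, nil
}

// fooExtractor is a toy in-memory extractor for an imaginary FooVCS.
type fooExtractor struct{ vcs *VCS }

func (f *fooExtractor) Recognizes(dir string) bool { return dir == "foo-repo" }
func (f *fooExtractor) VCS() *VCS                  { return f.vcs }
func (f *fooExtractor) Revisions() []string        { return []string{"r1", "r2"} }
func (f *fooExtractor) Comment(rev string) string  { return "comment for " + rev }

func main() {
	rs := &RepoStreamer{extractor: &fooExtractor{vcs: &VCS{Name: "foo"}}}
	repo, err := rs.Extract("foo-repo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(repo.Type.Name, repo.Commits)
}
```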

B.3. reposurgeon.go

All code that knows about the DSL syntax should live in reposurgeon.go along with the program main and the functions for reporting errors, logging, handling signals and aborts, etc.

B.4. svnread.go

This is the reader for Subversion dumpfiles. It is the only exception to the rule that read support for version control systems is implemented by front ends that read them and emit a fast-import stream.

The reason it’s an exception is that Subversion has its own serialization format, and the total complexity of embedding support for those streams was estimated to be lower than that of writing a completely separate front end.

B.5. Style notes

The code was translated from Python. It retains, for internal documentation purposes, the Python convention of using leading underscores on field names to flag fields that should never be referenced outside a method of the associated struct.
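As a small illustration of the underscore convention, here is a hypothetical struct (not one from the codebase) with a memoized field that only its own methods touch. The names Commit, fileops, _pathCache, AddFileop and Paths are assumptions made for the example. Note that Go does not enforce the convention; within a package any code can still reach the field.

```go
package main

import "fmt"

// Commit is a simplified illustration, not the real declaration.
type Commit struct {
	fileops []string
	// _pathCache memoizes Paths(). The leading underscore marks it as
	// off limits outside Commit's own methods; the compiler does not
	// enforce this, so it is documentation only.
	_pathCache []string
}

// AddFileop records a touched path and invalidates the cache.
func (c *Commit) AddFileop(path string) {
	c.fileops = append(c.fileops, path)
	c._pathCache = nil
}

// Paths returns the paths touched by the commit, rebuilding the cache
// only when it has been invalidated.
func (c *Commit) Paths() []string {
	if c._pathCache == nil {
		c._pathCache = append([]string{}, c.fileops...)
	}
	return c._pathCache
}

func main() {
	c := &Commit{}
	c.AddFileop("README")
	c.AddFileop("Makefile")
	fmt.Println(c.Paths())
}
```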

The capitalization of other fieldnames looks inconsistent because the code tries to retain the lowercase Python names and compartmentalize as much as possible to be visible only within the declaring package. Some fields are capitalized for backwards compatibility with the setfield command in the Python implementation, others (like some members of FileOp) because there’s an internal requirement that they be settable by the Go reflection primitives.

Appendix C: Adding support for more version-control systems

The best way to add support for a version-control system not already on the list is to write a pair of foo-fast-export and foo-fast-import utilities (separate from reposurgeon) that generate and consume git fast-import streams. When this is achievable, it enables full read/write support for repositories of that type. In this case the supporting changes in reposurgeon will be trivial, just a pair of table entries.
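To give a feel for how small such a change is, here is a sketch of what a table-driven description of a VCS could look like. The struct layout, field names, vcsTable, findVCS and Support are assumptions chosen for illustration, not the real reposurgeon declarations; "foo" and "bar" are imaginary systems.

```go
package main

import "fmt"

// VCS is a hypothetical table row; the field names are assumptions.
type VCS struct {
	Name     string // short name of the system
	Subdir   string // metadata directory that identifies a repository
	Exporter string // command that writes a fast-import stream to stdout
	Importer string // command that reads a fast-import stream from stdin
}

// vcsTable holds one row per supported system. Supporting FooVCS means
// filling in its exporter and importer entries.
var vcsTable = []VCS{
	{Name: "foo", Subdir: ".foo", Exporter: "foo-fast-export", Importer: "foo-fast-import"},
	{Name: "bar", Subdir: ".bar", Exporter: "bar-fast-export"}, // no importer available
}

// findVCS looks a system up by name.
func findVCS(name string) (*VCS, bool) {
	for i := range vcsTable {
		if vcsTable[i].Name == name {
			return &vcsTable[i], true
		}
	}
	return nil, false
}

// Support reports the level of support implied by which entries are present.
func (v *VCS) Support() string {
	switch {
	case v.Exporter != "" && v.Importer != "":
		return "read/write"
	case v.Exporter != "":
		return "read-only"
	case v.Importer != "":
		return "write-only"
	default:
		return "none"
	}
}

func main() {
	for _, name := range []string{"foo", "bar"} {
		if v, ok := findVCS(name); ok {
			fmt.Printf("%s: %s\n", v.Name, v.Support())
		}
	}
}
```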

The next best route is to write a FooExtractor class in reposurgeon itself. This is less good because (a) it provides only read-side support, and (b) it adds complexity to reposurgeon. There’s also a filter derived from testing requirements: a command-line client for your FooVCS must be freely available and run under Unix so that the reposurgeon maintainers can run tests on it. We are not willing to ship features we can’t test.

Finally, if your VCS supports a native serialization format that it can use as a dump/restore for live repositories, and has one or both of a pair of foo-dump/foo-load utilities analogous to git-fast-export and git-fast-import, it may be possible to support your FooVCS through that format. Subversion’s svnadmin dump/load commands fit this pattern.

In this case it is still best to try to write filters that interconvert between the native serialization and git-fast-import streams, separately from reposurgeon. This makes the testing problem more tractable, and means that reposurgeon itself needs only a couple of additional table entries calling simple pipelines.

As a last resort, the reposurgeon maintainers may consider adding support for reading and writing a native serialization format to reposurgeon itself. So far this has only been done once, for Subversion, and there is an important precondition; the serialization format must have complete public documentation.

Be aware that proprietary VCSes in general are likely to cause us serious testing problems and we are reluctant to try to support them. If a maintainer has to pay money to have binaries he or she can run tests with, you will have to pay a maintainer money to make that happen.

It’s also basically a crash landing if your FooVCS can only be accessed through a GUI, or its clients only run on Windows, or it has a CLI that is not capable enough to support an extractor class. We know of cases where proprietary VCS vendors have deliberately crippled their export and CLI features in order to lock customers in; that is no fun to deal with, so you’ll have to pay somebody money.

Appendix D: Reposurgeon success stories

Reposurgeon has been used for successful conversion on projects including but not limited to the following. These are in rough chronological order.

Hercules (IBM mainframe emulator)

The author did this one, Subversion to hg. About ten years of history at the time, not too horribly messy.

NUT (Network UPS Tools)

The author did this one, Subversion to git. The trial by fire - it was when the Subversion dump analyzer got built. Very large old repository with lots of pathologies (there was a CVS stratum).

Battle For Wesnoth

The author did this one, Subversion to git. Very large repo, moderately complex.

Roundup (issue tracker)

The author did this one, Subversion to git (they later switched to hg). Moderate-sized Subversion repo with some very strange malformations.

robotfindskitten

The author did this one, CVS to git. Simple history, pretty easy.

Blender

Two guys at Blender did this one with help from the author, Subversion to git. Huge repository with a lot of nasty pathologies. The tool needed some serious optimization and feature upgrades to handle it.

groff

The author did this one, CVS to git. Rather easy as the project history was almost linear and, though very old, not huge.

Nethack

CVS to git. This conversion has not yet been publicly released at time of writing (late October 2014) for complicated political reasons.

Emacs

A record three layers, Bazaar over CVS over RCS. Malformations not too bad except for some unique challenges created by the RCS-to-CVS conversion, but the sheer size of the history and number of layers makes it the most complex conversion yet. Converted in 2011.

ntp

The author did this one, BitKeeper to git using a derivative of Tridge’s SourcePuller as a front end, in early 2015. Nothing especially taxing about the reposurgeon side of things; the magic was all in the front end.

pdfrw, playtag, pyeda, rson

Four small Subversion projects by Patrick Maupin, converted in two hours' work in May 2015. No significant difficulties. These mainly served to demonstrate that the standard conversion workflow in conversion.mk is fast and effective for a wide range of projects.

mh-e

The Emacs interface for MH. Converted by Bill Wohler in late 2015. He reports that the standard conversion workflow worked fine.

GNUPLOT

CVS to git, 30 years of history with some early releases recovered from tarballs. Converted by the author in late 2017. Somewhat messy due to vendor-branch issues.

GCC

SVN to git, with ancient strata of CVS and RCS. 280K commits of history back to 1987, dwarfing Emacs. Converted by the author and two core GCC developers. The 4.0 release came out of this. Final cutover was on Jan 12th 2020.

Here are some other field reports on successful uses:

Appendix E: Development History

Links to notable blog posts during the development of reposurgeon. Trivial release announcements have been omitted.

Cometary Contributors (2016-01-10)

30 Days in the Hole (2020-01-24)

Two graceful finishes (2020-05-13)