Name

comparator, filterator — fast location of common code in large source trees

Synopsis

comparator [-c] [-C] [-d dir] [-h] [-m minsize] [-n] [-N normalization-spec] [-o file] [-s shredsize] [-v] [-w] [-x] source-tree-path...

filterator [-d dir] [-f regexp] [-F filelist] [-m minsize] [-n]

comparator.py

DESCRIPTION

comparator is a program for quickly finding common sections in two or more source-code trees. It's named after a now-obsolete astronomical instrument called a blink comparator that was used to find moving objects in starfield photos. These reports tend to be noisy; filterator is a postprocessor for comparator reports that can apply significance filtering. comparator.py is a Python module for interpreting comparator reports; filterator uses it.

How comparator works

comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing a hash of each shred. The resulting list of shreds is examined and all unique hashes are thrown out. Then comparator generates a report listing all cliques of lines with duplicate hashes, with overlapping ranges merged. The output format can (not by coincidence) be stepped through with the next-error function of Emacs.
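The shredding step can be sketched in Python. Here md5 stands in for comparator's custom RXOR hash, the three-line window matches the default shred size, and two lists of lines stand in for whole trees; this is an illustration of the technique, not the actual C implementation:

```python
import hashlib
from collections import defaultdict

def shreds(lines, shredsize=3):
    """Yield (start_line, hash) for each overlapping shred of shredsize lines."""
    for i in range(len(lines) - shredsize + 1):
        chunk = "\n".join(lines[i:i + shredsize])
        yield i + 1, hashlib.md5(chunk.encode()).hexdigest()

def common_shreds(tree_a, tree_b, shredsize=3):
    """Map each hash seen in both inputs to its start lines in each.
    Unique hashes (those seen in only one input) are thrown out."""
    seen = defaultdict(lambda: ([], []))
    for which, lines in enumerate((tree_a, tree_b)):
        for lineno, h in shreds(lines, shredsize):
            seen[h][which].append(lineno)
    return {h: locs for h, locs in seen.items() if locs[0] and locs[1]}
```

The real program then merges overlapping matched ranges into the cliques reported.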

A consequence of the method is that comparator will find common code segments wherever they are. It is insensitive to rearrangements of the source trees. With appropriate options, code can be normalized at input to ignore differences in whitespace and (in C code) brace placement. An option to strip comments also exists.

The program is capable of generating a hash list for a source code tree which can be used in place of the tree itself. Thus, it is possible to do comparisons with source trees without having access to the actual source code, providing someone who has access is willing to ship you a hash list. These hash lists are called ‘SCF files’ and made with the extension .scf; the name stands for Source Comparison Format.

In its normal mode of operation, comparator expects two or more command-line arguments which are the root directories of the source-code trees to be compared, and generates a report to standard output. Progress messages are emitted to standard error. Some arguments may be the names of SCF files.

When given a single path argument which is a tree, the program generates an SCF hash list to standard output.

When the -c option is specified, the program generates an SCF file from each tree specified on the command line. The name of the resulting file is the argument name with .scf appended.

The -o option directs output to a specified file. This may be used with -c to override the normal .scf convention, or simply as an alternative to output redirection when generating a final report. One advantage of -o is that the program won't create the output file until it is ready to write; thus, unlike shell redirects, it won't leave an empty file lying around if it aborts early.

The -d option changes current directory to the specified tree before walking down each argument path to generate hashes. This is useful if you want to generate a report for trees in a directory other than your current one, but want the filenames in the report to be relative.

The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.

The -m option sets the minimum-sized span of lines that will be output. By default this is zero; setting it to a value higher than the shred size will eliminate a lot of junk from the report. These are two separate parameters because shred size controls how likely a common segment is to go unrecognized because of unshared material near it, while minimum span size defines the smallest features we want to see in the output report.

Normally, comparator performs significance filtering before emitting a span into the common-segment report. The -n option suppresses this, emitting all common spans. See the discussion of significance filtering below.

The option -v enables progress and timing messages to standard error.

The -x option enables some debugging output. It is for developers; if you need to understand it, read the code.

comparator uses a grotty heuristic for figuring out which files in a tree are eligible to be compared. Files with .c and .h extensions are always eligible. Files with .o or ~ extensions are always ignored. CVS, RCS, SCCS, and .svn directories are ignored. Other files are checked to see if the percentage of printable characters near the front of the file is above 90%.
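A rough sketch of that heuristic in Python, assuming a 512-byte front window for the printable-character check (the window size and other details of the real C implementation may differ):

```python
import string

IGNORED_DIRS = {"CVS", "RCS", "SCCS", ".svn"}
PRINTABLE = set(string.printable.encode())

def eligible(path, data):
    """Decide whether a file's contents look comparable.
    path is the filename, data its contents as bytes."""
    if path.endswith((".c", ".h")):
        return True                      # C sources always eligible
    if path.endswith((".o", "~")):
        return False                     # objects and backups always ignored
    window = data[:512]                  # assumed window size
    if not window:
        return False
    printable = sum(1 for b in window if b in PRINTABLE)
    return printable / len(window) > 0.90
```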

Normalization specifications

The -N option passes in its argument as a normalization specification. A normalization specification consists of a comma-separated list of tokens. The first token names a normalizer, that is, a set of rules for transforming code into a canonical form. Subsequent tokens (if present) are options to the normalizer.

The line-oriented normalizer

The default normalizer is named by the leading token line-oriented. It has the following options:

The remove-whitespace option causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct).

The remove-braces option strips { and } characters. This is recommended for comparing C code; it prevents the comparison from being fooled by differences in indent style.

The remove-comments option removes comments. In C files, all // comments are removed, as are "winged" /* comments with the start and end of the comment on the same line; block /* comments are retained, but the opening /* and closing */ are ignored. In other files, all # comments are removed.
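The three options can be sketched as a per-line transformation (assumed behavior for illustration, not the actual C implementation):

```python
import re

def normalize(line, remove_whitespace=False, remove_braces=False,
              remove_comments=False, c_file=True):
    """Apply line-oriented normalizer options to one line of source."""
    if remove_comments:
        if c_file:
            line = re.sub(r"//.*", "", line)            # strip // comments
            line = re.sub(r"/\*.*?\*/", "", line)       # strip winged /* ... */
            line = line.replace("/*", "").replace("*/", "")  # ignore block markers
        else:
            line = re.sub(r"#.*", "", line)             # strip # comments
    if remove_braces:
        line = line.replace("{", "").replace("}", "")
    if remove_whitespace:
        line = re.sub(r"\s+", "", line)
    return line
```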

Significance filtering

Normally, comparator tries to throw out common segments that are just noise. A segment is considered noise if one of the following conditions holds:

  1. Its file has a .c or .h extension, and the segment consists entirely of whitespace, punctuation, C keywords, C include lines, and other things that are not C identifiers, function names, or user-defined macros.

  2. The file has a .sh extension or a #! first line directing shell interpretation, and the segment consists entirely of shell keywords and punctuation.

Significance filtering can be disabled with the -n option.
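Condition 1 above can be sketched as a token scan over the segment; the keyword list here is abbreviated and the real filter is more thorough:

```python
import re

C_KEYWORDS = {
    "if", "else", "for", "while", "return", "break", "continue", "switch",
    "case", "default", "int", "char", "long", "void", "static", "struct",
}

def is_noise(segment_lines):
    """True if a C segment contains nothing but include lines, punctuation,
    and C keywords -- i.e., no identifiers a programmer actually named."""
    for line in segment_lines:
        if re.match(r"\s*#\s*include", line):
            continue                      # include lines are insignificant
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
            if token not in C_KEYWORDS:
                return False              # a real identifier: significant
    return True
```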

filterator

filterator is a postprocessor for the SCF-B output of comparator that produces an actual code listing.

The -d option acts as it does for comparator.

The -m option sets the minimum size of overlap to be reported. Setting this to a slightly larger number than the comparator shred size is a useful way to filter out spurious matches; thus, while comparator defaults to a shred size of 3, filterator defaults to a minimum size of 5.

The -f option takes a Python regexp as argument and throws out all cliques for which no associated filename matches the regexp. This might be useful if you want to see a common-segment report only for files with a particular extension such as .h or .S.

The -F option takes an argument which is a file containing a list of filenames. Any clique that does not contain one of these filenames is discarded.

The -n option suppresses code listing, so that an SCF-B output is produced instead. You can use this to filter common-segment reports with -m, -f, and -F options of the tool and then hand-filter them for junk before generating a listing.

Each filterator output line beginning with % gives the filename from which the following common span of lines is displayed. filterator displays each span of lines, unnormalized, from the first file in the clique of matches that it can find and open.
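A minimal sketch of a consumer for this listing format, assuming only the %-prefixed filename convention described above:

```python
def parse_listing(lines):
    """Group a filterator code listing into (filename, span_lines) pairs,
    splitting on the '%'-prefixed filename lines."""
    spans, current, fname = [], [], None
    for line in lines:
        if line.startswith("%"):
            if fname is not None:
                spans.append((fname, current))
            fname, current = line[1:].strip(), []
        else:
            current.append(line)
    if fname is not None:
        spans.append((fname, current))
    return spans
```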

Occasionally the filterator output looks as though the code has failed to properly merge two or more overlapping line ranges. When this happens, the ranges have different match sets; they may overlap in one tree, but not in others. A clue to this is the number of matches reported for each range.

comparator.py

comparator.py is a Python module that parses and interprets the format emitted by comparator. It can be used to write report generators. The filterator program is an example; it is a trivial wrapper around comparator.py.

AUTHOR

Eric S. Raymond. Ronald Rivest suggested the custom RXOR hash function used in versions 2.0 and up. See ESR's home page at http://www.catb.org/~esr/ for updates and other resources.

LIMITATIONS

comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law). Normalizing the code in excessively clever ways would trespass into the territory of ideas and tend to produce false positives.

The heuristic for eligible files can be fooled, though this is unlikely.

The use of hashes means there is a very small probability of false-positive matches between shreds; to be precise, with the default RXOR hash you can expect the first false positive after accumulating 6,074,000,999 shreds. For even very large forests, such as collections of Linux and Unix source trees running into the tens of millions of lines, you are very unlikely to ever see this. There is no possibility of false negatives.

The design of this program accepts some size limits to get high performance in the typical case. The RXOR hash function's performance will degrade, possibly leading to false positives, on shreds of 512 or more characters. Also, comparison will handle a maximum of 65536 lines per file (you will be warned if this limit is exceeded). A full integer line-number range would increase working-set size by 33%.

You will get spurious results if you mix .scf files with different hash methods, shred sizes or normalizations, or with settings for these that conflict with the command-line options you have in effect for shredding source-code trees. The program does some sanity checking but will not prevent you from shooting yourself in the foot.

The implementation is O(n log n) in the size of the code trees and O(n^2) in the number of common segments. The code for finding duplicate hashes uses a brute-force approach that relies on being able to allocate core for the entire hash list generated during processing. The results are almost ridiculously fast on machines with enough core (the author has observed speeds of up to two million lines of code compared per minute on a 1.8GHz Intel box), but machines with small memories will thrash their guts out as the VM subsystem tries to cope.

Comparing a directory to itself will show no similarities. This is because the algorithm ignores matching shreds from within a single directory, and it only knows which directory a shred came from by relative pathname.