Synopsis
comparator [-c] [-C] [-d dir] [-h] [-m minsize] [-n] [-N normalization-spec] [-o file] [-s shredsize] [-v] [-w] [-x] source-tree-path…
filterator [-d dir] [-m minsize]
comparator.py
Description
comparator is a program for quickly finding common sections in two
or more source-code trees. It is named after a now-obsolete
astronomical instrument called a blink comparator that was used to
find moving objects in starfield photos. These reports tend to be
noisy; filterator is a postprocessor for comparator reports that
can apply significance filtering. comparator.py is a Python module
for interpreting comparator reports; filterator uses it.
How comparator works
comparator works by first chopping the specified trees into
overlapping shreds (by default 3 lines long) and computing a hash of
each shred. The resulting list of shreds is examined and all unique
hashes are thrown out. Then comparator generates a report listing
all cliques of lines with duplicate hashes, with overlapping ranges
merged. The output format can (not by coincidence) be stepped through
with the next-error function of Emacs.
A consequence of the method is that comparator will find common code
segments wherever they are. It is insensitive to rearrangements of the
source trees. With appropriate options, code can be normalized at
input to ignore differences in whitespace and (in C code) brace
placement. An option to strip comments also exists.
The program is capable of generating a hash list for a source code
tree which can be used in place of the tree itself. Thus, it is
possible to do comparisons with source trees without having access to
the actual source code, providing someone who has access is willing to
ship you a hash list. These hash lists are called 'SCF files' and made
with the extension .scf; the name stands for Source Comparison
Format.
In its normal mode of operation, comparator expects two or more
command-line arguments which are the root directories of the
source-code trees to be compared, and generates a report to standard
output. Progress messages are emitted to standard error. Some
arguments may be the names of SCF files.
When given a single path argument which is a tree, the program generates an SCF hash list to standard output.
When the -c option is specified, the program generates a SCF file
from each tree specified on the command line. The name of the
resulting file is the argument name with .scf appended.
The -o option directs output to a specified file. This may be used
with -c to override the normal .scf convention, or simply as an
alternative to output redirection when generating a final report. One
advantage of -o is that the program will not create the output file
until it is ready to write; thus, unlike shell redirects, it will not
leave an empty file lying around if it aborts early.
The -d option changes current directory to the specified tree before
walking down each argument path to generate hashes. This will be
useful if you want to generate a report for trees in a directory other
than your current, but want the filenames in the report to be
relative.
The -s option changes the shred size. Smaller shred sizes are more
sensitive to small code duplications, but produce correspondingly
noisier output. Larger ones will suppress both noise and small
similarities.
The -m option sets the minimum-sized span of lines that will be
output. By default this is zero; setting it to a value higher than the
shred size will eliminate a lot of junk from the report. These are two
separate parameters because shred size controls how likely common
segments are to not get recognized because of unshared things near
them, while minimum span size defines the smallest features we want to
see in the output report.
Normally, comparator performs significance filtering before emitting
a span into the common-segment report. The -n option suppresses
this, emitting all common spans. See the discussion of significance
filtering below.
The -v option enables progress and timing messages to standard error.
The -x option enables some debugging output. It is for developers; if you need to understand it, read the code.
comparator uses a heuristic for figuring out which files in a tree
are eligible to be compared. Files with .c and .h extensions are
always eligible. Files with .o or ~ extensions are always
ignored. CVS, RCS, SCCS, and .svn directories are ignored. Other
files are checked to see if the percentage of printable characters
near the front of the file is above 90%.
Normalization specifications
The -N option passes in its argument as a normalization
specification. A normalization specification consists of a
comma-separated list of tokens. The first token names a normalizer,
e.g., a set of rules for transforming code into a canonical
form. Subsequent tokens (if present) are options to the normalizer.
The line-oriented normalizer
The default normalizer is named by the leading token line-oriented. It has the following options:
remove-whitespace-
All whitespace in the file (including blank lines) is ignored for comparison purposes (line numbers in the output report will nevertheless be correct).
remove-braces-
Remove
{and}. This is recommended for comparing C code; it prevents the comparison from being fooled by differences in indent style. remove-comments-
Remove comments in C files. All
//comments are removed, but only "winged"/* commentswith the start and end of comment on the same line are ignored; also, block/* commentsare retained, but the beginning/and ending/are ignored. In other files, all#comments are removed.
Significance filtering
Normally, comparator tries to throw out common segments that are
just noise. A segment is considered noise if one of the following
conditions holds:
-
Its file has a
.cor.hextension, and the segment consists entirely of whitespace, punctuation, C keywords, C include lines, and other things that are not C identifiers, function names, or user-defined macros. -
The file has a
.shextension or a#!first line directing shell interpretation, and the segment consists entirely of shell keywords and punctuation.
Significance filtering can be disabled with the -n option.
filterator
filterator is a postprocessor for the SCF-B output of comparator
that produces an actual code listing.
The -d option acts as it does for comparator.
The -m option sets the minimum size of overlap to be
reported. Setting this to a slightly larger number than the
comparator shred size is a useful way to filter out spurious
matches; thus, while comparator defaults to a shred size of 3,
filterator defaults to a minimum size of 5.
The -f option takes a Python regexp as argument and throws out all
cliques for which no associated filename matches the regexp. This
might be useful if you want to see a common-segment report only for
files with a particular extension such as .h or .S.
The -F option takes an argument which is a file containing a list of
filenames. Any clique that does not contain one of these filenames is
discarded.
The -n option suppresses code listing, so that an SCF-B output is
produced instead. You can use this to filter common-segment reports
with -m, -f, and -F options of the tool and then hand-filter
them for junk before generating a listing.
Each filterator line beginning with % is the filename from which
the following common span of lines is displayed. filterator displays
each span of lines, unnormalized, from the first file in the clique of
matches that it can find and open.
Occasionally the filterator output looks as though the code has
failed to properly merge two or more overlapping line ranges. When
this happens, the ranges have different match sets; they may overlap
in one tree, but not in others. A clue to this is the number of
matches reported for each range.
comparator.py
comparator.py is a Python module that parses and interprets the
format emitted by comparator. It can be used to write report
generators. The filterator program is an example; it is a trivial
wrapper around comparator.py.
Author
Eric S. Raymond <esr@snark.thyrsus.com>. Ronald Rivest suggested the custom RXOR hash function used in versions 2.0 and up. See ESR’s home page at http://www.catb.org/~esr/ for updates and other resources.
Limitations
comparator does not attempt to do semantic analysis and catch
relatively trivial changes like renaming of variables, etc. This is
because comparator is designed not as a tool to detect plagiarism of
ideas (the subject of patent law), but as a tool to detect copying of
the expression of ideas (the subject of copyright law). Normalizing
the code in excessively clever ways would trespass into the territory
of ideas and tend to produce false positives.
The heuristic for eligible files can be fooled, though this is unlikely.
The use of hashes means there is a very small probability of false-positive matches between shreds; to be precise, with the default RXOR hash you can expect the first false positive after accumulating 6,074,000,999 shreds. For even very large forests, such as collections of Linux and Unix source trees running into the tens of millions of lines, you are very unlikely to ever see this. There is no possibility of false negatives.
The design of this program accepts some size limits to get high performance in the typical case. The RXOR hash function’s performance will degrade, possibly leading to false positives, on shreds of 512 or more characters. Also, comparison will handle a maximum of 65536 lines per file (you will be warned if this limit is exceeded). A full integer line-number range would increase working-set size by 33%.
You will get spurious results if you mix .scf files with different
hash methods, shred sizes or normalizations, or with settings for
these that conflict with the command-line options you have in effect
for shredding source-code trees. The program does some sanity checking
but will not prevent you from shooting yourself in the foot.
The implementation is O(n log n) in the size of the code trees and O(n^2) in the number of common segments. The code for finding duplicate hashes uses a brute-force approach that relies on being able to allocate core for the entire hash list generated during processing. The results are almost ridiculously fast on machines with enough core (the author has observed speeds of up to two million lines of code compared per minute on a 1.8GHz Intel box).
Comparing a directory to itself will show no similarities. This is because the algorithm ignores matching shreds from within a single directory, and it only knows which directory a shred came from by relative pathname.