comparator [-c] [-C] [-d dir] [-h] [-m minsize] [-n] [-N normalization-spec] [-o file] [-s shredsize] [-v] [-w] [-x] source-tree-path...
filterator [-d dir] [-f regexp] [-F filelist] [-m minsize] [-n]
comparator.py
comparator is a program for quickly finding common sections in two or more source-code trees. It's named after a now-obsolete astronomical instrument called a blink comparator that was used to find moving objects in starfield photos. comparator's reports tend to be noisy; filterator is a postprocessor for them that can apply significance filtering. comparator.py is a Python module for interpreting comparator reports; filterator uses it.
comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing a hash of each shred. The resulting list of shreds is examined and all unique hashes are thrown out. Then comparator generates a report listing all cliques of lines with duplicate hashes, with overlapping ranges merged. The output format can (not by coincidence) be stepped through with the next-error function of Emacs.
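The shred-and-hash idea can be illustrated with a short Python sketch. This is an illustration only, not the program's implementation: comparator is written in C, uses its own RXOR hash rather than SHA-1, and also merges overlapping ranges before reporting.

    import hashlib
    from collections import defaultdict

    SHRED_SIZE = 3   # comparator's default; changed with -s

    def shreds(lines):
        """Yield (first line number, hash) for every overlapping shred."""
        for i in range(len(lines) - SHRED_SIZE + 1):
            text = "".join(lines[i:i + SHRED_SIZE])
            yield i + 1, hashlib.sha1(text.encode()).hexdigest()

    def common_hashes(trees):
        """trees maps a tree name to {filename: list of lines}.  Return only
        the hashes that occur in more than one tree, with their locations."""
        where = defaultdict(list)
        for tree, files in trees.items():
            for filename, lines in files.items():
                for lineno, digest in shreds(lines):
                    where[digest].append((tree, filename, lineno))
        return {h: places for h, places in where.items()
                if len({t for t, _, _ in places}) > 1}

Everything hashed only once is dropped; what remains is the raw material for the report.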
A consequence of the method is that comparator will find common code segments wherever they are. It is insensitive to rearrangements of the source trees. With appropriate options, code can be normalized at input to ignore differences in whitespace and (in C code) brace placement. An option to strip comments also exists.
The program is capable of generating a hash list for a source
code tree which can be used in place of the tree itself. Thus, it is
possible to do comparisons with source trees without having access to
the actual source code, provided someone who has access is willing to ship you a hash list. These hash lists are called ‘SCF files’ and are given the extension .scf; the name stands for Source Comparison Format.
In its normal mode of operation, comparator expects two or more command-line arguments which are the root directories of the source-code trees to be compared, and generates a report to standard output. Progress messages are emitted to standard error. Some arguments may be the names of SCF files.
When given a single path argument which is a tree, the program generates an SCF hash list to standard output.
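For example (the tree names here are placeholders):

    comparator old-tree new-tree >report.txt
    comparator some-tree >some-tree.scf

The first form compares the two trees and writes the common-segment report to report.txt; the second, given a single tree, writes an SCF hash list to standard output, here redirected into a file.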
When the -c
option is specified, the program
generates an SCF file from each tree specified on the command
line. The name of the resulting file is the argument name with
.scf
appended.
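A hypothetical session using SCF files might look like this:

    comparator -c their-tree
    comparator their-tree.scf my-tree >report.txt

The first command writes their-tree.scf; the second compares that hash list against a local tree, since SCF files may be mixed with tree paths on the command line.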
The -o
option directs output to a specified file. This
may be used with -c
to override the normal
.scf
convention, or simply as an alternative
to output redirection when generating a final report. One advantage
of -o
is that the program won't create the output
file until it is ready to write; thus, unlike shell redirects, it
won't leave an empty file lying around if it aborts early.
The -d
option changes current directory to the
specified tree before walking down each argument path to generate
hashes. This is useful if you want to generate a report for trees in a directory other than your current one, but want the filenames
in the report to be relative.
The -s
option changes the shred size. Smaller
shred sizes are more sensitive to small code duplications, but produce
correspondingly noisier output. Larger ones will suppress both noise
and small similarities.
The -m
option sets the minimum-sized span
of lines that will be output. By default this is zero; setting it
to a value higher than the shred size will eliminate a lot of junk
from the report. These are two separate parameters because shred size controls how likely common segments are to go unrecognized because of unshared lines near them, while minimum span size defines the
smallest features we want to see in the output report.
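For instance, an invocation such as

    comparator -s 4 -m 10 tree-a tree-b

(the tree names are placeholders) shreds the trees four lines at a time but reports only merged spans at least ten lines long.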
Normally, comparator performs
significance filtering before emitting a span into the common-segment
report. The -n option suppresses this, emitting all common
spans. See the discussion of significance filtering below.
The option -v
enables progress and timing
messages to standard error.
The -x option enables some debugging output. It
is for developers; if you need to understand it, read the code.
comparator uses a grotty heuristic
for figuring out which files in a tree are eligible to be compared.
Files with .c and .h extensions are always eligible. Files with .o or ~ extensions are always ignored. CVS, RCS, SCCS, and .svn directories are ignored. Other files are checked to see if the
percentage of printable characters near the front of the file is above
90%.
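The heuristic can be approximated in a few lines of Python. This is a paraphrase of the rules just stated, not the program's actual code, and the size of the "near the front" window is a guess:

    import os, string

    PRINTABLE = set(string.printable.encode())

    def eligible(path):
        """Rough approximation of comparator's file-eligibility heuristic."""
        if any(part in ("CVS", "RCS", "SCCS", ".svn") for part in path.split(os.sep)):
            return False                 # version-control directories are skipped
        if path.endswith((".c", ".h")):
            return True                  # C sources are always eligible
        if path.endswith((".o", "~")):
            return False                 # object files and editor backups never are
        with open(path, "rb") as fp:
            head = fp.read(1024)         # "near the front" -- window size is a guess
        if not head:
            return False
        return sum(b in PRINTABLE for b in head) / len(head) > 0.90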
The -N option passes in its argument as a normalization specification. A normalization specification consists of a comma-separated list of tokens. The first token names a normalizer, i.e., a set of rules for transforming code into a canonical form. Subsequent tokens (if present) are options to the normalizer; an example specification is sketched after the option descriptions below.
The default normalizer is named by the leading token line-oriented. It has the following options:
The remove-whitespace option causes all whitespace in the file
(including blank lines) to be ignored for comparison purposes (line
numbers in the output report will nevertheless be correct).
The remove-braces
option strips { and }
characters. This is recommended for comparing C code; it prevents the
comparison from being fooled by differences in indent style.
The remove-comments option removes comments in C files. All // comments are removed, and "winged" /* comments with the start and end of the comment on the same line are ignored; the bodies of block /* comments are retained, but the opening /* and closing */ are ignored. In other files, all # comments are removed.
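When comparing C trees you might therefore use a specification such as -N line-oriented,remove-whitespace,remove-braces,remove-comments. The following Python fragment gives a rough, line-at-a-time rendering of what those three options do; the real normalizer works on whole files (it also drops blank lines under remove-whitespace) and its exact rules may differ:

    import re

    def normalize_c_line(line):
        """Roughly what remove-comments, remove-braces and remove-whitespace
        do to one line of C; illustrative only.  Note: this naive version does
        not protect // or /* sequences inside string literals."""
        line = re.sub(r"/\*.*?\*/", "", line)             # drop "winged" /* ... */ comments
        line = re.sub(r"//.*", "", line)                  # drop // comments
        line = line.replace("/*", "").replace("*/", "")   # block comments: keep the body,
                                                          # ignore the delimiters
        line = line.replace("{", "").replace("}", "")     # remove-braces
        return re.sub(r"\s+", "", line)                   # remove-whitespace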
Normally, comparator tries to throw out common segments that are just noise. A segment is considered noise if one of the following conditions holds:
Its file has a .c or .h extension, and the segment consists entirely of whitespace, punctuation, C keywords, C include lines, and other things that are not C identifiers, function names, or user-defined macros.
The file has a .sh extension or a #! first line directing shell interpretation, and the segment consists entirely of shell keywords and punctuation.
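The first of these rules can be caricatured in a few lines of Python; the keyword set below is only a sample, and the real filter's notion of "things that are not C identifiers" is richer than this:

    import re

    C_KEYWORDS = {
        "if", "else", "for", "while", "do", "return", "break", "continue",
        "switch", "case", "default", "int", "char", "long", "short", "signed",
        "unsigned", "float", "double", "void", "struct", "union", "enum",
        "typedef", "static", "extern", "const", "volatile", "sizeof", "goto",
    }

    def is_noise_c(segment_lines):
        """True if a common span contains nothing but whitespace, punctuation,
        C keywords and include lines -- i.e., no real identifiers."""
        for line in segment_lines:
            if re.match(r"\s*#\s*include\b", line):
                continue
            for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
                if word not in C_KEYWORDS:
                    return False   # a real identifier makes the span significant
        return True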
Significance filtering can be disabled with the
-n
option.
filterator is a postprocessor for the SCF-B output of comparator that produces an actual code listing.
The -d option acts as it does for
comparator.
The -m
option sets the minimum size of overlap
to be reported. Setting this to a slightly larger number than the
comparator shred size is a useful way to
filter out spurious matches; thus, while
comparator defaults to a shred size of 3,
filterator defaults to a minimum size of
5.
The -f option takes a Python regexp as argument and throws out all cliques for which no associated filename matches the regexp. This might be useful if you want to see a common-segment report only for files with a particular extension such as .h or .S.
The -F option takes an argument which is a file containing a list of filenames. Any clique that does not contain one of these filenames is discarded.
The -n option suppresses code listing, so that SCF-B output is produced instead. You can use this to filter common-segment reports with the -m, -f, and -F options and then hand-filter them for junk before generating a listing.
Each filterator line beginning with % gives the filename from which the following common span of lines is displayed. filterator displays each span of lines, unnormalized, from the first file in the clique of matches that it can find and open.
Occasionally the filterator output looks as though the code has failed to properly merge two or more overlapping line ranges. When this happens, the ranges have different match sets; they may overlap in one tree, but not in others. A clue to this is the number of matches reported for each range.
Eric S. Raymond <esr@snark.thyrsus.com>. Ronald Rivest suggested the custom RXOR hash function used in versions 2.0 and up. See ESR's home page at http://www.catb.org/~esr/ for updates and other resources.
comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law). Normalizing the code in excessively clever ways would trespass into the territory of ideas and tend to produce false positives.
The heuristic for eligible files can be fooled, though this is unlikely.
The use of hashes means there is a very small probability of false-positive matches between shreds; to be precise, with the default RXOR hash you can expect the first false positive after accumulating 6,074,000,999 shreds. For even very large forests, such as collections of Linux and Unix source trees running into the tens of millions of lines, you are very unlikely to ever see this. There is no possibility of false negatives.
The design of this program accepts some size limits to get high performance in the typical case. The RXOR hash function's performance will degrade, possibly leading to false positives, on shreds of 512 or more characters. Also, comparison will handle a maximum of 65536 lines per file (you will be warned if this limit is exceeded). A full integer line-number range would increase working-set size by 33%.
You will get spurious results if you mix
.scf
files with different hash methods, shred
sizes or normalizations, or with settings for these that conflict with
the command-line options you have in effect for shredding source-code
trees. The program does some sanity checking but will not prevent you
from shooting yourself in the foot.
The implementation is O(n log n) in the size of the code trees and O(n²) in the number of common segments. The code for finding duplicate hashes uses a brute-force approach that relies on being able to allocate core for the entire hash list generated during processing. The results are almost ridiculously fast on machines with enough core (the author has observed speeds of up to two million lines of code compared per minute on a 1.8GHz Intel box), but machines with small memories will thrash their guts out as the VM subsystem tries to cope.
Comparing a directory to itself will show no similarities. This is because the algorithm ignores matching shreds from within a single directory, and it only knows which directory a shred came from by relative pathname.